genomicarrays package¶
Submodules¶
genomicarrays.GenomicArrayDataset module¶
Query the GenomicArrayDataset.
This class provides methods to access the directory containing the
TileDB files, usually generated by build_genomicarray().
Example
from genomicarrays import GenomicArrayDataset

garr = GenomicArrayDataset(dataset_path="/path/to/genomicarray/dir")

# Select the first 10 features (rows) for the first sample (column)
result1 = garr[0:10, 0]
print(result1)
- class genomicarrays.GenomicArrayDataset.GenomicArrayDataset(dataset_path, matrix_tdb_uri='coverage', feature_annotation_uri='feature_annotation', sample_metadata_uri='sample_metadata')[source]¶
Bases:
object
A class that represents a collection of features and their associated coverage in a TileDB-backed store.
- __getitem__(args)[source]¶
Subset a GenomicArrayDataset. Mostly an alias to get_slice().
- Parameters:
args (Union[int, Sequence, tuple]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be extracted.
Alternatively a tuple of length 1. The first entry specifies the rows (or features) to retain based on their names or indices.
Alternatively a tuple of length 2. The first entry specifies the rows (or features) to retain, while the second entry specifies the columns (or samples) to retain, based on their names or indices.
- Raises:
ValueError – If too many or too few slices are provided.
- Return type:
GenomicArrayDatasetSlice
- Returns:
A GenomicArrayDatasetSlice object containing the sample_metadata, feature_annotation and the matrix.
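The tuple handling described for __getitem__ can be sketched as follows. This is a minimal illustration, not the package's implementation: normalize_index_args is a hypothetical helper, and treating a missing second entry as "select everything" is an assumption.

```python
from typing import Any, Tuple


def normalize_index_args(args: Any) -> Tuple[Any, Any]:
    """Split indexing arguments into (feature_subset, sample_subset).

    Hypothetical helper that only mirrors the documented rules.
    """
    # Assumption: an omitted entry selects the whole dimension.
    everything = slice(None)

    if not isinstance(args, tuple):
        # A bare index, slice, or list addresses the rows (features).
        return args, everything
    if len(args) == 0:
        raise ValueError("Too few slices: expected 1 or 2 entries.")
    if len(args) == 1:
        return args[0], everything
    if len(args) == 2:
        return args[0], args[1]
    raise ValueError("Too many slices: expected 1 or 2 entries.")
```

Under these assumptions, garr[0:10, 0] would resolve to the feature subset slice(0, 10) and the sample subset 0 before being handed to get_slice().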
- __init__(dataset_path, matrix_tdb_uri='coverage', feature_annotation_uri='feature_annotation', sample_metadata_uri='sample_metadata')[source]¶
Initialize a GenomicArrayDataset.
- Parameters:
dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path from build_genomicarray().
matrix_tdb_uri (str) – Relative path to the matrix store.
feature_annotation_uri (str) – Relative path to the feature annotation store.
sample_metadata_uri (str) – Relative path to the sample metadata store.
- get_feature_annotation_column(column_name)[source]¶
Access a column from the feature_annotation store.
- Parameters:
column_name (str) – Name of the column or attribute. Usually one of the column names from get_feature_annotation_columns().
- Return type:
list
- Returns:
A list of values for this column.
- get_feature_annotation_columns()[source]¶
Get annotation column names from the feature_annotation store.
- get_feature_subset(subset, columns=None)[source]¶
Slice the feature_annotation store.
- Parameters:
subset (Union[slice, List[str], QueryCondition]) – A list of integer indices to subset the feature_annotation store.
Alternatively, may provide a tiledb.QueryCondition to query the store.
Alternatively, may provide a list of strings to match with the index of the feature_annotation store.
columns – List of specific column names to access.
Defaults to None, in which case all columns are extracted.
- Return type:
DataFrame
- Returns:
A pandas DataFrame of the subset.
- get_sample_metadata_column(column_name)[source]¶
Access a column from the sample_metadata store.
- Parameters:
column_name (str) – Name of the column or attribute. Usually one of the column names from get_sample_metadata_columns().
- Return type:
list
- Returns:
A list of values for this column.
- get_sample_subset(subset, columns=None)[source]¶
Slice the sample_metadata store.
- Parameters:
- Return type:
DataFrame
- Returns:
A pandas DataFrame of the subset.
- get_slice(feature_subset, sample_subset)[source]¶
Subset a GenomicArrayDataset.
- Parameters:
sample_subset (Union[slice, List[str], QueryCondition]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the columns (or samples) to retain.
feature_subset (Union[slice, int]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the rows (or features/genes) to retain.
- Return type:
GenomicArrayDatasetSlice
- Returns:
A GenomicArrayDatasetSlice object containing the sample_metadata, feature_annotation and the matrix for the given slice ranges.
- property shape¶
genomicarrays.GenomicArrayDatasetSlice module¶
Class that represents a realized subset of the GenomicArrayDataset.
This module provides a data class for slices, usually generated by the access
methods of GenomicArrayDataset.
Example
from genomicarrays import GenomicArrayDataset

garr = GenomicArrayDataset(dataset_path="/path/to/genomicarray/dir")

# Select the first 10 features for the first sample
feature_indices = slice(0, 10)
result1 = garr[feature_indices, 0]
print(result1)
- class genomicarrays.GenomicArrayDatasetSlice.GenomicArrayDatasetSlice(sample_metadata, feature_annotation, matrix)[source]¶
Bases:
object
Class that represents a realized subset of the GenomicArrayDataset.
- __annotations__ = {'feature_annotation': <class 'pandas.core.frame.DataFrame'>, 'matrix': typing.Any, 'sample_metadata': <class 'pandas.core.frame.DataFrame'>}¶
- __dataclass_fields__ = {'feature_annotation': Field(name='feature_annotation',type=<class 'pandas.core.frame.DataFrame'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'matrix': Field(name='matrix',type=typing.Any,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'sample_metadata': Field(name='sample_metadata',type=<class 'pandas.core.frame.DataFrame'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)}¶
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(sample_metadata, feature_annotation, matrix)¶
- __match_args__ = ('sample_metadata', 'feature_annotation', 'matrix')¶
- property shape¶
genomicarrays.build_genomicarray module¶
Build the GenomicArrayDataset.
This module provides tools for converting genomic data from BigWig format to TileDB. It supports parallel processing for handling large collections of genomic datasets.
Example
import tempfile

import numpy as np
import pandas as pd
import pyBigWig as bw
from genomicarrays import build_genomicarray, MatrixOptions

# Create a temporary directory for the output TileDB files
tempdir = tempfile.mkdtemp()

# Read BigWig objects
bw1 = bw.open("path/to/object1.bw", "r")
# or just provide the path
bw2 = "path/to/object2.bw"

# Genomic intervals to extract coverage for
features = pd.DataFrame({
    "seqnames": ["chr1", "chr1"],
    "starts": [1000, 2000],
    "ends": [1500, 2500],
})

# Build GenomicArray
dataset = build_genomicarray(
    features=features,
    output_path=tempdir,
    files=[bw1, bw2],
    # genome_fasta has no default in the signature; placeholder path
    genome_fasta="path/to/genome.fasta",
    matrix_options=MatrixOptions(dtype=np.float32),
)
- genomicarrays.build_genomicarray.build_genomicarray(files, output_path, features, genome_fasta, sample_metadata=None, sample_metadata_options=SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None), matrix_options=MatrixOptions(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4), feature_annotation_options=FeatureAnnotationOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types={'seqnames': 'ascii', 'starts': 'int', 'ends': 'int'}, aggregate_function=None, expected_agg_function_length=1), optimize_tiledb=True, num_threads=1)[source]¶
Generate the GenomicArrayDataset.
All files are expected to be consistent; any modifications needed to make them consistent are outside the scope of this function and package.
- Parameters:
files (list) – List of file paths to BigWig files.
output_path (str) – Path to where the output TileDB files should be stored.
features (Union[str, DataFrame]) – A DataFrame containing the input genomic intervals.
Alternatively, may provide a path to the file containing a list of intervals. In this case, the first row is expected to contain the column names, “seqnames”, “starts” and “ends”.
Additionally, the file may contain a column sequences, to specify the sequence string for each region. Otherwise, provide a path to the FASTA file using the genome_fasta parameter.
genome_fasta (str) – Path to a FASTA file containing the sequence information.
Sequence information will be updated for each region in features if the DataFrame/path does not contain a column sequences.
sample_metadata (Union[DataFrame, str]) – A DataFrame containing the sample metadata for each file in files. Hence the number of rows in the DataFrame must match the number of files.
Alternatively, may provide a path to the file containing the concatenated sample metadata across all BigWig files. In this case, the first row is expected to contain the column names.
Additionally, the rows are expected to be in the same order as the input list of files.
Defaults to None, in which case a simple sample metadata DataFrame is created containing the list of datasets, i.e. one entry per BigWig file. Each dataset is named sample_{i}, where i refers to the index position of the object in files.
sample_metadata_options (SampleMetadataOptions) – Optional parameters when generating the sample_metadata store.
matrix_options (MatrixOptions) – Optional parameters when generating the matrix store.
feature_annotation_options (FeatureAnnotationOptions) – Optional parameters when generating the feature_annotation store.
optimize_tiledb (bool) – Whether to run TileDB’s vacuum and consolidation (may take long).
num_threads (int) – Number of threads. Defaults to 1.
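The input expectations listed above (a header row naming “seqnames”, “starts” and “ends”, one metadata row per file, and sample_{i} default names) can be sketched with the standard library alone. This is an illustration rather than the package's code: the helper names are hypothetical, a comma-delimited file is assumed because the delimiter is not stated, and the index i is assumed to be zero-based.

```python
import csv
import io
from typing import Dict, List, TextIO

REQUIRED_COLUMNS = ("seqnames", "starts", "ends")


def read_intervals(handle: TextIO) -> List[Dict[str, str]]:
    """Read intervals from a file whose first row holds the column names.

    Assumption: the file is comma-delimited.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return list(reader)


def default_sample_names(files: List[str]) -> List[str]:
    """Name each input file sample_{i} by its position in `files`.

    Assumption: positions are counted from zero.
    """
    return [f"sample_{i}" for i in range(len(files))]


def check_sample_metadata(rows: List[dict], files: List[str]) -> None:
    """Require exactly one metadata row per input file."""
    if len(rows) != len(files):
        raise ValueError(
            f"Expected {len(files)} metadata rows, found {len(rows)}."
        )
```

A features file that already carries a sequences column would pass the same header check, since only the three interval columns are required here.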
genomicarrays.build_options module¶
- class genomicarrays.build_options.FeatureAnnotationOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types=None, aggregate_function=None, expected_agg_function_length=1)[source]¶
Bases:
object
Optional arguments for the feature store for build_genomicarray().
- skip¶
Whether to skip generating the feature annotation TileDB. Defaults to False.
- dtype¶
NumPy dtype for the feature dimension. Defaults to np.uint32.
Note: make sure the number of features fits within the integer limits of the chosen dtype.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “feature_annotation”.
- column_types¶
A dictionary containing column names as keys and values representing the type to use in TileDB.
If None, all columns are cast as ‘ascii’.
- aggregate_function¶
A callable to summarize the values in a given interval. The aggregate function is expected to return either a scalar value or a 1-dimensional NumPy ndarray.
Defaults to None.
- expected_agg_function_length¶
Length of the output when an aggregate function is applied to an interval. Defaults to 1, expecting a scalar.
Note: ndarrays will be flattened before writing to TileDB.
- __annotations__ = {'aggregate_function': typing.Optional[typing.Callable], 'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'expected_agg_function_length': <class 'int'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __dataclass_fields__ = {'aggregate_function': Field(name='aggregate_function',type=typing.Optional[typing.Callable],default=None,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'column_types': Field(name='column_types',type=typing.Dict[str, numpy.dtype],default=None,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'dtype': Field(name='dtype',type=<class 'numpy.dtype'>,default=<class 'numpy.uint32'>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'expected_agg_function_length': Field(name='expected_agg_function_length',type=<class 'int'>,default=1,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'skip': Field(name='skip',type=<class 'bool'>,default=False,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'tiledb_store_name': Field(name='tiledb_store_name',type=<class 'str'>,default='feature_annotation',default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)}¶
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types=None, aggregate_function=None, expected_agg_function_length=1)¶
- __match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types', 'aggregate_function', 'expected_agg_function_length')¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
uint32
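The contract between aggregate_function and expected_agg_function_length described for FeatureAnnotationOptions can be illustrated with a small sketch. Plain Python lists stand in for NumPy arrays so the example needs no third-party packages. The helper apply_aggregate is hypothetical, and raising an error on a length mismatch is an assumption about how such a check might behave.

```python
from statistics import fmean
from typing import Callable, List, Sequence


def apply_aggregate(
    values: Sequence[float],
    agg_func: Callable,
    expected_length: int = 1,
) -> List[float]:
    """Summarize one interval's values and return a flat list."""
    result = agg_func(values)
    # A scalar becomes a one-element list; a 1-D result is copied as-is.
    if isinstance(result, (int, float)):
        flat = [float(result)]
    else:
        flat = [float(v) for v in result]
    # Assumption: a mismatch with the expected length is an error.
    if len(flat) != expected_length:
        raise ValueError(
            f"Expected {expected_length} values, got {len(flat)}."
        )
    return flat


def min_max(values: Sequence[float]) -> List[float]:
    """Example aggregate returning a 1-D result of length 2."""
    return [min(values), max(values)]
```

With these definitions, a mean aggregate pairs with the default length of 1, while min_max would need expected_agg_function_length=2.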
- class genomicarrays.build_options.MatrixOptions(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4)[source]¶
Bases:
object
Optional arguments for the matrix store for build_genomicarray().
- matrix_attr_name¶
Name of the matrix to be stored in the TileDB file. Defaults to “data”.
- skip¶
Whether to skip generating matrix TileDB. Defaults to False.
- dtype¶
NumPy dtype for the values in the matrix. Defaults to np.float32.
Note: make sure the matrix values fit within the range limits of the chosen dtype.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “coverage”.
- chunk_size¶
Size of chunks for parallel processing.
- compression¶
TileDB compression filter (None, ‘gzip’, ‘zstd’, ‘lz4’).
- compression_level¶
Compression level (1-9).
- __annotations__ = {'chunk_size': <class 'int'>, 'compression': typing.Literal['zstd', 'gzip', 'lz4'], 'compression_level': <class 'int'>, 'dtype': <class 'numpy.dtype'>, 'matrix_attr_name': <class 'str'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __dataclass_fields__ = {'chunk_size': Field(name='chunk_size',type=<class 'int'>,default=1000,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'compression': Field(name='compression',type=typing.Literal['zstd', 'gzip', 'lz4'],default='zstd',default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'compression_level': Field(name='compression_level',type=<class 'int'>,default=4,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'dtype': Field(name='dtype',type=<class 'numpy.dtype'>,default=<class 'numpy.float32'>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'matrix_attr_name': Field(name='matrix_attr_name',type=<class 'str'>,default='data',default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'skip': Field(name='skip',type=<class 'bool'>,default=False,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'tiledb_store_name': Field(name='tiledb_store_name',type=<class 'str'>,default='coverage',default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)}¶
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4)¶
- __match_args__ = ('skip', 'matrix_attr_name', 'dtype', 'tiledb_store_name', 'chunk_size', 'compression', 'compression_level')¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
float32
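One way to picture the chunk_size option of MatrixOptions is as a splitter that turns a range of items into consecutive half-open index ranges, each of which could be handed to a separate worker. The sketch below is only an illustration: iter_chunks is a hypothetical helper, and which items the package actually chunks is not stated in this reference.

```python
from typing import Iterator, Tuple


def iter_chunks(n_items: int, chunk_size: int = 1000) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs covering range(n_items).

    Each pair is half-open, so `end` is excluded; the final chunk
    may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer.")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)
```

For 2,500 items and the default chunk size this produces three ranges, the last covering only 500 items.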
- class genomicarrays.build_options.SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)[source]¶
Bases:
object
Optional arguments for the sample store for build_genomicarray().
- skip¶
Whether to skip generating the sample TileDB. Defaults to False.
- dtype¶
NumPy dtype for the sample dimension. Defaults to np.uint32.
Note: make sure the number of samples fits within the integer limits of the chosen dtype.
- tiledb_store_name¶
Name of the TileDB file. Defaults to “sample_metadata”.
- column_types¶
A dictionary containing column names as keys and values representing the type to use in TileDB.
If None, all columns are cast as ‘ascii’.
- __annotations__ = {'column_types': typing.Dict[str, numpy.dtype], 'dtype': <class 'numpy.dtype'>, 'skip': <class 'bool'>, 'tiledb_store_name': <class 'str'>}¶
- __dataclass_fields__ = {'column_types': Field(name='column_types',type=typing.Dict[str, numpy.dtype],default=None,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'dtype': Field(name='dtype',type=<class 'numpy.dtype'>,default=<class 'numpy.uint32'>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'skip': Field(name='skip',type=<class 'bool'>,default=False,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD), 'tiledb_store_name': Field(name='tiledb_store_name',type=<class 'str'>,default='sample_metadata',default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,_field_type=_FIELD)}¶
- __dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False)¶
- __eq__(other)¶
Return self==value.
- __hash__ = None¶
- __init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)¶
- __match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types')¶
- __repr__()¶
Return repr(self).
- dtype¶
alias of
uint32
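The dtype notes for the sample and feature dimensions amount to a range check on unsigned integers. The sketch below spells that check out without NumPy. The helper names are hypothetical, indices are assumed to start at zero, and any additional headroom TileDB itself may require is not modelled.

```python
# Bit widths of the unsigned NumPy integer dtypes, keyed by name.
UNSIGNED_BITS = {"uint8": 8, "uint16": 16, "uint32": 32, "uint64": 64}


def max_value(dtype_name: str) -> int:
    """Largest integer representable by the named unsigned dtype."""
    return 2 ** UNSIGNED_BITS[dtype_name] - 1


def fits_dimension(n_items: int, dtype_name: str = "uint32") -> bool:
    """True if zero-based indices 0..n_items-1 fit in the dtype."""
    return n_items >= 0 and n_items - 1 <= max_value(dtype_name)
```

With the default np.uint32 the limit is a little under 4.3 billion entries, so a smaller dtype such as uint16 is only safe for dimensions of at most 65,536 entries.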
genomicarrays.buildutils_tiledb_array module¶
- genomicarrays.buildutils_tiledb_array.create_tiledb_array(tiledb_uri_path, x_dim_length=None, y_dim_length=None, x_dim_name='feature_index', y_dim_name='sample_index', matrix_attr_name='data', x_dim_dtype=<class 'numpy.uint32'>, y_dim_dtype=<class 'numpy.uint32'>, matrix_dim_dtype=<class 'numpy.uint32'>, is_sparse=True)[source]¶
Create a TileDB array with the provided attributes in persistent storage.
This will materialize the array directory and all related schema files.
- Parameters:
tiledb_uri_path (str) – Path at which to create the TileDB array.
x_dim_length (int) – Number of entries along the x/fastest-changing dimension, e.g. the number of features. Defaults to None, in which case the max integer value of x_dim_dtype is used.
y_dim_length (int) – Number of entries along the y dimension, e.g. the number of samples. Defaults to None, in which case the max integer value of y_dim_dtype is used.
x_dim_name (str) – Name for the x-dimension. Defaults to “feature_index”.
y_dim_name (str) – Name for the y-dimension. Defaults to “sample_index”.
matrix_attr_name (str) – Name for the attribute in the array. Defaults to “data”.
x_dim_dtype (dtype) – NumPy dtype for the x-dimension. Defaults to np.uint32.
y_dim_dtype (dtype) – NumPy dtype for the y-dimension. Defaults to np.uint32.
matrix_dim_dtype (dtype) – NumPy dtype for the values in the matrix. Defaults to np.uint32.
is_sparse (bool) – Whether the matrix is sparse. Defaults to True.
- genomicarrays.buildutils_tiledb_array.optimize_tiledb_array(tiledb_array_uri, verbose=True)[source]¶
Consolidate TileDB fragments.
- genomicarrays.buildutils_tiledb_array.write_array_chunks_to_tiledb(tiledb_array_uri, data, x_idx, y_idx, value_dtype=<class 'numpy.uint32'>)[source]¶
Write chunks of the array to TileDB.
genomicarrays.dataloader module¶
genomicarrays.queryutils_tiledb_frame module¶
- genomicarrays.queryutils_tiledb_frame.get_a_column(tiledb_obj, column_name)[source]¶
Access column(s) from the TileDB object.
- genomicarrays.queryutils_tiledb_frame.get_index(tiledb_obj)[source]¶
Get the index of the TileDB object.
- Parameters:
tiledb_obj (Array) – A TileDB object.
- Return type:
list
- Returns:
A list containing the index values.
- genomicarrays.queryutils_tiledb_frame.get_schema_names_frame(tiledb_obj)[source]¶
Get Attributes from a TileDB object.
- genomicarrays.queryutils_tiledb_frame.subset_array(tiledb_obj, row_subset, column_subset, shape)[source]¶
Subset a TileDB object storing array data.
Uses multi_index to slice.
genomicarrays.utils_bw module¶
- genomicarrays.utils_bw.extract_bw_intervals_as_vec(bw_path, intervals, total_length, val_dtype=<class 'numpy.float32'>)[source]¶
Extract data from BigWig for a given region.
- Parameters:
- Return type:
- Returns:
A vector whose length is the number of intervals. Each entry holds a value if the file contains data for the corresponding region, or np.nan if the region is not measured.
- genomicarrays.utils_bw.extract_bw_values_as_vec(bw_path, intervals, total_length, agg_func=None, val_dtype=<class 'numpy.float32'>, outsize_per_feature=1)[source]¶
Extract data from BigWig for a given region and apply the aggregate function.
- Parameters:
bw_path (str) – Path to the BigWig file.
intervals (DataFrame) – List of intervals to extract.
agg_func (Optional[callable]) – Aggregate function to apply. Defaults to None.
val_dtype (dtype) – Dtype of the resulting array.
total_length (int) – Size of all the regions.
outsize_per_feature (int) – Expected length of the output after applying agg_func.
- Return type:
- Returns:
A vector whose length is the number of intervals × outsize_per_feature. Each entry holds a value if the file contains data for the corresponding region, or np.nan if the region is not measured.
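The shape of the returned vector can be illustrated without a BigWig file. In the sketch below a dictionary keyed by (seqname, start, end) is a hypothetical stand-in for the file, values_as_vec is not part of the package, and agg_func is made mandatory because the behaviour when it is None is not described here. Intervals with no data contribute NaN entries, so the output always has one slot per interval times outsize_per_feature.

```python
import math
from statistics import fmean
from typing import Callable, Dict, List, Sequence, Tuple

Interval = Tuple[str, int, int]


def values_as_vec(
    coverage: Dict[Interval, Sequence[float]],
    intervals: Sequence[Interval],
    agg_func: Callable,
    outsize_per_feature: int = 1,
) -> List[float]:
    """Aggregate each interval and concatenate the results into one vector."""
    out: List[float] = []
    for interval in intervals:
        values = coverage.get(interval)
        if not values:
            # Region not measured: pad with NaN to keep the layout fixed.
            out.extend([math.nan] * outsize_per_feature)
            continue
        result = agg_func(values)
        flat = list(result) if isinstance(result, (list, tuple)) else [result]
        out.extend(float(v) for v in flat)
    return out


# Toy data: only the first interval has coverage.
toy_coverage = {("chr1", 1000, 1500): [1.0, 3.0]}
toy_intervals = [("chr1", 1000, 1500), ("chr1", 2000, 2500)]
vec = values_as_vec(toy_coverage, toy_intervals, fmean)
```

Here vec has two entries: the mean of the measured interval followed by NaN for the interval that has no data.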