genomicarrays package

Submodules

genomicarrays.GenomicArrayDataset module

Query the GenomicArrayDataset.

This class provides methods to access the directory containing the TileDB files, usually generated by build_genomicarray().

Example

from genomicarrays import GenomicArrayDataset

garr = GenomicArrayDataset(
    dataset_path="/path/to/genomicarray/dir"
)

# First 10 features for the first sample
result1 = garr[0:10, 0]

print(result1)
class genomicarrays.GenomicArrayDataset.GenomicArrayDataset(dataset_path, matrix_tdb_uri='coverage', feature_annotation_uri='feature_annotation', sample_metadata_uri='sample_metadata')[source]

Bases: object

A class that represents a collection of features and their associated coverage in a TileDB-backed store.

__del__()[source]
__getitem__(args)[source]

Subset a GenomicArrayDataset.

Mostly an alias to get_slice().

Parameters:

args (Union[int, Sequence, tuple]) –

Integer indices, a boolean filter, or (if the current object is named) names specifying the ranges to be extracted.

Alternatively, a tuple of length 1. The first entry specifies the rows (or features) to retain based on their names or indices.

Alternatively, a tuple of length 2. The first entry specifies the rows (or features) to retain, while the second entry specifies the columns (or samples) to retain, based on their names or indices.

Raises:

ValueError – If too many or too few slices provided.

Return type:

GenomicArrayDatasetSlice

Returns:

A GenomicArrayDatasetSlice object containing the sample_metadata, feature_annotation and the matrix.
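
The accepted index forms can be pictured with a small sketch. `normalize_index` below is a hypothetical helper, not part of genomicarrays; it only illustrates how a bare index or a tuple of length 1 or 2 might be mapped to a (rows, columns) pair, and where the ValueError for too many or too few entries would arise. Treating a missing column entry as "keep all columns" is an assumption of this sketch.

```python
def normalize_index(args):
    """Map a __getitem__ argument to a (rows, columns) pair.

    Hypothetical illustration only; not the library's implementation.
    """
    # A bare index (int, slice, list, ...) selects rows and keeps all columns.
    if not isinstance(args, tuple):
        return args, slice(None)

    if len(args) == 0:
        raise ValueError("At least one slice must be provided.")
    if len(args) == 1:
        return args[0], slice(None)
    if len(args) == 2:
        return args[0], args[1]

    raise ValueError("Too many slices provided; expected at most 2.")


rows, columns = normalize_index((slice(0, 10), 0))
print(rows, columns)
```

Under these assumptions, garr[0:10, 0] and garr.get_slice(slice(0, 10), 0) would describe the same selection.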

__init__(dataset_path, matrix_tdb_uri='coverage', feature_annotation_uri='feature_annotation', sample_metadata_uri='sample_metadata')[source]

Initialize a GenomicArrayDataset.

Parameters:
  • dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path passed to build_genomicarray().

  • matrix_tdb_uri (str) – Relative path to matrix store.

  • feature_annotation_uri (str) – Relative path to feature annotation store.

  • sample_metadata_uri (str) – Relative path to sample metadata store.

__len__()[source]
__repr__()[source]
Return type:

str

Returns:

A string representation.

get_feature_annotation_column(column_name)[source]

Access a column from the feature_annotation store.

Parameters:

column_name (str) – Name of the column or attribute. Usually one of the column names returned by get_feature_annotation_columns().

Return type:

DataFrame

Returns:

A list of values for this column.

get_feature_annotation_columns()[source]

Get annotation column names from feature_annotation store.

Return type:

List[str]

Returns:

List of available annotations.

get_feature_annotation_index()[source]

Get index of the feature_annotation store.

Return type:

List[str]

Returns:

List of feature ids.

get_feature_subset(subset, columns=None)[source]

Slice the feature_annotation store.

Parameters:
  • subset (Union[slice, List[str], QueryCondition]) –

    A slice or list of integer indices to subset the feature_annotation store.

    Alternatively, may provide a tiledb.QueryCondition to query the store.

    Alternatively, may provide a list of strings to match with the index of feature_annotation store.

  • columns

    List of specific column names to access.

    Defaults to None, in which case all columns are extracted.

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.

get_matrix_subset(subset)[source]

Slice the matrix store.

Parameters:

subset (Union[int, Sequence, tuple]) – Any slice supported by TileDB’s array slicing. For more information, refer to the TileDB docs: https://docs.tiledb.com/main/how-to/arrays/reading-arrays/basic-reading

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.

get_sample_metadata_column(column_name)[source]

Access a column from the sample_metadata store.

Parameters:

column_name (str) – Name of the column or attribute. Usually one of the column names returned by get_sample_metadata_columns().

Return type:

DataFrame

Returns:

A list of values for this column.

get_sample_metadata_columns()[source]

Get column names from sample_metadata store.

Return type:

List[str]

Returns:

List of available metadata columns.

get_sample_subset(subset, columns=None)[source]

Slice the sample_metadata store.

Parameters:
  • subset (Union[slice, QueryCondition]) –

    A slice or list of integer indices to subset the sample_metadata store.

    Alternatively, may also provide a tiledb.QueryCondition to query the store.

  • columns

    List of specific column names to access.

    Defaults to None, in which case all columns are extracted.

Return type:

DataFrame

Returns:

A pandas DataFrame of the subset.
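
As a usage sketch, the helper below combines get_sample_metadata_columns() and get_sample_subset() to preview a few samples. `preview_samples` and its parameters are hypothetical names; the sketch assumes `dataset` is an opened GenomicArrayDataset and relies only on the method signatures documented above.

```python
def preview_samples(dataset, n_samples=5, max_columns=3):
    """Return the first few samples with a handful of metadata columns.

    `dataset` is assumed to be a GenomicArrayDataset (or any object exposing
    the same two methods).
    """
    # Discover which metadata columns the store provides.
    available = dataset.get_sample_metadata_columns()
    wanted = list(available)[:max_columns]

    # Slice the store by position and restrict the result to those columns.
    return dataset.get_sample_subset(slice(0, n_samples), columns=wanted)
```

With a real dataset this would be called as preview_samples(garr). According to the parameter description, a tiledb.QueryCondition could be passed in place of the slice.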

get_slice(feature_subset, sample_subset)[source]

Subset a GenomicArrayDataset.

Parameters:
  • sample_subset (Union[slice, List[str], QueryCondition]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the columns (or samples) to retain.

  • feature_subset (Union[slice, int]) – Integer indices, a boolean filter, or (if the current object is named) names specifying the rows (or features/genes) to retain.

Return type:

GenomicArrayDatasetSlice

Returns:

A GenomicArrayDatasetSlice object containing the sample_metadata, feature_annotation and the matrix for the given slice ranges.

property shape

genomicarrays.GenomicArrayDatasetSlice module

Class that represents a realized subset of the GenomicArrayDataset.

This module provides a data class for slices, usually generated by the access methods of GenomicArrayDataset.

Example

from genomicarrays import GenomicArrayDataset

garr = GenomicArrayDataset(
    dataset_path="/path/to/genomicarray/dir"
)
feature_indices = slice(0, 10)
result1 = garr[feature_indices, 0]

print(result1)
class genomicarrays.GenomicArrayDatasetSlice.GenomicArrayDatasetSlice(sample_metadata, feature_annotation, matrix)[source]

Bases: object

Class that represents a realized subset of the GenomicArrayDataset.

__eq__(other)

Return self==value.

__hash__ = None
__init__(sample_metadata, feature_annotation, matrix)
__len__()[source]
__match_args__ = ('sample_metadata', 'feature_annotation', 'matrix')
__repr__()[source]
Return type:

str

Returns:

A string representation.

feature_annotation: DataFrame
matrix: Any
sample_metadata: DataFrame
property shape

genomicarrays.build_genomicarray module

Build the GenomicArrayDataset.

This module provides tools for converting genomic data from BigWig format to TileDB. It supports parallel processing for handling large collections of genomic datasets.

Example

import tempfile

import numpy as np
import pandas as pd
import pyBigWig as bw
from genomicarrays import build_genomicarray, MatrixOptions

# Create a temporary directory for the TileDB output
tempdir = tempfile.mkdtemp()

# Open a BigWig file
bw1 = bw.open("path/to/object1.bw", "r")
# or just provide the path
bw2 = "path/to/object2.bw"

# Genomic intervals to extract from each BigWig file
features = pd.DataFrame({
    "seqnames": ["chr1", "chr1"],
    "starts": [1000, 2000],
    "ends": [1500, 2500]
})

# Build GenomicArray
dataset = build_genomicarray(
    files=[bw1, bw2],
    output_path=tempdir,
    features=features,
    # genome_fasta has no default in the signature; this path is a placeholder
    genome_fasta="path/to/genome.fasta",
    matrix_options=MatrixOptions(dtype=np.float32),
)
genomicarrays.build_genomicarray.build_genomicarray(files, output_path, features, genome_fasta, sample_metadata=None, sample_metadata_options=SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None), matrix_options=MatrixOptions(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4), feature_annotation_options=FeatureAnnotationOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types={'seqnames': 'ascii', 'starts': 'int', 'ends': 'int'}, aggregate_function=None, expected_agg_function_length=1), optimize_tiledb=True, num_threads=1)[source]

Generate the GenomicArrayDataset.

All files are expected to be consistent; any modifications needed to make them consistent are outside the scope of this function and package.

Parameters:
  • files (list) – List of file paths to BigWig files.

  • output_path (str) – Path to where the output TileDB files should be stored.

  • features (Union[str, DataFrame]) –

    A DataFrame containing the input genomic intervals.

    Alternatively, may provide path to the file containing a list of intervals. In this case, the first row is expected to contain the column names, “seqnames”, “starts” and “ends”.

    Additionally, the file may contain a column sequences, to specify the sequence string for each region. Otherwise, provide a link to the fasta file using the genome_fasta parameter.

  • genome_fasta (str) –

    Path to a fasta file containing the sequence information.

    Sequence information will be updated for each region in features if the DataFrame/path does not contain a column sequences.

  • sample_metadata (Union[DataFrame, str]) –

    A DataFrame containing the sample metadata for each file in files. Hence, the number of rows in the DataFrame must match the number of files.

    Alternatively, may provide path to the file containing a concatenated sample metadata across all BigWig files. In this case, the first row is expected to contain the column names.

    Additionally, the order of rows is expected to be in the same order as the input list of files.

    Defaults to None, in which case a simple sample metadata DataFrame is created containing one row per dataset, i.e. one per BigWig file. Each dataset is named sample_{i}, where i refers to the index position of the object in files.

  • sample_metadata_options (SampleMetadataOptions) – Optional parameters when generating sample_metadata store.

  • matrix_options (MatrixOptions) – Optional parameters when generating matrix store.

  • feature_annotation_options (FeatureAnnotationOptions) – Optional parameters when generating feature_annotation store.

  • optimize_tiledb (bool) – Whether to run TileDB’s vacuum and consolidation (may take a long time).

  • num_threads (int) – Number of threads. Defaults to 1.
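
Because consistency of the inputs is left to the caller, a small pre-flight check can catch mismatches before a long build starts. The sketch below is an illustration, not part of genomicarrays: `check_build_inputs` is a hypothetical helper that works on plain Python data (feature column names and a list of sample metadata rows) and checks only the two requirements stated above. The metadata values in the example call are made up.

```python
REQUIRED_FEATURE_COLUMNS = ("seqnames", "starts", "ends")


def check_build_inputs(files, feature_columns, sample_rows=None):
    """Raise ValueError if the inputs violate the documented expectations.

    Hypothetical helper for illustration; it does not call genomicarrays.
    """
    # The features table must provide the three interval columns.
    missing = [c for c in REQUIRED_FEATURE_COLUMNS if c not in feature_columns]
    if missing:
        raise ValueError(f"features is missing columns: {missing}")

    # When sample metadata is given, it needs one row per input file,
    # listed in the same order as `files`.
    if sample_rows is not None and len(sample_rows) != len(files):
        raise ValueError(
            f"expected {len(files)} sample metadata rows, got {len(sample_rows)}"
        )


# Made-up example inputs: two files, two metadata rows in the same order.
check_build_inputs(
    files=["path/to/object1.bw", "path/to/object2.bw"],
    feature_columns=["seqnames", "starts", "ends"],
    sample_rows=[{"condition": "control"}, {"condition": "treated"}],
)
```

With a pandas DataFrame, the column names could be supplied as list(features.columns).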

genomicarrays.build_options module

class genomicarrays.build_options.FeatureAnnotationOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types=None, aggregate_function=None, expected_agg_function_length=1)[source]

Bases: object

Optional arguments for the feature store for build_genomicarray().

skip

Whether to skip generating the feature annotation TileDB. Defaults to False.

dtype

NumPy dtype for the feature dimension. Defaults to np.uint32.

Note: make sure the number of features fits within the integer limits of the chosen dtype.

tiledb_store_name

Name of the TileDB file. Defaults to “feature_annotation”.

column_types

A dictionary with column names as keys and, as values, the type to use for each column in TileDB.

If None, all columns are cast as ‘ascii’.

aggregate_function

A callable to summarize the values in a given interval. The aggregate function is expected to return either a scalar value or a 1-dimensional NumPy ndarray.

Defaults to None.

expected_agg_function_length

Length of the output when an aggregate function is applied to an interval. Defaults to 1, expecting a scalar.

Note: ndarrays will be flattened before writing to TileDB.
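
As an illustration of the expected shape of an aggregate function, the sketch below defines a scalar summary that could be supplied as aggregate_function with expected_agg_function_length left at 1. It assumes the callable receives the values of one interval as an iterable of floats, which the description above does not state explicitly; `interval_mean` is a hypothetical name.

```python
import math


def interval_mean(values):
    """Mean of the measured values in one interval, ignoring NaN entries.

    Returns NaN when the interval has no measured values.
    """
    measured = [float(v) for v in values if not math.isnan(float(v))]
    if not measured:
        return math.nan
    return sum(measured) / len(measured)


print(interval_mean([1.0, 3.0, float("nan")]))
```

It would then be passed as FeatureAnnotationOptions(aggregate_function=interval_mean). A function returning a 1-dimensional array of k values would need expected_agg_function_length=k.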

__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='feature_annotation', column_types=None, aggregate_function=None, expected_agg_function_length=1)
__match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types', 'aggregate_function', 'expected_agg_function_length')
__post_init__()[source]
__repr__()

Return repr(self).

aggregate_function: Optional[Callable] = None
column_types: Dict[str, dtype] = None
dtype

alias of uint32

expected_agg_function_length: int = 1
skip: bool = False
tiledb_store_name: str = 'feature_annotation'
class genomicarrays.build_options.MatrixOptions(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4)[source]

Bases: object

Optional arguments for the matrix store for build_genomicarray().

matrix_attr_name

Name of the matrix to be stored in the TileDB file. Defaults to “data”.

skip

Whether to skip generating matrix TileDB. Defaults to False.

dtype

NumPy dtype for the values in the matrix. Defaults to np.float32.

Note: make sure the matrix values fit within the range limits of the chosen dtype.

tiledb_store_name

Name of the TileDB file. Defaults to “coverage”.

chunk_size

Size of chunks for parallel processing.

compression

TileDB compression filter (None, ‘gzip’, ‘zstd’, ‘lz4’).

compression_level

Compression level (1-9).

__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, matrix_attr_name='data', dtype=<class 'numpy.float32'>, tiledb_store_name='coverage', chunk_size=1000, compression='zstd', compression_level=4)
__match_args__ = ('skip', 'matrix_attr_name', 'dtype', 'tiledb_store_name', 'chunk_size', 'compression', 'compression_level')
__post_init__()[source]

Validate configuration.

__repr__()

Return repr(self).

chunk_size: int = 1000
compression: Literal['zstd', 'gzip', 'lz4'] = 'zstd'
compression_level: int = 4
dtype

alias of float32

matrix_attr_name: str = 'data'
skip: bool = False
tiledb_store_name: str = 'coverage'
class genomicarrays.build_options.SampleMetadataOptions(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)[source]

Bases: object

Optional arguments for the sample store for build_genomicarray().

skip

Whether to skip generating sample TileDB. Defaults to False.

dtype

NumPy dtype for the sample dimension. Defaults to np.uint32.

Note: make sure the number of samples fits within the integer limits of the chosen dtype.

tiledb_store_name

Name of the TileDB file. Defaults to “sample_metadata”.

column_types

A dictionary with column names as keys and, as values, the type to use for each column in TileDB.

If None, all columns are cast as ‘ascii’.
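
The note on dtype limits can be checked with simple arithmetic: an unsigned integer type with b bits can represent positions 0 through 2^b - 1. The helper below is a hypothetical, library-independent sketch of that check. TileDB or the library may reserve part of the range, so treat the result as an upper bound rather than a guarantee.

```python
def fits_unsigned_dtype(n_entries, bits=32):
    """Return True if positions 0..n_entries-1 fit in an unsigned `bits`-bit integer."""
    if n_entries < 0:
        raise ValueError("n_entries must be non-negative")
    max_value = 2**bits - 1  # e.g. 4294967295 for np.uint32
    # The largest index used is n_entries - 1.
    return n_entries == 0 or n_entries - 1 <= max_value


print(fits_unsigned_dtype(70_000, bits=16))  # False: uint16 tops out at 65535
print(fits_unsigned_dtype(70_000, bits=32))  # True
```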

__eq__(other)

Return self==value.

__hash__ = None
__init__(skip=False, dtype=<class 'numpy.uint32'>, tiledb_store_name='sample_metadata', column_types=None)
__match_args__ = ('skip', 'dtype', 'tiledb_store_name', 'column_types')
__repr__()

Return repr(self).

column_types: Dict[str, dtype] = None
dtype

alias of uint32

skip: bool = False
tiledb_store_name: str = 'sample_metadata'

genomicarrays.buildutils_tiledb_array module

genomicarrays.buildutils_tiledb_array.create_tiledb_array(tiledb_uri_path, x_dim_length=None, y_dim_length=None, x_dim_name='feature_index', y_dim_name='sample_index', matrix_attr_name='data', x_dim_dtype=<class 'numpy.uint32'>, y_dim_dtype=<class 'numpy.uint32'>, matrix_dim_dtype=<class 'numpy.uint32'>, is_sparse=True)[source]

Create a TileDB array with the provided attributes in persistent storage.

This will materialize the array directory and all related schema files.

Parameters:
  • tiledb_uri_path (str) – Path to create the array TileDB file.

  • x_dim_length (int) – Number of entries along the x/fastest-changing dimension, e.g. the number of features. Defaults to None, in which case the max integer value of x_dim_dtype is used.

  • y_dim_length (int) – Number of entries along the y dimension, e.g. the number of samples. Defaults to None, in which case the max integer value of y_dim_dtype is used.

  • x_dim_name (str) – Name for the x-dimension. Defaults to “feature_index”.

  • y_dim_name (str) – Name for the y-dimension. Defaults to “sample_index”.

  • matrix_attr_name (str) – Name for the attribute in the array. Defaults to “data”.

  • x_dim_dtype (dtype) – NumPy dtype for the x-dimension. Defaults to np.uint32.

  • y_dim_dtype (dtype) – NumPy dtype for the y-dimension. Defaults to np.uint32.

  • matrix_dim_dtype (dtype) – NumPy dtype for the values in the matrix. Defaults to np.uint32.

  • is_sparse (bool) – Whether the matrix is sparse. Defaults to True.

genomicarrays.buildutils_tiledb_array.optimize_tiledb_array(tiledb_array_uri, verbose=True)[source]

Consolidate TileDB fragments.

genomicarrays.buildutils_tiledb_array.write_array_chunks_to_tiledb(tiledb_array_uri, data, x_idx, y_idx, value_dtype=<class 'numpy.uint32'>)[source]

Write chunks of an array to TileDB.

genomicarrays.buildutils_tiledb_array.write_frame_intervals_to_tiledb(tiledb_array_uri, data, y_idx, value_dtype=<class 'numpy.float32'>)[source]

Append and save array data to TileDB. Expect data for one full sample (column).

Parameters:
  • tiledb_array_uri (Union[str, SparseArray]) – TileDB array object or path to a TileDB object.

  • data (ndarray) – NumPy array to write to TileDB; must contain the columns “start”, “end” and “value”.

  • value_dtype (dtype) – NumPy dtype to reformat the matrix values. Defaults to float32.

genomicarrays.dataloader module

genomicarrays.queryutils_tiledb_frame module

genomicarrays.queryutils_tiledb_frame.get_a_column(tiledb_obj, column_name)[source]

Access column(s) from the TileDB object.

Parameters:
  • tiledb_obj (Array) – A TileDB object.

  • column_name (Union[str, List[str]]) – Name(s) of the column to access.

Return type:

list

Returns:

List containing the column values.

genomicarrays.queryutils_tiledb_frame.get_index(tiledb_obj)[source]

Get the index of the TileDB object.

Parameters:

tiledb_obj (Array) – A TileDB object.

Return type:

list

Returns:

A list containing the index values.

genomicarrays.queryutils_tiledb_frame.get_schema_names_frame(tiledb_obj)[source]

Get Attributes from a TileDB object.

Parameters:

tiledb_obj (Array) – A TileDB object.

Return type:

List[str]

Returns:

List of schema attributes.

genomicarrays.queryutils_tiledb_frame.subset_array(tiledb_obj, row_subset, column_subset, shape)[source]

Subset a TileDB object storing array data.

Uses multi_index to slice.

Parameters:
  • tiledb_obj (Array) – A TileDB object.

  • row_subset (Union[slice, list, tuple]) – Subset along the row axis.

  • column_subset (Union[slice, list, tuple]) – Subset along the column axis.

  • shape (tuple) – Shape of the entire matrix.

Return type:

ndarray

Returns:

A dense array containing coverage.
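
One detail worth illustrating: TileDB's multi_index indexer treats range bounds as inclusive, whereas Python slices exclude the stop value. A caller therefore typically converts a slice to inclusive bounds using the axis length from shape. The helper below is a hypothetical sketch of that conversion (restricted to step 1), not the code used by subset_array.

```python
def slice_to_inclusive_bounds(subset, axis_length):
    """Convert a Python slice into inclusive (start, end) bounds.

    Returns None for an empty selection. Only step == 1 is supported.
    """
    start, stop, step = subset.indices(axis_length)
    if step != 1:
        raise ValueError("only contiguous slices (step 1) are supported")
    if stop <= start:
        return None
    # Python's stop is exclusive; an inclusive range ends one position earlier.
    return start, stop - 1


print(slice_to_inclusive_bounds(slice(0, 10), axis_length=100))
print(slice_to_inclusive_bounds(slice(None), axis_length=100))
```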

genomicarrays.queryutils_tiledb_frame.subset_frame(tiledb_obj, subset, columns)[source]

Subset a TileDB object.

Parameters:
  • tiledb_obj (Array) – TileDB object to subset.

  • subset (Union[slice, QueryCondition]) –

    A slice to subset.

    Alternatively, may provide a QueryCondition to subset the object.

  • columns (list) – List specifying the attributes from the schema to extract.

Return type:

DataFrame

Returns:

A sliced DataFrame or a matrix with the subset.

genomicarrays.utils_bw module

genomicarrays.utils_bw.extract_bw_intervals_as_vec(bw_path, intervals, total_length, val_dtype=<class 'numpy.float32'>)[source]

Extract data from BigWig for a given region.

Parameters:
  • bw_path (str) – Path to the BigWig file.

  • intervals (DataFrame) – List of intervals to extract.

  • total_length (int) – Size of all the regions.

  • val_dtype (dtype) – Dtype of the resulting array.

Return type:

ndarray

Returns:

A vector whose length equals the number of intervals. Each entry holds a value if the file contains data for the corresponding region, or np.nan if the region is not measured.

genomicarrays.utils_bw.extract_bw_values(bw_path, chrom)[source]
Return type:

Tuple[ndarray, int]

genomicarrays.utils_bw.extract_bw_values_as_vec(bw_path, intervals, total_length, agg_func=None, val_dtype=<class 'numpy.float32'>, outsize_per_feature=1)[source]

Extract data from BigWig for a given region and apply the aggregate function.

Parameters:
  • bw_path (str) – Path to the BigWig file.

  • intervals (DataFrame) – List of intervals to extract.

  • agg_func (Optional[callable]) – Aggregate function to apply. Defaults to None.

  • val_dtype (dtype) – Dtype of the resulting array.

  • total_length (int) – Size of all the regions.

  • outsize_per_feature (int) – Expected length of output after applying the agg_func.

Return type:

ndarray

Returns:

A vector of length number of intervals × outsize_per_feature. Each entry holds a value if the file contains data for the corresponding region, or np.nan if the region is not measured.
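
The layout of the returned vector can be sketched without reading a BigWig file. The function below is a hypothetical stand-in, not the library's implementation: it takes already-extracted per-interval values (with None marking an unmeasured region, an assumption of this sketch), applies an aggregate function, and produces a flat list whose length is the number of intervals times outsize_per_feature, with NaN for unmeasured regions.

```python
import math


def aggregate_intervals(per_interval_values, agg_func, outsize_per_feature=1):
    """Flatten aggregated interval values into one vector.

    `per_interval_values` holds one list of floats per interval, or None
    when the region is not measured.
    """
    vector = []
    for values in per_interval_values:
        if values is None:
            # Unmeasured region: fill its slots with NaN.
            vector.extend([math.nan] * outsize_per_feature)
            continue

        result = agg_func(values)
        # Accept either a scalar or a sequence of outsize_per_feature values.
        items = list(result) if isinstance(result, (list, tuple)) else [result]
        if len(items) != outsize_per_feature:
            raise ValueError("aggregate output has an unexpected length")
        vector.extend(float(x) for x in items)
    return vector


# Three intervals, the second unmeasured; each interval yields (min, max).
vec = aggregate_intervals(
    [[1.0, 3.0], None, [2.0, 6.0]],
    agg_func=lambda v: (min(v), max(v)),
    outsize_per_feature=2,
)
print(vec)
```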

genomicarrays.utils_bw.wrapper_extract_bw_values(bw_path, intervals, agg_func, val_dtype=<class 'numpy.float32'>, total_length=None, outsize_per_feature=1)[source]
Return type:

ndarray

Module contents