cellarr_array.dataloaders package

Submodules

cellarr_array.dataloaders.denseloader module

class cellarr_array.dataloaders.denseloader.DenseArrayDataset(array_uri, attribute_name='data', num_rows=None, num_columns=None, cellarr_ctx_config=None, transform=None)[source]

Bases: Dataset

__getitem__(idx)[source]
__init__(array_uri, attribute_name='data', num_rows=None, num_columns=None, cellarr_ctx_config=None, transform=None)[source]

PyTorch Dataset for dense TileDB arrays accessed via DenseCellArray.

Parameters:
  • array_uri (str) – URI of the TileDB dense array.

  • attribute_name (str) – Name of the attribute to read from.

  • num_rows (Optional[int]) – Total number of rows in the dataset. If None, will infer from array.shape[0].

  • num_columns (Optional[int]) – The number of columns in the dataset. If None, will attempt to infer from array.shape[1].

  • cellarr_ctx_config (Optional[dict]) – Optional TileDB context configuration dict for CellArray.

  • transform – Optional transform to be applied on a sample.

__len__()[source]
cellarr_array.dataloaders.denseloader.construct_dense_array_dataloader(array_uri, attribute_name='data', num_rows=None, num_columns=None, batch_size=1000, num_workers_dl=2)[source]

Construct a DenseArrayDataset wrapped in a PyTorch DataLoader.

Parameters:
  • array_uri (str) – URI of the TileDB array.

  • attribute_name (str) – Name of the attribute to read from.

  • num_rows (Optional[int]) – The total number of rows in the TileDB array.

  • num_columns (Optional[int]) – The total number of columns in the TileDB array.

  • batch_size (int) – Number of random samples per batch generated by the dataset.

  • num_workers_dl (int) – Number of worker processes for the DataLoader.

Return type:

DataLoader

cellarr_array.dataloaders.iterabledataloader module

class cellarr_array.dataloaders.iterabledataloader.CellArrayIterableDataset(array_uri, attribute_name, num_rows, num_columns, is_sparse, batch_size=1000, num_yields_per_epoch_per_worker=None, cellarr_ctx_config=None, transform=None)[source]

Bases: IterableDataset

A PyTorch IterableDataset that yields batches of randomly sampled rows from a TileDB array (dense or sparse) using cellarr-array.

An IterableDataset is responsible for yielding entire batches of data itself, giving full control over how each batch is formed, including performing a single bulk read from TileDB rather than one read per sample.
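The bulk-read pattern described above can be sketched as follows. This is a minimal illustration only, using an in-memory NumPy array in place of a TileDB read; the class name and attributes here are hypothetical, not part of the cellarr-array API:

```python
import numpy as np

class BulkBatchIterator:
    """Sketch: yield whole batches of randomly sampled rows, one bulk read each."""

    def __init__(self, data, batch_size, num_yields, seed=0):
        self.data = data              # stands in for the TileDB array
        self.batch_size = batch_size
        self.num_yields = num_yields
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        for _ in range(self.num_yields):
            # Sample row indices, then fetch them all in a single slice:
            # one bulk read from storage instead of one read per sample.
            idx = self.rng.integers(0, self.data.shape[0], size=self.batch_size)
            yield self.data[idx]

data = np.arange(20, dtype=np.float32).reshape(10, 2)
batches = list(BulkBatchIterator(data, batch_size=4, num_yields=3))
```

Because the dataset hands the DataLoader fully formed batches, the DataLoader is typically configured with `batch_size=None` and a collate function that only converts the batch to a tensor.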

__init__(array_uri, attribute_name, num_rows, num_columns, is_sparse, batch_size=1000, num_yields_per_epoch_per_worker=None, cellarr_ctx_config=None, transform=None)[source]

Initializes the CellArrayIterableDataset.

Parameters:
  • array_uri (str) – URI of the TileDB array.

  • attribute_name (str) – Name of the TileDB attribute to read.

  • num_rows (int) – The total number of rows in the TileDB array.

  • num_columns (int) – The total number of columns in the TileDB array.

  • is_sparse (bool) – True if the TileDB array is sparse, False if dense.

  • batch_size (int) – The number of random samples to include in each yielded batch. Defaults to 1000.

  • num_yields_per_epoch_per_worker (Optional[int]) – The number of batches this dataset’s iterator yields per worker in one epoch. If None, defaults to approximately covering all samples once across all workers. The total number of batches seen by the training loop per epoch is num_workers * num_yields_per_epoch_per_worker. Defaults to None.

  • cellarr_ctx_config (Optional[Dict]) – Configuration dictionary for the TileDB context used by cellarr-array. Defaults to None.

  • transform (Optional[Callable]) – A function/transform that takes the entire fetched batch (NumPy array for dense, SciPy sparse matrix for sparse) and returns a transformed version. Defaults to None.
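The default for num_yields_per_epoch_per_worker described above can be approximated with the arithmetic below. This is a hedged sketch of the stated behavior ("roughly covering all samples once across all workers"); the package may compute the default differently:

```python
import math

def default_yields_per_worker(num_rows, batch_size, num_workers):
    # One approximate pass over all rows per epoch, split across workers.
    total_batches = math.ceil(num_rows / batch_size)
    return max(1, math.ceil(total_batches / max(1, num_workers)))

# e.g. 10,000 rows in batches of 1,000 across 2 workers -> 5 yields per worker,
# so the training loop sees 2 * 5 = 10 batches per epoch.
```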

__iter__()[source]

Yields batches of randomly sampled data.

This method is called by the DataLoader for each worker.

Return type:

Iterator[Union[ndarray, spmatrix]]

cellarr_array.dataloaders.iterabledataloader.construct_iterable_dataloader(array_uri, is_sparse, attribute_name='data', num_rows=None, num_columns=None, batch_size=1000, num_workers_dl=2, num_yields_per_worker=5)[source]

Construct a CellArrayIterableDataset wrapped in a PyTorch DataLoader.

Parameters:
  • array_uri (str) – URI of the TileDB array.

  • attribute_name (str) – Name of the attribute to read from.

  • num_rows (int) – The total number of rows in the TileDB array.

  • num_columns (int) – The total number of columns in the TileDB array.

  • is_sparse (bool) – True if the array is sparse, False for dense.

  • batch_size (int) – Number of random samples per batch generated by the dataset.

  • num_workers_dl (int) – Number of worker processes for the DataLoader.

  • num_yields_per_worker (int) – Number of batches each worker should yield per epoch.

Return type:

DataLoader

cellarr_array.dataloaders.iterabledataloader.dense_batch_collate_fn(numpy_batch)[source]

Collate function for a dense batch from CellArrayIterableDataset.

Receives the numpy_batch directly from the dataset’s iterator.

Return type:

Tensor

cellarr_array.dataloaders.iterabledataloader.sparse_batch_collate_fn(scipy_sparse_batch)[source]

Collate function for a sparse batch from CellArrayIterableDataset.

Receives the scipy_sparse_batch directly from the dataset’s iterator.

Return type:

Tensor

cellarr_array.dataloaders.sparseloader module

class cellarr_array.dataloaders.sparseloader.SparseArrayDataset(array_uri, attribute_name='data', num_rows=None, num_columns=None, sparse_format=scipy.sparse.csr_matrix, cellarr_ctx_config=None, transform=None)[source]

Bases: Dataset

__getitem__(idx)[source]
__init__(array_uri, attribute_name='data', num_rows=None, num_columns=None, sparse_format=scipy.sparse.csr_matrix, cellarr_ctx_config=None, transform=None)[source]

PyTorch Dataset for sparse TileDB arrays accessed via SparseCellArray.

Parameters:
  • array_uri (str) – URI of the TileDB sparse array.

  • attribute_name (str) – Name of the attribute to read from.

  • num_rows (Optional[int]) – Total number of rows in the dataset. If None, will infer from array.shape[0].

  • num_columns (Optional[int]) – The number of columns in the dataset. If None, will attempt to infer from array.shape[1].

  • sparse_format – Sparse format to return samples in; defaults to scipy.sparse.csr_matrix.

  • cellarr_ctx_config (Optional[dict]) – Optional TileDB context configuration dict for CellArray.

  • transform – Optional transform to be applied on a sample.

__len__()[source]
cellarr_array.dataloaders.sparseloader.construct_sparse_array_dataloader(array_uri, attribute_name='data', num_rows=None, num_columns=None, batch_size=1000, num_workers_dl=2)[source]

Construct a SparseArrayDataset wrapped in a PyTorch DataLoader.

Parameters:
  • array_uri (str) – URI of the TileDB array.

  • attribute_name (str) – Name of the attribute to read from.

  • num_rows (Optional[int]) – The total number of rows in the TileDB array.

  • num_columns (Optional[int]) – The total number of columns in the TileDB array.

  • batch_size (int) – Number of random samples per batch generated by the dataset.

  • num_workers_dl (int) – Number of worker processes for the DataLoader.

Return type:

DataLoader

cellarr_array.dataloaders.sparseloader.sparse_coo_collate_fn(batch)[source]

Custom collate_fn for a batch of SciPy COO sparse matrices.

Converts them into a single batched PyTorch sparse COO tensor.

Each item in ‘batch’ is a SciPy coo_matrix representing one sample.
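Conceptually, this collation stacks the per-sample COO matrices along the batch dimension and merges their coordinates. A sketch using SciPy and NumPy is shown below; the real function goes one step further and builds a torch.sparse_coo_tensor from these parts, which is omitted here to keep the sketch dependency-light:

```python
import numpy as np
from scipy.sparse import coo_matrix, vstack

def stack_coo_batch(batch):
    """Merge a list of single-row coo_matrix samples into batched COO parts.

    Returns (indices, values, shape): the ingredients a batched
    torch.sparse_coo_tensor would be built from.
    """
    stacked = vstack(batch).tocoo()                  # row index == batch position
    indices = np.vstack([stacked.row, stacked.col])  # shape (2, nnz)
    return indices, stacked.data, stacked.shape

a = coo_matrix(np.array([[1.0, 0.0, 2.0]]))
b = coo_matrix(np.array([[0.0, 3.0, 0.0]]))
indices, values, shape = stack_coo_batch([a, b])
```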

cellarr_array.dataloaders.utils module

cellarr_array.dataloaders.utils.seed_worker(worker_id)[source]

Generate seeds for a PyTorch DataLoader worker.

This ensures that if multiple workers are sampling randomly, they use different sequences of random numbers.

Parameters:

worker_id (int) – The ID of the worker process.
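A typical implementation of such a seeding hook derives a distinct seed for each worker and applies it to every random number generator in use. The sketch below shows the pattern but is not necessarily the package's exact code; in a real PyTorch worker the base seed would come from torch.initial_seed(), which is replaced here by a parameter so the example runs standalone:

```python
import random
import numpy as np

def seed_worker(worker_id, base_seed=12345):
    # Derive a per-worker seed (kept within numpy's 32-bit seed range)
    # and apply it to both Python's and NumPy's global RNGs.
    worker_seed = (base_seed + worker_id) % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)
    return worker_seed

# Each worker gets a different seed, so random sampling streams diverge.
seeds = [seed_worker(i) for i in range(4)]
```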

Module contents