cellarr.ml package
Submodules
cellarr.ml.autoencoder module
- class cellarr.ml.autoencoder.AutoEncoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)
Bases: LightningModule
A class encapsulating training.
- __annotations__ = {}
- __init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)
Constructor.
- Parameters:
n_genes (int) – The number of genes in the gene space, representing the input dimensions.
latent_dim (int) – The latent space dimensions. Defaults to 128.
hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].
dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.
input_dropout (float) – The dropout rate for the input layer. Defaults to 0.4.
lr (float) – The initial learning rate. Defaults to 0.005.
residual (bool) – Whether to use residual connections. Defaults to False.
- forward(x)
Forward pass.
- Parameters:
x – Input tensor corresponding to the input layer.
- Returns:
Two tensors: the output of the last encoder layer and the output of the last decoder layer.
- get_loss(batch)
Calculate the loss.
- Parameters:
batch – A batch as produced by a PyTorch DataLoader.
- Returns:
The training loss.
- class cellarr.ml.autoencoder.Decoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)
Bases: Module
A class that encapsulates the decoder.
- __annotations__ = {}
- __init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)
Constructor.
- Parameters:
n_genes (int) – The number of genes in the gene space, representing the input dimensions.
latent_dim (int) – The latent space dimensions. Defaults to 128.
hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].
dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.
residual (bool) – Whether to use residual connections. Defaults to False.
- forward(x)
Forward pass.
- Parameters:
x – Input tensor corresponding to the input layer.
- Return type:
Tensor
- Returns:
Output tensor corresponding to the output layer.
- class cellarr.ml.autoencoder.Encoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)
Bases: Module
A class that encapsulates the encoder.
- __annotations__ = {}
- __init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)
Constructor.
- Parameters:
n_genes (int) – The number of genes in the gene space, representing the input dimensions.
latent_dim (int) – The latent space dimensions. Defaults to 128.
hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].
dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.
input_dropout (float) – The dropout rate for the input layer. Defaults to 0.4.
residual (bool) – Whether to use residual connections. Defaults to False.
- forward(x)
Forward pass.
- Parameters:
x (torch.Tensor) – Input tensor corresponding to the input layer.
- Return type:
Tensor
- Returns:
Output tensor corresponding to the output layer.
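The hidden_dim convention described for the classes above can be illustrated without PyTorch. The following is a minimal sketch, not the library's implementation: layer_shapes is a hypothetical helper that lists the (input, output) sizes each layer would need if hidden layers follow the list order in the encoder and the reverse order in the decoder.

```python
from typing import List, Tuple

Shape = Tuple[int, int]


def layer_shapes(
    n_genes: int, latent_dim: int, hidden_dim: List[int]
) -> Tuple[List[Shape], List[Shape]]:
    """Hypothetical helper: (in, out) sizes for encoder and decoder layers."""
    # Encoder: n_genes -> hidden_dim[0] -> ... -> hidden_dim[-1] -> latent_dim
    enc_sizes = [n_genes, *hidden_dim, latent_dim]
    # Decoder: latent_dim -> hidden_dim[-1] -> ... -> hidden_dim[0] -> n_genes
    dec_sizes = [latent_dim, *reversed(hidden_dim), n_genes]
    # Pair each size with its successor to get one (in, out) tuple per layer.
    encoder = list(zip(enc_sizes[:-1], enc_sizes[1:]))
    decoder = list(zip(dec_sizes[:-1], dec_sizes[1:]))
    return encoder, decoder


encoder, decoder = layer_shapes(n_genes=2000, latent_dim=128, hidden_dim=[1024, 512])
print(encoder)  # [(2000, 1024), (1024, 512), (512, 128)]
print(decoder)  # [(128, 512), (512, 1024), (1024, 2000)]
```

With unequal hidden sizes, as in this example, the mirrored ordering is visible: the decoder widens in the same steps by which the encoder narrows.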
cellarr.ml.dataloader module
A dataloader using TileDB files in the pytorch-lightning framework.
This module provides a dataloader that reads the TileDB files built by build_cellarrdataset().
Example
from cellarr.ml.dataloader import DataModule

datamodule = DataModule(
    dataset_path="/path/to/cellar/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    val_studies=["test3"],
    label_column_name="label",
    study_column_name="study",
    batch_size=100,
    lognorm=True,
    target_sum=1e4,
)

dataloader = datamodule.train_dataloader()
batch = next(iter(dataloader))
# collate() is documented as returning (input, label, study, sample).
data, labels, studies, samples = batch
print(data, labels, studies, samples)
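The lognorm and target_sum arguments of DataModule control log-normalization. The page does not spell out the formula, so the sketch below assumes the common single-cell convention: scale each cell's counts so they sum to target_sum, then apply log(1 + x). The lognorm function here is a hypothetical stand-in and may differ from what cellarr does internally.

```python
import math
from typing import List


def lognorm(counts: List[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x).

    Assumed convention for illustration; not taken from the cellarr source.
    """
    total = sum(counts)
    if total == 0:
        # An all-zero cell stays all-zero; this also avoids division by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


cell = [0, 3, 1, 6]  # raw counts for four genes in one cell
normalized = lognorm(cell, target_sum=1e4)
print(normalized)
```

Under this assumption, zero counts stay zero after the transform, and undoing the log with expm1 gives values that sum to target_sum.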
- class cellarr.ml.dataloader.BaseBatchSampler(data_df, int2sample, bsz, shuffle=True, **kwargs)
Bases: Sampler[int]
Simplest sampler class for composing samples into a mini-batch.
- __orig_bases__ = (torch.utils.data.sampler.Sampler[int],)
- __parameters__ = ()
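As a rough illustration of what a simple batch sampler does, the sketch below groups sample indices into mini-batches of bsz, optionally shuffling first. iter_batches is a hypothetical function, not BaseBatchSampler itself. The seed argument and the choice to keep a short final batch are assumptions made for the example.

```python
import random
from typing import Dict, Iterator, List, Optional


def iter_batches(
    int2sample: Dict[int, str],
    bsz: int,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> Iterator[List[int]]:
    """Yield lists of up to bsz sample indices (hypothetical helper)."""
    indices = sorted(int2sample)
    if shuffle:
        # A dedicated Random instance keeps the global RNG state untouched.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), bsz):
        # The final batch may be shorter than bsz (assumption: it is kept).
        yield indices[start : start + bsz]


int2sample = {0: "s0", 1: "s1", 2: "s2", 3: "s3", 4: "s4"}
batches = list(iter_batches(int2sample, bsz=2, shuffle=False))
print(batches)  # [[0, 1], [2, 3], [4]]
```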
- class cellarr.ml.dataloader.DataModule(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.ml.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.ml.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')
Bases: LightningDataModule
A class that extends a pytorch-lightning LightningDataModule to create pytorch dataloaders using TileDB.
The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.
- __annotations__ = {}
- __init__(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.ml.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.ml.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')
Initialize a DataModule.
- Parameters:
dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path from build_cellarrdataset().
cell_metadata_uri (str) – Relative path to the cell metadata store.
gene_annotation_uri (str) – Relative path to the gene annotation store.
matrix_uri (str) – Relative path to the matrix store.
label_column_name (str) – Column name in cell_metadata_uri containing cell labels.
study_column_name (str) – Column name in cell_metadata_uri containing study information.
val_studies (Optional[List[str]]) – List of studies to use for validation and test. If None, all studies are used for training.
gene_order (Optional[List[str]]) – List of genes to subset to from the gene space. If None, all genes from the gene_annotation are used for training.
batch_size (int) – Batch size to use, corresponding to the number of samples in a mini-batch. Defaults to 100.
sample_size (int) – Size of each sample used in a mini-batch, corresponding to the number of cells in a sample. Defaults to 100.
num_workers (int) – The number of worker threads for dataloaders. Defaults to 1.
lognorm (bool) – Whether to return log-normalized expression instead of raw counts. Defaults to True.
target_sum (float) – Target sum for log-normalization. Defaults to 10000.0.
sparse (bool) – Whether to return a sparse tensor. Defaults to False.
sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. If False, use random sampling. Defaults to False.
remove_singleton_classes (bool) – Exclude cells with classes that exist in only one sample. Defaults to False.
min_sample_size (Optional[int]) – Minimum number of cells a sample must contain to be valid. Defaults to None.
nan_string (str) – A string representing NaN. Defaults to "nan".
sampler_cls (Sampler) – Sampler class to use for batching. Defaults to BaseBatchSampler.
dataset_cls (Dataset) – Base Dataset class to use. Defaults to scDataset.
persistent_workers (bool) – If True, uses persistent workers in the DataLoaders. Defaults to False.
multiprocessing_context (str) – Multiprocessing context to use for the DataLoaders. Defaults to "spawn".
- collate(batch)
Collate tensors.
- Parameters:
batch – Batch to collate.
- Return type:
tuple
- Returns:
A Tuple[torch.Tensor, torch.Tensor, np.ndarray, np.ndarray] containing, in order, the input, label, study, and sample.
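Two of the DataModule options, val_studies and min_sample_size, describe filtering steps that are easy to picture on plain records. The sketch below uses lists of dicts in place of the cell metadata store. split_by_study and drop_small_samples are hypothetical helpers written for illustration, and the order in which cellarr applies such filters is not stated here.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Cell = Dict[str, str]


def split_by_study(
    cells: List[Cell], val_studies: Optional[List[str]] = None
) -> Tuple[List[Cell], List[Cell]]:
    """Hold out cells from val_studies; with None, everything is training."""
    if val_studies is None:
        return list(cells), []
    held_out = set(val_studies)
    train = [c for c in cells if c["study"] not in held_out]
    val = [c for c in cells if c["study"] in held_out]
    return train, val


def drop_small_samples(
    cells: List[Cell],
    min_sample_size: Optional[int] = None,
    sample_key: str = "cellarr_sample",
) -> List[Cell]:
    """Keep only cells whose sample has at least min_sample_size cells."""
    if min_sample_size is None:
        return list(cells)
    sizes = Counter(c[sample_key] for c in cells)
    return [c for c in cells if sizes[c[sample_key]] >= min_sample_size]


cells = [
    {"study": "test1", "cellarr_sample": "a"},
    {"study": "test1", "cellarr_sample": "a"},
    {"study": "test2", "cellarr_sample": "b"},
    {"study": "test3", "cellarr_sample": "c"},
]
train, val = split_by_study(cells, val_studies=["test3"])
kept = drop_small_samples(train, min_sample_size=2)
print(len(train), len(val), len(kept))  # 3 1 2
```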
- class cellarr.ml.dataloader.scDataset(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)
Bases: Dataset
A class that extends pytorch Dataset to enumerate cells and cell metadata using TileDB.
- __annotations__ = {}
- __init__(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)
Initialize a scDataset.
- Parameters:
data_df (DataFrame) – Pandas dataframe of valid cells.
int2sample (dict) – A mapping of sample index to sample id.
sample2cells (dict) – A mapping of sample id to cell indices.
sample_size (int) – Number of cells in one sample.
sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. Defaults to False.
- __parameters__ = ()
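The sampling_by_class option weights each cell inversely to the count of its class. A minimal sketch of that idea follows, using only the standard library. inverse_count_weights and draw_cells are hypothetical names, and drawing with replacement is an assumption made here rather than something the documentation states.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence


def inverse_count_weights(labels: Sequence[str]) -> List[float]:
    """Weight each cell by 1 / (number of cells sharing its class)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def draw_cells(
    cell_indices: Sequence[int],
    labels: Sequence[str],
    sample_size: int,
    sampling_by_class: bool = False,
    seed: Optional[int] = None,
) -> List[int]:
    """Draw sample_size cell indices, uniformly or class-weighted."""
    rng = random.Random(seed)
    weights = inverse_count_weights(labels) if sampling_by_class else None
    # random.choices samples with replacement; weights=None means uniform.
    return rng.choices(cell_indices, weights=weights, k=sample_size)


labels = ["T", "T", "T", "B"]
weights = inverse_count_weights(labels)
# Each "T" cell gets weight 1/3 and the single "B" cell gets weight 1, so
# both classes carry the same total weight.
drawn = draw_cells([10, 11, 12, 13], labels, sample_size=5, sampling_by_class=True, seed=0)
print(weights, drawn)
```

With these weights a rare class is drawn about as often as a common one, which is the effect the parameter description points at.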