cellarr.ml package

Submodules

cellarr.ml.autoencoder module

class cellarr.ml.autoencoder.AutoEncoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)[source]

Bases: LightningModule

A LightningModule that encapsulates training of the autoencoder (an Encoder paired with a Decoder).

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, lr=0.005, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the input dimensions.

  • latent_dim (int) – The latent space dimensions. Defaults to 128.

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].

  • dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.

  • input_dropout (float) – The dropout rate for the input layer. Defaults to 0.4.

  • lr (float) – The initial learning rate. Defaults to 0.005.

  • residual (bool) – Whether to use residual connections. Defaults to False.
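
To make the ordering rule for hidden_dim concrete, the sketch below computes the (input, output) sizes of each layer implied by the constructor arguments. It assumes a plain stack of fully connected layers. The helper name layer_shapes is hypothetical and not part of cellarr, and the library's actual layer construction may differ (for example when residual=True).

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, int]


def layer_shapes(
    n_genes: int, latent_dim: int = 128, hidden_dim: Sequence[int] = (1024, 1024)
) -> Tuple[List[Shape], List[Shape]]:
    """Hypothetical helper: (in, out) sizes for encoder and decoder layers."""
    hidden = list(hidden_dim)
    # Encoder: n_genes -> hidden layers in list order -> latent_dim.
    enc_dims = [n_genes] + hidden + [latent_dim]
    # Decoder: latent_dim -> hidden layers in reverse order -> n_genes.
    dec_dims = [latent_dim] + hidden[::-1] + [n_genes]
    encoder = list(zip(enc_dims[:-1], enc_dims[1:]))
    decoder = list(zip(dec_dims[:-1], dec_dims[1:]))
    return encoder, decoder


encoder, decoder = layer_shapes(n_genes=2000, latent_dim=128, hidden_dim=[1024, 512])
print(encoder)  # [(2000, 1024), (1024, 512), (512, 128)]
print(decoder)  # [(128, 512), (512, 1024), (1024, 2000)]
```

With unequal hidden sizes, the reversal is visible: the decoder widens from 512 to 1024 before projecting back to the gene space.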

configure_optimizers()[source]

Configure optimizers.

forward(x)[source]

Forward pass.

Parameters:

x – Input tensor corresponding to the input layer.

Returns:

Output tensor corresponding to the last encoder layer.

Output tensor corresponding to the last decoder layer.

get_loss(batch)[source]

Calculate the loss.

Parameters:

batch – A batch as defined by a PyTorch DataLoader.

Returns:

The training loss.

load_state(encoder_filename, decoder_filename, use_gpu=False)[source]

Load model state.

Parameters:
  • encoder_filename (str) – Filename containing the encoder model state.

  • decoder_filename (str) – Filename containing the decoder model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

on_validation_epoch_end()[source]

PyTorch Lightning hook that runs the evaluation at the end of a validation epoch.

on_validation_epoch_start()[source]

PyTorch Lightning hook that runs at the start of a validation epoch.

save_all(model_path)[source]
training_step(batch, batch_idx)[source]

PyTorch Lightning training step.

validation_step(batch, batch_idx)[source]

PyTorch Lightning validation step.

class cellarr.ml.autoencoder.Decoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)[source]

Bases: Module

A class that encapsulates the decoder.

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the output dimensions of the decoder.

  • latent_dim (int) – The latent space dimensions. Defaults to 128.

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].

  • dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.

  • residual (bool) – Whether to use residual connections. Defaults to False.

forward(x)[source]

Forward pass.

Parameters:

x – Input tensor corresponding to the input layer.

Return type:

Tensor

Returns:

Output tensor corresponding to output layer.

load_state(filename, use_gpu=False)[source]

Load model state.

Parameters:
  • filename (str) – Filename containing the model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

save_state(filename)[source]

Save model state.

Parameters:

filename (str) – Filename to save the model state.

class cellarr.ml.autoencoder.Encoder(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)[source]

Bases: Module

A class that encapsulates the encoder.

__annotations__ = {}
__init__(n_genes, latent_dim=128, hidden_dim=[1024, 1024], dropout=0.5, input_dropout=0.4, residual=False)[source]

Constructor.

Parameters:
  • n_genes (int) – The number of genes in the gene space, representing the input dimensions.

  • latent_dim (int) – The latent space dimensions. Defaults to 128.

  • hidden_dim (List[int]) – A list of hidden layer dimensions, describing the number of layers and their dimensions. Hidden layers are constructed in the order of the list for the encoder and in reverse for the decoder. Defaults to [1024, 1024].

  • dropout (float) – The dropout rate for hidden layers. Defaults to 0.5.

  • input_dropout (float) – The dropout rate for the input layer. Defaults to 0.4.

  • residual (bool) – Whether to use residual connections. Defaults to False.

forward(x)[source]

Forward pass.

Parameters:

x (torch.Tensor) – Input tensor corresponding to the input layer.

Return type:

Tensor

Returns:

Output tensor corresponding to output layer.

load_state(filename, use_gpu=False)[source]

Load model state.

Parameters:
  • filename (str) – Filename containing the model state.

  • use_gpu (bool) – Boolean indicating whether or not to use GPUs.

save_state(filename)[source]

Save model state.

Parameters:

filename (str) – Filename to save the model state.

cellarr.ml.dataloader module

A dataloader using TileDB files in the pytorch-lightning framework.

This module provides a dataloader over the TileDB files generated by build_cellarrdataset().

Example

from cellarr.ml.dataloader import DataModule

# Build a data module over an existing cellarr TileDB collection.
datamodule = DataModule(
    dataset_path="/path/to/cellarr/dir",
    cell_metadata_uri="cell_metadata",
    gene_annotation_uri="gene_annotation",
    matrix_uri="counts",
    val_studies=["test3"],
    label_column_name="label",
    study_column_name="study",
    batch_size=100,
    lognorm=True,
    target_sum=1e4,
)

# Fetch one training batch; collate() returns (input, label, study, sample).
dataloader = datamodule.train_dataloader()
batch = next(iter(dataloader))
data, labels, studies, samples = batch
print(data, labels, studies, samples)

class cellarr.ml.dataloader.BaseBatchSampler(data_df, int2sample, bsz, shuffle=True, **kwargs)[source]

Bases: Sampler[int]

The simplest sampler class for composing samples into mini-batches.

__init__(data_df, int2sample, bsz, shuffle=True, **kwargs)[source]

Constructor.

Parameters:
  • data_df (DataFrame) – DataFrame with columns “study::::sample”.

  • int2sample (dict) – Dictionary mapping an integer index to a sample id.

  • bsz (int) – Batch size.

  • shuffle (bool) – Whether to shuffle the samples across epochs. Defaults to True.
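
As an illustration of how a sampler can turn the int2sample mapping into mini-batches, here is a minimal, dependency-free sketch. ToyBatchSampler, its seed argument, and the sample ids are hypothetical. It is not the BaseBatchSampler implementation, and whether the real class yields single indices or lists of indices is not stated here.

```python
import random
from typing import Dict, Iterator, List


class ToyBatchSampler:
    """Hypothetical stand-in for a batch sampler over sample indices."""

    def __init__(
        self, int2sample: Dict[int, str], bsz: int, shuffle: bool = True, seed: int = 0
    ) -> None:
        self.indices = sorted(int2sample)
        self.bsz = bsz
        self.shuffle = shuffle
        self.rng = random.Random(seed)  # seeded for reproducibility (assumption)

    def __len__(self) -> int:
        # Number of mini-batches per epoch, counting a final partial batch.
        return (len(self.indices) + self.bsz - 1) // self.bsz

    def __iter__(self) -> Iterator[List[int]]:
        order = list(self.indices)
        if self.shuffle:
            # Reshuffle on every pass so batch composition changes per epoch.
            self.rng.shuffle(order)
        for start in range(0, len(order), self.bsz):
            yield order[start : start + self.bsz]


int2sample = {0: "sample_a", 1: "sample_b", 2: "sample_c", 3: "sample_d", 4: "sample_e"}
sampler = ToyBatchSampler(int2sample, bsz=2)
for batch in sampler:
    print([int2sample[i] for i in batch])
```

Each pass over the sampler visits every sample index exactly once, and the last batch may be smaller than bsz.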

__iter__()[source]
__len__()[source]
Return type:

int

__orig_bases__ = (torch.utils.data.sampler.Sampler[int],)
__parameters__ = ()
class cellarr.ml.dataloader.DataModule(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.ml.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.ml.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')[source]

Bases: LightningDataModule

A class that extends the PyTorch Lightning LightningDataModule to create PyTorch dataloaders using TileDB.

The dataloader uniformly samples across training labels and study labels to create a diverse batch of cells.

__annotations__ = {}
__del__()[source]
__init__(dataset_path, cell_metadata_uri='cell_metadata', gene_annotation_uri='gene_annotation', matrix_uri='assays/counts', label_column_name='celltype_id', study_column_name='study', sample_column_name='cellarr_sample', val_studies=None, gene_order=None, batch_size=100, sample_size=100, num_workers=1, lognorm=True, target_sum=10000.0, sparse=False, sampling_by_class=False, remove_singleton_classes=False, min_sample_size=None, nan_string='nan', sampler_cls=<class 'cellarr.ml.dataloader.BaseBatchSampler'>, dataset_cls=<class 'cellarr.ml.dataloader.scDataset'>, persistent_workers=False, multiprocessing_context='spawn')[source]

Initialize a DataModule.

Parameters:
  • dataset_path (str) – Path to the directory containing the TileDB stores. Usually the output_path used with build_cellarrdataset().

  • cell_metadata_uri (str) – Relative path to the cell metadata store. Defaults to “cell_metadata”.

  • gene_annotation_uri (str) – Relative path to the gene annotation store. Defaults to “gene_annotation”.

  • matrix_uri (str) – Relative path to the matrix store. Defaults to “assays/counts”.

  • label_column_name (str) – Column name in cell_metadata_uri containing cell labels. Defaults to “celltype_id”.

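  • study_column_name (str) – Column name in cell_metadata_uri containing study information. Defaults to “study”.

  • sample_column_name (str) – Column name in cell_metadata_uri containing sample information. Defaults to “cellarr_sample”.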

  • val_studies (Optional[List[str]]) – List of studies to use for validation and test. If None, all studies are used for training.

  • gene_order (Optional[List[str]]) – List of genes to subset to from the gene space. If None, all genes from the gene_annotation are used for training.

  • batch_size (int) – Batch size to use, corresponding to the number of samples in a mini-batch. Defaults to 100.

  • sample_size (int) – Size of each sample used in a mini-batch, corresponding to the number of cells in a sample. Defaults to 100.

  • num_workers (int) – The number of worker processes for dataloaders. Defaults to 1.

  • lognorm (bool) – Whether to return log-normalized expression instead of raw counts. Defaults to True.

  • target_sum (float) – Target sum for log-normalization. Defaults to 1e4.

  • sparse (bool) – Whether to return a sparse tensor. Defaults to False.

  • sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. If False, use random sampling. Defaults to False.

  • remove_singleton_classes (bool) – Exclude cells with classes that exist in only one sample. Defaults to False.

  • min_sample_size (Optional[int]) – Set a minimum number of cells in a sample for it to be valid. Defaults to None.

  • nan_string (str) – A string representing NaN. Defaults to “nan”.

  • sampler_cls (Sampler) – Sampler class to use for batching. Defaults to BaseBatchSampler.

  • dataset_cls (Dataset) – Base Dataset class to use. Defaults to scDataset.

  • persistent_workers (bool) – If True, uses persistent workers in the DataLoaders. Defaults to False.

  • multiprocessing_context (str) – Multiprocessing context to use for the DataLoaders. Defaults to “spawn”.
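
The lognorm and target_sum parameters can be read against the log-normalization convention commonly used for single-cell counts: scale each cell so its counts sum to target_sum, then take log(1 + x). The sketch below shows that convention for a single cell in plain Python. The function name and the handling of empty cells are assumptions, and the exact transform applied by DataModule is not specified here.

```python
import math
from typing import List, Sequence


def lognorm(counts: Sequence[float], target_sum: float = 1e4) -> List[float]:
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros; this avoids a division by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


cell = [0, 3, 1, 6]  # raw counts for four genes in one cell
normalized = lognorm(cell, target_sum=1e4)
print([round(v, 3) for v in normalized])
```

Zero counts remain zero after the transform, and the ordering of genes by expression within a cell is preserved.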

__repr__()[source]
Return type:

str

Returns:

A string representation.

collate(batch)[source]

Collate tensors.

Parameters:

batch – Batch to collate.

Returns:

tuple

A Tuple[torch.Tensor, torch.Tensor, np.ndarray, np.ndarray] containing, in order, the input, label, study, and sample information for the batch.
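
The four-field return value can be pictured with a small stand-in. The sketch below uses plain lists in place of tensors and arrays so that it runs without dependencies. toy_collate is hypothetical, and it assumes each dataset item already carries several cells with their labels, studies, and samples, which may not match how collate() is actually implemented.

```python
from typing import List, Sequence, Tuple

Item = Tuple[List[List[float]], List[int], List[str], List[str]]


def toy_collate(batch: Sequence[Item]) -> Item:
    """Hypothetical collate: concatenate per-item fields along the cell axis."""
    inputs: List[List[float]] = []
    labels: List[int] = []
    studies: List[str] = []
    samples: List[str] = []
    for x, y, study, sample in batch:
        inputs.extend(x)
        labels.extend(y)
        studies.extend(study)
        samples.extend(sample)
    return inputs, labels, studies, samples


batch = [
    ([[1.0, 0.0], [0.0, 2.0]], [0, 1], ["study1", "study1"], ["s1", "s1"]),
    ([[3.0, 1.0]], [1], ["study2"], ["s2"]),
]
data, labels, studies, samples = toy_collate(batch)
print(len(data), labels, studies, samples)
```

The point of the sketch is the shape of the result: four parallel fields with one entry per cell, unpacked in the order input, label, study, sample.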

filter_db()[source]
train_dataloader()[source]

Load the training dataset.

Return type:

DataLoader

Returns:

A DataLoader object containing the training dataset.

val_dataloader()[source]

Load the validation dataset.

Return type:

DataLoader

Returns:

A DataLoader object containing the validation dataset.
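
The split between the training and validation dataloaders follows from val_studies. As a rough sketch of that idea, the function below partitions cell metadata rows by study. The helper split_by_study and the example rows are hypothetical, and the real DataModule applies further filtering that is not shown here.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, str]


def split_by_study(
    rows: Sequence[Row],
    study_column_name: str = "study",
    val_studies: Optional[Sequence[str]] = None,
) -> Tuple[List[Row], List[Row]]:
    """Hypothetical helper: hold out every cell from the validation studies."""
    held_out = set(val_studies or [])
    train = [r for r in rows if r[study_column_name] not in held_out]
    val = [r for r in rows if r[study_column_name] in held_out]
    return train, val


rows = [
    {"study": "test1", "label": "T cell"},
    {"study": "test1", "label": "B cell"},
    {"study": "test2", "label": "T cell"},
    {"study": "test3", "label": "B cell"},
]
train_rows, val_rows = split_by_study(rows, val_studies=["test3"])
print(len(train_rows), len(val_rows))  # 3 1
```

With val_studies=None, every row lands in the training split, matching the documented behavior that all studies are then used for training.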

class cellarr.ml.dataloader.scDataset(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)[source]

Bases: Dataset

A class that extends the PyTorch Dataset to enumerate cells and cell metadata using TileDB.

__annotations__ = {}
__getitem__(idx)[source]
__init__(data_df, int2sample, sample2cells, sample_size, sampling_by_class=False)[source]

Initialize a scDataset.

Parameters:
  • data_df (DataFrame) – pandas DataFrame of valid cells.

  • int2sample (dict) – A mapping of sample index to sample id.

  • sample2cells (dict) – A mapping of sample id to cell indices.

  • sample_size (int) – Number of cells in one sample.

  • sampling_by_class (bool) – Sample based on class counts, where sampling weight is inversely proportional to count. Defaults to False.
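
To illustrate what a sampling weight that is inversely proportional to class count means, the sketch below weights each cell by 1 / (number of cells sharing its label) and draws sample_size cells. The function names, the seed argument, and the choice to sample with replacement are assumptions for illustration. The actual sampling logic in scDataset is not documented here.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_class_weights(labels: Sequence[str]) -> List[float]:
    """Weight each cell by the inverse of its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_cells(
    labels: Sequence[str],
    sample_size: int,
    sampling_by_class: bool = False,
    seed: int = 0,
) -> List[int]:
    """Hypothetical helper: draw cell indices, optionally class-balanced."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    weights = inverse_class_weights(labels) if sampling_by_class else None
    # random.choices samples with replacement; weights=None means uniform.
    return rng.choices(indices, weights=weights, k=sample_size)


labels = ["T cell"] * 8 + ["B cell"] * 2
weights = inverse_class_weights(labels)
picked = sample_cells(labels, sample_size=5, sampling_by_class=True)
print(weights[0], weights[-1])  # 0.125 0.5
print([labels[i] for i in picked])
```

Under this weighting, the total weight of each class is equal, so a rare class is as likely to be drawn as a common one.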

__len__()[source]
__parameters__ = ()
__repr__()[source]
Return type:

str

Returns:

A string representation.

Module contents