# CellArr Collections: Using cellxgene datasets

So you've got loads of single-cell data piling up on your hard drive and slowing down your analysis workflows? Welcome to the club! Cell Arrays is a Python package that provides a TileDB-backed store for large collections of genomic experimental data, such as millions of cells across multiple single-cell experiment objects.

The `CellArrDataset` is designed to store single-cell RNA-seq datasets but can be generalized to store any 2-dimensional experimental data.

## Tutorial: A Real-World Example

To demonstrate how to build your own `CellArr`-compatible TileDB collection, we'll download a few datasets from cellxgene.

- Buccal and Labial Atlas (30K cells): https://datasets.cellxgene.cziscience.com/e4e24745-5822-4838-8ad9-41e116b71964.h5ad
- Astrocytes (18K cells): https://datasets.cellxgene.cziscience.com/e2012ca6-059f-4af7-80fe-886579bfaeee.h5ad
- Microglia (1.6K cells): https://datasets.cellxgene.cziscience.com/f9bca543-f82a-4863-bd62-b270b634ea10.h5ad
- Oligodendrocytes (25K cells): https://datasets.cellxgene.cziscience.com/da57b970-3c33-49bb-a951-bfc9b99dbd23.h5ad
- thymus scRNA-seq atlas - B cell subset: https://datasets.cellxgene.cziscience.com/fc6eb410-6603-4e98-9d5f-32c20ffa2456.h5ad

Don't worry if your workflow is different - the beauty of `CellArr` is its flexibility. The only things you need to care about are the key column names that make your files CellArr-compatible. Refer to the image in the README for these key column names.

## Step 0: Download All the Things

First, let's grab those datasets.

```python
!wget https://datasets.cellxgene.cziscience.com/fc6eb410-6603-4e98-9d5f-32c20ffa2456.h5ad
!wget https://datasets.cellxgene.cziscience.com/da57b970-3c33-49bb-a951-bfc9b99dbd23.h5ad
!wget https://datasets.cellxgene.cziscience.com/e4e24745-5822-4838-8ad9-41e116b71964.h5ad
!wget https://datasets.cellxgene.cziscience.com/e2012ca6-059f-4af7-80fe-886579bfaeee.h5ad
!wget https://datasets.cellxgene.cziscience.com/f9bca543-f82a-4863-bd62-b270b634ea10.h5ad
```

```python
files = [
    "fc6eb410-6603-4e98-9d5f-32c20ffa2456.h5ad",
    "da57b970-3c33-49bb-a951-bfc9b99dbd23.h5ad",
    "e4e24745-5822-4838-8ad9-41e116b71964.h5ad",
    "e2012ca6-059f-4af7-80fe-886579bfaeee.h5ad",
    "f9bca543-f82a-4863-bd62-b270b634ea10.h5ad",
]
```

## Step 1: Identify Your Feature Space

By default, CellArr's build scans all your files and computes a union of all features. You can customize this if:

1. You already have a pre-filtered gene list
2. You want more control over what features make the final cut

After peeking at these datasets, we notice they use Ensembl IDs in the `var` index. Let's use those as our feature space, which helps align expression vectors across datasets.

```python
import anndata


def get_feature_space(path):
    ad = anndata.read_h5ad(path, "r")
    return ad.var.index.tolist()
```

```python
from tqdm import tqdm

feature_set = set()
for i in tqdm(files):
    feature_set = feature_set.union(get_feature_space(i))

print("Total # of features:", len(feature_set))
```

> 💡 **Pro Tip:** If you're processing enough files to finish a Netflix series while waiting, consider using joblib or other parallelization tools (see the sketch at the end of this step).

```python
import pandas as pd

sorted_features = sorted(list(feature_set))
FEATURE_DF = pd.DataFrame({"cellarr_gene_index": sorted_features})
FEATURE_DF.head()
```

> 📝 **Note:** The current version of CellArr calls the index column `cellarr_gene_index`. We might generalize this name in a future version.
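As promised in the Pro Tip above, here is a minimal sketch of how the per-file feature scan could be parallelized with joblib. This is an optional convenience and not part of CellArr itself; the `n_jobs` value is an arbitrary assumption you should tune to your machine.

```python
from joblib import Parallel, delayed

# Read each file's feature index in a separate worker;
# get_feature_space() and files are defined above.
feature_lists = Parallel(n_jobs=4)(
    delayed(get_feature_space)(f) for f in files
)

# Union the per-file lists into one feature set, as before.
feature_set = set()
for fl in feature_lists:
    feature_set.update(fl)

print("Total # of features:", len(feature_set))
```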
## Step 2: Wrangling Cell Annotations By default, CellArr aggregates ALL cell annotations across datasets, which can get... interesting. You have two options: 1. Combine all `obs` objects and then filter out the columns you don't need 2. Provide a list of column names and their types, and CellArr will use "None" for missing values across datasets We'll go with option 1 since we're working with a manageable number of datasets. But be warned: with large datasets, this might need more RAM. ```python def get_cell_annotations(path): ad = anndata.read_h5ad(path, "r") return ad.obs from tqdm import tqdm obs = [] for i in tqdm(files): obs.append(get_cell_annotations(i)) combined_obs = pd.concat(obs, ignore_index=True, axis=0, sort=False).astype(pd.StringDtype()) combined_cell_df = combined_obs.reset_index(drop=True) # Let's keep only the columns we actually care about - goodbye, clutter! columns_to_keep = ["donor_id", "sample_source", "tissue_type", "is_primary_data", "author_cell_type", "cell_type", "disease", "sex", "tissue"] CELL_ANNOTATIONS_DF = combined_cell_df[columns_to_keep] CELL_ANNOTATIONS_DF # Behold, your streamlined metadata! ``` ## Step 3a: Matrix Maneuvers (Optional) > 💡 **Heads up:** This step is optional. CellArr's build process only recognizes matrices in the `layers` slot of the `AnnData` structure. The files we got from cellxgene use `X` instead, so we need to do a quick switch. ```python def process_adata(path): ad = anndata.read_h5ad(path) ad.layers["counts"] = ad.X # Move the matrix to where CellArr expects it return ad adatas = [process_adata(x) for x in files] # Apply to all files ``` ```python sum([len(a.obs) for a in adatas]) # Total cell count - prepare to be impressed (or terrified) ``` ## Step 3b: Building The TileDB Files This is where CellArr transforms your collection of disparate datasets into a single, queryable TileDB. ```python from cellarr import build_cellarrdataset, CellArrDataset, MatrixOptions import numpy as np import os import shutil # Let's create a fresh directory for our collection output_dir = "./my_collection.tdb" if os.path.exists(output_dir): shutil.rmtree(output_dir) os.mkdir(output_dir) # Build the dataset - this is where the magic happens! dataset = build_cellarrdataset( gene_annotation=FEATURE_DF, cell_metadata=CELL_ANNOTATIONS_DF, output_path="./my_collection.tdb", files=adatas, matrix_options=MatrixOptions(matrix_name="counts", dtype=np.int16), num_threads=4, # Adjust based on your CPU - more threads = more speed (usually) ) ``` > 🎉 **That's it!** Really, it's that simple. For those running on a high-performance computing cluster, check out the [slurm-based build process](https://github.com/CellArr/cellarr?tab=readme-ov-file#building-on-hpc-environments-with-slurm). For more complex workflows, refer to the [full CellArr documentation](https://github.com/CellArr/cellarr). ## Exploring Your New TileDB Collection Now for the fun part - let's interact with our collection! Having a single collection opens up all sorts of possibilities: - Explore marker expression across cell types, tissues, and disease states or any available cell metadata - Develop new methods without data wrangling headaches - Train machine learning models on unified data > 🤖 **Did you know?** CellArr includes a PyTorch dataloader to jumpstart your ML training. 
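To give a flavor of what that can look like, here is a hand-rolled sketch of a PyTorch `Dataset` built only on the slicing API demonstrated in the next section. It is not CellArr's built-in dataloader (see the CellArr documentation for that); the `CellArrTorchDataset` class, the choice of the first 10 genes, and taking the cell count from `CELL_ANNOTATIONS_DF` are illustrative assumptions, and the sparse/dense guard is there because the exact return type of a one-cell slice may vary.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

from cellarr import CellArrDataset


class CellArrTorchDataset(Dataset):
    """Hypothetical wrapper that yields one cell's expression vector per item."""

    def __init__(self, collection, gene_list, num_cells):
        self.collection = collection
        self.gene_list = gene_list
        self.num_cells = num_cells

    def __len__(self):
        return self.num_cells

    def __getitem__(self, idx):
        # Slice a single cell for the chosen genes straight from the TileDB store.
        sl = self.collection[idx : idx + 1, self.gene_list]
        mat = sl.matrix["counts"]
        counts = np.asarray(mat.todense() if hasattr(mat, "todense") else mat).ravel()
        return torch.from_numpy(counts).float()


# Illustrative usage: the first 10 genes, batches of 32 cells.
collection = CellArrDataset("./my_collection.tdb")
genes = collection.get_gene_annotation_index()[:10]
ds = CellArrTorchDataset(collection, genes, num_cells=len(CELL_ANNOTATIONS_DF))
loader = DataLoader(ds, batch_size=32, shuffle=True)
print(next(iter(loader)).shape)
```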
```python
from cellarr import CellArrDataset
import tiledb

# Optional: customize TileDB configuration
cfg = tiledb.Config()

output_dir = "./my_collection.tdb"
my_collection = CellArrDataset(output_dir, config_or_context=cfg)
my_collection
```

Get a quick list of all available gene identifiers (Ensembl IDs, in our case):

```python
gene_symbols = my_collection.get_gene_annotation_index()

# Let's grab a random sample to play with
from random import sample

my_gene_list = sample(gene_symbols, 10)
print("Random genes to investigate:", my_gene_list)
```

### Subsetting: Get Only What You Need

The whole point of CellArr is to efficiently slice and dice your data. Instead of loading the entire count matrix, you can extract just the bits you care about.

Let's grab expression counts for our random gene set in the first 100 cells:

```python
expression_data = my_collection[0:100, my_gene_list]
print(expression_data)
```

The returned `CellArrDatasetSlice` contains everything you need:

1. Matrices (e.g., counts)
2. Cell annotations
3. Feature metadata

Convert this slice to your favorite format - `AnnData` for the Scanpy enthusiasts:

```python
print(expression_data.to_anndata())  # Scanpy-ready!
```

Or `SummarizedExperiment` for the Bioconductor/BiocPy ecosystem:

```python
print(expression_data.to_summarizedexperiment())
```

### Let's Make Some Plots!

Time to put our data to work with some visualizations. First, let's see what metadata we have to play with:

```python
my_collection.get_cell_metadata_columns()
```

Let's focus on specific diseases:

```python
disease_labels = my_collection.get_cell_metadata_column("disease")
disease_labels.value_counts()  # What diseases do we have?
```

Let's zoom in on Alzheimer's disease cells:

```python
import numpy as np

cells_of_interest = np.where(disease_labels == "Alzheimer disease")[0]
print(f"Found {len(cells_of_interest)} cells with Alzheimer's disease!")
```

Now, let's get expression data for our random gene list in these Alzheimer's cells:

```python
ad_gene_list = my_gene_list  # Using our random genes from earlier
ad_exprs = my_collection[cells_of_interest, ad_gene_list]
print(ad_exprs)  # Our Alzheimer's disease-specific dataset
```

```python
# Let's check out the metadata for these cells
ad_exprs.cell_metadata.head()
```

Finally, let's visualize gene expression across different cell types:

```python
import pandas as pd
import seaborn as sns

# Combine expression data with metadata
gene_with_meta = pd.concat(
    [
        pd.DataFrame(
            ad_exprs.matrix["counts"].todense(),
            columns=ad_gene_list,
        ).reset_index(drop=True),
        ad_exprs.cell_metadata.reset_index(drop=True),
    ],
    axis=1,
)

# Compute total expression per gene by cell_type
total_expression_by_cell_type = gene_with_meta[
    ["cell_type"] + ad_gene_list
].groupby("cell_type").sum()

sns.heatmap(total_expression_by_cell_type, cmap="crest")
```

Voilà! A beautiful heatmap showing gene expression levels across cell types.

## In Conclusion...

While this was a simple example, it showcases how CellArr lets you:

1. Quickly combine diverse datasets
2. Efficiently query specific subsets of your data
3. Analyze large collections

For more details, check out the [CellArr documentation](https://github.com/CellArr/cellarr) or reach out for help. Happy analyzing!