Anndata

Name: Anndata
Author: k-dense-ai

k-dense-ai/scientific-agent-skills

858 installs
32k repo stars
Updated July 29, 2026
k-dense-ai/scientific-agent-skills

anndata is a scientific agent skill that teaches memory-efficient AnnData patterns for large single-cell genomics datasets in Python when developers need sparse matrices, categoricals, and scalable agent workflows.

About

anndata in k-dense-ai/scientific-agent-skills documents best practices for Python AnnData objects used in single-cell genomics pipelines inside AI agents. The skill shows how to detect sparsity, convert dense arrays to scipy csr_matrix or csc_matrix when more than 50% of values are zero, and notes AnnData 0.12+ requirements for sparse dtypes. It also covers converting string metadata columns to categoricals to cut memory use, citing 10–100× reductions for sparse genomics data. Developers reach for anndata when agent-assisted notebooks or services must load large expression matrices without exhausting RAM or when naive dense conversions make downstream scanpy-style workflows unusable.

Use sparse matrices (csr/csc) for data with >50% zeros yielding 10-100x memory reduction
Convert string columns to categoricals for 10-50x memory savings on repeated labels
Leverage backed='r' mode to analyze h5ad files larger than available RAM
Distinguish views from copies to prevent unintended data duplication
Apply strings_to_categoricals() across all object columns automatically

Anndata by the numbers

858 all-time installs (skills.sh)
+40 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #332 of 2,065 Data Science & ML skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 29, 2026 (Skillselion catalog sync)

npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill anndata

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

[![Listed on Skillselion](https://skillselion.com/badge/skills/k-dense-ai/scientific-agent-skills/anndata.svg)](https://skillselion.com/skills/k-dense-ai/scientific-agent-skills/anndata)

Installs	858
repo stars	★ 32k
Security audit	3 / 3 scanners passed
Last updated	July 29, 2026
Repository	k-dense-ai/scientific-agent-skills ↗

How do you load large single-cell data memory-efficiently?

Follow memory-efficient patterns when working with large single-cell genomics datasets in Python-based AI agents.

Who is it for?

Python developers and bioinformatics engineers building agent or notebook pipelines over large single-cell AnnData objects.

Skip if: Non-genomics tabular ETL, small in-memory CSV analytics, or teams not using Python AnnData and scipy sparse matrices.

When should I use this skill?

The user works with AnnData, single-cell RNA-seq matrices, sparse genomics data, or RAM errors loading expression tables in Python.

What you get

Memory-tuned AnnData objects with sparse X matrices, categorical obs/var columns, and documented sparsity checks for genomics pipelines.

Memory-optimized AnnData object patterns
Sparsity check and conversion snippets
Categorical metadata configuration examples

By the numbers

10-100x memory reduction cited for sparse genomics data
Sparse conversion threshold above 50% zero values
AnnData 0.12+ sparse dtype requirement noted in examples

Files

SKILL.mdMarkdownGitHub ↗

AnnData

Overview

AnnData is a Python package for handling annotated data matrices, storing experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

When to Use This Skill

Use this skill when:

Creating, reading, or writing AnnData objects
Working with h5ad, zarr, or other genomics data formats
Performing single-cell RNA-seq analysis
Managing large datasets with sparse matrices or backed mode
Concatenating multiple datasets or experimental batches
Subsetting, filtering, or transforming annotated data
Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

Installation

Requires Python 3.11+. Current stable release: 0.12.16 (released 2026-05-18).

uv pip install "anndata==0.12.16"

# Lazy I/O and dask-backed operations
uv pip install "anndata[dask,lazy]==0.12.16"

# Development / docs (contributors)
uv pip install "anndata[dev,test,doc]==0.12.16"

Use unpinned installs only when intentionally tracking the latest compatible release.

Current API notes:

Use anndata.io for non-native read_* and write_* helpers. Top-level anndata.read_h5ad and anndata.read_zarr remain supported.
Avoid deprecated APIs: ad.read, AnnData.concatenate(), AnnData.*_keys(), and anndata.__version__. Prefer ad.read_h5ad, ad.concat, mapping .keys(), and importlib.metadata.version("anndata").
Treat anndata.experimental APIs as useful but unstable. Prefer them for large-data workflows only when their current caveats are acceptable.

Quick Start

Creating an AnnData object

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)

Reading data

# Native formats (read_h5ad/read_zarr remain at top-level)
adata = ad.read_h5ad('data.h5ad')
adata = ad.read_h5ad('large_data.h5ad', backed='r')  # lazy load for large files
adata = ad.read_zarr('data.zarr')

# Other formats: prefer anndata.io (top-level imports are deprecated)
from anndata.io import read_csv, read_loom, read_mtx

adata = read_csv('data.csv')
adata = read_loom('data.loom')

# 10X Genomics: use scanpy (not anndata) — see scanpy skill
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

Writing data

# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')

Basic operations

# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")

Core Capabilities

1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

See: references/data_structure.md for comprehensive information on:

Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
Creating AnnData objects from various sources
Accessing and manipulating data components
Memory-efficient practices

2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

See: references/io_operations.md for details on:

Native formats (h5ad, zarr)
Alternative formats (CSV, MTX, Loom, 10X, Excel)
Backed mode for large datasets
Remote data access
Format conversion
Performance optimization

Common commands:

from anndata.io import read_mtx

# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# 10X Genomics (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format
adata = read_mtx('matrix.mtx').T

3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

See: references/concatenation.md for comprehensive coverage of:

Basic concatenation (axis=0 for observations, axis=1 for variables)
Join types (inner, outer)
Merge strategies (same, unique, first, only)
Tracking data sources with labels
Lazy concatenation (AnnCollection)
On-disk concatenation for large datasets

Common commands:

# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy collection over backed AnnData objects (experimental)
from anndata.experimental import AnnCollection

backed_adatas = [
    ad.read_h5ad(path, backed='r')
    for path in ['data1.h5ad', 'data2.h5ad']
]
collection = AnnCollection(
    backed_adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset'
)

4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

See: references/manipulation.md for detailed guidance on:

Subsetting (by indices, names, boolean masks, metadata conditions)
Transposition
Copying (full copies vs views)
Renaming (observations, variables, categories)
Type conversions (strings to categoricals, sparse/dense)
Adding/removing data components
Reordering
Quality control filtering

Common commands:

# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]  # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()

5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

See: references/best_practices.md for guidelines on:

Memory management (sparse matrices, categoricals, backed mode)
Views vs copies
Data storage optimization
Performance optimization
Working with raw data
Metadata management
Reproducibility
Error handling
Integration with other tools
Common pitfalls and solutions

Key recommendations:

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]

Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

Scanpy (Single-cell analysis)

import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])

Muon (Multimodal data)

import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})

PyTorch integration

from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model

Common Workflows

Single-cell RNA-seq analysis

import anndata as ad
import scanpy as sc

# 1. Load data (10X via scanpy; anndata handles h5ad/zarr natively)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['n_counts'] = adata.X.sum(axis=1)
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')

Batch integration

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
import scanpy as sc
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Working with large datasets

# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (no data loading)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset
adata_subset = high_quality.to_memory()

# Process subset
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)

Troubleshooting

Out of memory errors

Use backed mode or convert to sparse matrices:

# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

Slow file reading

Use compression and appropriate formats:

# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage; v3 writes are opt-in in anndata 0.12
import anndata as ad

ad.settings.zarr_write_format = 3
ad.settings.auto_shard_zarr_v3 = True  # experimental; independent of zarr_write_format
adata.write_zarr('file.zarr', chunks=(1000, 1000))

Index alignment issues

Always align external data on index:

# Wrong
adata.obs['new_col'] = external_data['values']

# Correct
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']

Additional Resources

Official documentation: https://anndata.readthedocs.io/
Scanpy tutorials: https://scanpy.readthedocs.io/
Scverse ecosystem: https://scverse.org/
GitHub repository: https://github.com/scverse/anndata

Best Practices

Guidelines for efficient and effective use of AnnData.

Memory Management

Use sparse matrices for sparse data

import numpy as np
from scipy.sparse import csr_matrix, csc_matrix
import anndata as ad

# Check data sparsity
data = np.random.rand(1000, 2000)
sparsity = 1 - np.count_nonzero(data) / data.size
print(f"Sparsity: {sparsity:.2%}")

# Convert to sparse if >50% zeros (anndata 0.12+ requires csr or csc)
if sparsity > 0.5:
    adata = ad.AnnData(X=csr_matrix(data))
else:
    adata = ad.AnnData(X=data)

# Benefits: 10-100x memory reduction for sparse genomics data

Convert strings to categoricals

# Inefficient: string columns use lots of memory
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']

# Efficient: convert to categorical
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')

# Convert all string columns
adata.strings_to_categoricals()

# Benefits: 10-50x memory reduction for repeated strings

Use backed mode for large datasets

# Don't load entire dataset into memory
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Work with metadata
filtered = adata[adata.obs['quality'] > 0.8]

# Load only filtered subset
adata_subset = filtered.to_memory()

# Benefits: Work with datasets larger than RAM

Views vs Copies

Understanding views

# Subsetting creates a view by default
subset = adata[0:100, :]
print(subset.is_view)  # True

# Views don't copy data (memory efficient)
# But modifications can affect original

# Check if object is a view
if adata.is_view:
    adata = adata.copy()  # Make independent

When to use views

# Good: Read-only operations on subsets
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()

# Good: Temporary analysis
temp_subset = adata[:100, :]
result = analyze(temp_subset.X)

When to use copies

# Create independent copy for modifications
adata_filtered = adata[keep_cells, :].copy()

# Safe to modify without affecting original
adata_filtered.obs['new_column'] = values

# Always copy when:
# - Storing subset for later use
# - Modifying subset data
# - Passing to function that modifies data

Data Storage Best Practices

Choose the right format

H5AD (HDF5) - Default choice

adata.write_h5ad('data.h5ad', compression='gzip')

Fast random access
Supports backed mode
Good compression
Best for: Most use cases

Zarr - Cloud and parallel access

import anndata

# Default is Zarr v2; opt into v3 for cloud workflows (anndata 0.12+)
anndata.settings.zarr_write_format = 3
anndata.settings.auto_shard_zarr_v3 = True  # experimental; independent of zarr_write_format
adata.write_zarr('data.zarr', chunks=(100, 100))

Excellent for cloud storage (S3, GCS)
Supports parallel I/O and opt-in Zarr v3 sharding (0.12+)
Good compression
Best for: Large datasets, cloud workflows, parallel processing

CSV - Interoperability

adata.write_csvs('output_dir/')

Human readable
Compatible with all tools
Large file sizes, slow
Best for: Sharing with non-Python tools, small datasets

Optimize file size

# Before saving, optimize:

# 1. Convert to sparse if appropriate
from scipy.sparse import csr_matrix, issparse
if not issparse(adata.X):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:
        adata.X = csr_matrix(adata.X)

# 2. Convert strings to categoricals
adata.strings_to_categoricals()

# 3. Use compression
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)

# Typical results: 5-20x file size reduction

Backed Mode Strategies

Read-only analysis

# Open in read-only backed mode
adata = ad.read_h5ad('data.h5ad', backed='r')

# Perform filtering without loading data
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only filtered data
adata_filtered = high_quality.to_memory()

Read-write modifications

# Open in read-write backed mode
adata = ad.read_h5ad('data.h5ad', backed='r+')

# Modify X (persisted to disk)
adata.X[0, 0] = 0

# Metadata changes are not persisted from backed mode; load and write a new file
adata_memory = adata.to_memory()
adata_memory.obs['new_annotation'] = values
adata_memory.write_h5ad('data_with_annotations.h5ad')

Chunked processing

# Process large dataset in chunks
adata = ad.read_h5ad('huge_dataset.h5ad', backed='r')

results = []
chunk_size = 1000

for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    result = process(chunk)
    results.append(result)

final_result = combine(results)

Performance Optimization

Subsetting performance

# Fast: Boolean indexing with arrays
mask = np.array(adata.obs['quality'] > 0.5)
subset = adata[mask, :]

# Slow: Boolean indexing with Series (creates view chain)
subset = adata[adata.obs['quality'] > 0.5, :]

# Fastest: Integer indices
indices = np.where(adata.obs['quality'] > 0.5)[0]
subset = adata[indices, :]

Avoid repeated subsetting

# Inefficient: Multiple subset operations
for cell_type in ['A', 'B', 'C']:
    subset = adata[adata.obs['cell_type'] == cell_type]
    process(subset)

# Efficient: Group and process
groups = adata.obs.groupby('cell_type').groups
for cell_type, indices in groups.items():
    subset = adata[indices, :]
    process(subset)

Use chunked operations for large matrices

# Process X in chunks
for chunk in adata.chunked_X(chunk_size=1000):
    result = compute(chunk)

# More memory efficient than loading full X

Working with Raw Data

Store raw before filtering

# Original data with all genes
adata = ad.AnnData(X=counts)

# Store raw before filtering
adata.raw = adata.copy()

# Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']]

# Later: access original data
original_expression = adata.raw.X
all_genes = adata.raw.var_names

When to use raw

# Use raw for:
# - Differential expression on filtered genes
# - Visualization of specific genes not in filtered set
# - Accessing original counts after normalization

# Access raw data
if adata.raw is not None:
    gene_expr = adata.raw[:, 'GENE_NAME'].X
else:
    gene_expr = adata[:, 'GENE_NAME'].X

Metadata Management

Naming conventions

# Consistent naming improves usability

# Observation metadata (obs):
# - cell_id, sample_id
# - cell_type, tissue, condition
# - n_genes, n_counts, percent_mito
# - cluster, leiden, louvain

# Variable metadata (var):
# - gene_id, gene_name
# - highly_variable, n_cells
# - mean_expression, dispersion

# Embeddings (obsm):
# - X_pca, X_umap, X_tsne
# - X_diffmap, X_draw_graph_fr

# Follow conventions from scanpy/scverse ecosystem

Document metadata

# Store metadata descriptions in uns
adata.uns['metadata_descriptions'] = {
    'cell_type': 'Cell type annotation from automated clustering',
    'quality_score': 'QC score from scrublet (0-1, higher is better)',
    'batch': 'Experimental batch identifier'
}

# Store processing history
adata.uns['processing_steps'] = [
    'Raw counts loaded from 10X',
    'Filtered: n_genes > 200, n_counts < 50000',
    'Normalized to 10000 counts per cell',
    'Log transformed'
]

Reproducibility

Set random seeds

import numpy as np

# Set seed for reproducible results
np.random.seed(42)

# Document in uns
adata.uns['random_seed'] = 42

Store parameters

# Store analysis parameters in uns
adata.uns['pca'] = {
    'n_comps': 50,
    'svd_solver': 'arpack',
    'random_state': 42
}

adata.uns['neighbors'] = {
    'n_neighbors': 15,
    'n_pcs': 50,
    'metric': 'euclidean',
    'method': 'umap'
}

Version tracking

import sys
from importlib.metadata import version

# Store package versions (anndata.__version__ deprecated in 0.12.3)
adata.uns['versions'] = {
    'anndata': version('anndata'),
    'scanpy': version('scanpy'),
    'numpy': version('numpy'),
    'python': sys.version,
}

Error Handling

Check data validity

# Verify dimensions
assert adata.n_obs == len(adata.obs)
assert adata.n_vars == len(adata.var)
assert adata.X.shape == (adata.n_obs, adata.n_vars)

# Check for NaN values
has_nan = np.isnan(adata.X.data).any() if issparse(adata.X) else np.isnan(adata.X).any()
if has_nan:
    print("Warning: Data contains NaN values")

# Check for negative values (if counts expected)
has_negative = (adata.X.data < 0).any() if issparse(adata.X) else (adata.X < 0).any()
if has_negative:
    print("Warning: Data contains negative values")

Validate metadata

# Check for missing values
missing_obs = adata.obs.isnull().sum()
if missing_obs.any():
    print("Missing values in obs:")
    print(missing_obs[missing_obs > 0])

# Verify indices are unique
assert adata.obs_names.is_unique, "Observation names not unique"
assert adata.var_names.is_unique, "Variable names not unique"

# Check metadata alignment
assert len(adata.obs) == adata.n_obs
assert len(adata.var) == adata.n_vars

Integration with Other Tools

Scanpy integration

import scanpy as sc

# AnnData is native format for scanpy
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

Pandas integration

import pandas as pd

# Convert to DataFrame
df = adata.to_df()

# Create from DataFrame
adata = ad.AnnData(df)

# Work with metadata as DataFrames
adata.obs = adata.obs.merge(external_metadata, left_index=True, right_index=True)

PyTorch integration

from anndata.experimental import AnnLoader

# Create PyTorch DataLoader
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

# Iterate in training loop
for batch in dataloader:
    X = batch.X
    # Train model on batch

Common Pitfalls

Pitfall 1: Modifying views

# Wrong: Modifying view can affect original
subset = adata[:100, :]
subset.X = new_data  # May modify adata.X!

# Correct: Copy before modifying
subset = adata[:100, :].copy()
subset.X = new_data  # Independent copy

Pitfall 2: Index misalignment

# Wrong: Assuming order matches
external_data = pd.read_csv('data.csv')
adata.obs['new_col'] = external_data['values']  # May misalign!

# Correct: Align on index
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']

Pitfall 3: Mixing sparse and dense

# Wrong: Converting sparse to dense uses huge memory
result = adata.X + 1  # Converts sparse to dense!

# Correct: Use sparse operations
from scipy.sparse import issparse
if issparse(adata.X):
    result = adata.X.copy()
    result.data += 1

Pitfall 4: Not handling views

# Wrong: Assuming subset is independent
subset = adata[mask, :]
del adata  # subset may become invalid!

# Correct: Copy when needed
subset = adata[mask, :].copy()
del adata  # subset remains valid

Pitfall 5: Ignoring memory constraints

# Wrong: Loading huge dataset into memory
adata = ad.read_h5ad('100GB_file.h5ad')  # OOM error!

# Correct: Use backed mode
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')
subset = adata[adata.obs['keep']].to_memory()

Workflow Example

Complete best-practices workflow:

import anndata as ad
import numpy as np
from scipy.sparse import csr_matrix, issparse

# 1. Load with backed mode if large
adata = ad.read_h5ad('data.h5ad', backed='r')

# 2. Quick metadata check without loading data
print(f"Dataset: {adata.n_obs} cells × {adata.n_vars} genes")

# 3. Filter based on metadata
high_quality = adata[adata.obs['quality_score'] > 0.8]

# 4. Load filtered subset to memory
adata = high_quality.to_memory()

# 5. Convert to optimal storage types (csr/csc sparse only since 0.12)
adata.strings_to_categoricals()
if not issparse(adata.X):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:
        adata.X = csr_matrix(adata.X)

# 6. Store raw before filtering genes
adata.raw = adata.copy()

# 7. Filter to highly variable genes
adata = adata[:, adata.var['highly_variable']].copy()

# 8. Document processing
adata.uns['processing'] = {
    'filtered': 'quality_score > 0.8',
    'n_hvg': adata.n_vars,
    'date': '2025-11-03'
}

# 9. Save optimized
adata.write_h5ad('processed.h5ad', compression='gzip')

Concatenating AnnData Objects

Combine multiple AnnData objects along either observations or variables axis.

Basic Concatenation

Concatenate along observations (stack cells/samples)

import anndata as ad
import numpy as np

# Create multiple AnnData objects
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata3 = ad.AnnData(X=np.random.rand(200, 50))

# Concatenate along observations (axis=0, default)
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)

print(adata_combined.shape)  # (450, 50)

Concatenate along variables (stack genes/features)

# Create objects with same observations, different variables
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(100, 30))
adata3 = ad.AnnData(X=np.random.rand(100, 70))

# Concatenate along variables (axis=1)
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)

print(adata_combined.shape)  # (100, 150)

Join Types

Inner join (intersection)

Keep only variables/observations present in all objects.

import pandas as pd

# Create objects with different variables
adata1 = ad.AnnData(
    X=np.random.rand(100, 50),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
)
adata2 = ad.AnnData(
    X=np.random.rand(150, 60),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
)

# Inner join: only genes 10-49 are kept (overlap)
adata_inner = ad.concat([adata1, adata2], join='inner')
print(adata_inner.n_vars)  # 40 genes (overlap)

Outer join (union)

Keep all variables/observations, filling missing values.

# Outer join: all genes are kept
adata_outer = ad.concat([adata1, adata2], join='outer')
print(adata_outer.n_vars)  # 70 genes (union)

# Missing values are filled with appropriate defaults:
# - 0 for sparse matrices
# - NaN for dense matrices

Fill values for outer joins

# Specify fill value for missing data
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)

Tracking Data Sources

Add batch labels

# Label which object each observation came from
adata_combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch',  # Column name for labels
    keys=['batch1', 'batch2', 'batch3']  # Labels for each object
)

print(adata_combined.obs['batch'].value_counts())
# batch1    100
# batch2    150
# batch3    200

Automatic batch labels

# If keys not provided, uses integer indices
adata_combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset'
)
# dataset column contains: 0, 1, 2

Merge Strategies

Control how metadata from different objects is combined using the merge parameter.

merge=None (default for observations)

Exclude metadata on non-concatenation axis.

# When concatenating observations, var metadata must match
adata1.var['gene_type'] = 'protein_coding'
adata2.var['gene_type'] = 'protein_coding'

# var is kept only if identical across all objects
adata_combined = ad.concat([adata1, adata2], merge=None)

merge='same'

Keep metadata that is identical across all objects.

adata1.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata2.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA'  # Different

# 'chromosome' is kept (same), 'type' is excluded (different)
adata_combined = ad.concat([adata1, adata2], merge='same')

merge='unique'

Keep metadata columns where each key has exactly one value.

adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]

# gene_id is kept (unique values for each key)
adata_combined = ad.concat([adata1, adata2], merge='unique')

merge='first'

Take values from the first object containing each key.

adata1.var['description'] = ['Desc1'] * 50
adata2.var['description'] = ['Desc2'] * 50

# Uses descriptions from adata1
adata_combined = ad.concat([adata1, adata2], merge='first')

merge='only'

Keep metadata that appears in only one object.

adata1.var['adata1_specific'] = [1] * 50
adata2.var['adata2_specific'] = [2] * 50

# Both metadata columns are kept
adata_combined = ad.concat([adata1, adata2], merge='only')

Handling Index Conflicts

Make indices unique

import pandas as pd

# Create objects with overlapping observation names
adata1 = ad.AnnData(
    X=np.random.rand(3, 10),
    obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
adata2 = ad.AnnData(
    X=np.random.rand(3, 10),
    obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)

# Make indices unique by appending batch keys
adata_combined = ad.concat(
    [adata1, adata2],
    label='batch',
    keys=['batch1', 'batch2'],
    index_unique='_'  # Separator for making indices unique
)

print(adata_combined.obs_names)
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
#  'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']

Concatenating Layers

# Objects with layers
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata1.layers['normalized'] = np.random.rand(100, 50)
adata1.layers['scaled'] = np.random.rand(100, 50)

adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata2.layers['normalized'] = np.random.rand(150, 50)
adata2.layers['scaled'] = np.random.rand(150, 50)

# Layers are concatenated automatically if present in all objects
adata_combined = ad.concat([adata1, adata2])

print(adata_combined.layers.keys())
# dict_keys(['normalized', 'scaled'])

Concatenating Multi-dimensional Annotations

obsm/varm

# Objects with embeddings
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

# obsm is concatenated along observation axis
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.obsm['X_pca'].shape)  # (250, 50)

obsp/varp (pairwise annotations)

from scipy.sparse import csr_matrix

# Pairwise matrices
adata1.obsp['connectivities'] = csr_matrix((100, 100))
adata2.obsp['connectivities'] = csr_matrix((150, 150))

# By default, obsp is NOT concatenated (set pairwise=True to include)
adata_combined = ad.concat([adata1, adata2])
# adata_combined.obsp is empty

# Include pairwise data (creates block diagonal matrix)
adata_combined = ad.concat([adata1, adata2], pairwise=True)
print(adata_combined.obsp['connectivities'].shape)  # (250, 250)

Concatenating uns (unstructured)

Unstructured metadata is merged recursively:

adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}

# Using merge='unique' for uns
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
# 'date' is kept (same value), 'batch' might be excluded (different values)

Lazy Concatenation (AnnCollection)

For very large datasets, use AnnCollection to lazily concatenate AnnData objects along the observation axis. This API is experimental; use backed AnnData objects when the inputs are stored in .h5ad files.

import anndata as ad
from anndata.experimental import AnnCollection

files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
backed_adatas = [ad.read_h5ad(path, backed='r') for path in files]

collection = AnnCollection(
    backed_adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)

# Access data lazily
print(collection.n_obs)  # Total observations
print(collection.obs.head())  # Metadata loaded, not X

# Convert to regular AnnData when needed (loads all data)
adata = collection.to_adata()

Working with AnnCollection

# Subset without loading data
subset = collection[collection.obs['cell_type'] == 'T cell']

# Iterate through datasets
for adata in collection:
    print(adata.shape)

# Access specific dataset
first_dataset = collection[0]

Concatenation on Disk

For datasets too large for memory, concatenate directly on disk:

import anndata as ad
from anndata.experimental import concat_on_disk

# Concatenate without loading into memory
concat_on_disk(
    ['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
    'combined.h5ad',
    join='outer'
)

# Load result in backed mode
adata = ad.read_h5ad('combined.h5ad', backed='r')

Common Concatenation Patterns

Combine technical replicates

# Multiple runs of the same samples
replicates = [adata_run1, adata_run2, adata_run3]
adata_combined = ad.concat(
    replicates,
    label='technical_replicate',
    keys=['rep1', 'rep2', 'rep3'],
    join='inner'  # Keep only genes measured in all runs
)

Combine batches from experiment

# Different experimental batches
batches = [adata_batch1, adata_batch2, adata_batch3]
adata_combined = ad.concat(
    batches,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='outer'  # Keep all genes
)

# Later: apply batch correction

Merge multi-modal data

# Different measurement modalities (e.g., RNA + protein)
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
adata_protein = ad.AnnData(X=np.random.rand(100, 50))

# Concatenate along variables to combine modalities
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)

# Add labels to distinguish modalities
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50

Best Practices

1. Check compatibility before concatenating

# Verify shapes are compatible
print([adata.n_vars for adata in [adata1, adata2, adata3]])

# Check variable names match
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])

2. Use appropriate join type

inner: When you need the same features across all samples (most stringent)
outer: When you want to preserve all features (most inclusive)

3. Track data sources Always use label and keys to track which observations came from which dataset.

4. Consider memory usage

For large datasets, use AnnCollection or concat_on_disk
Consider backed mode for the result

5. Handle batch effects Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:

# After concatenation, apply batch correction
import scanpy as sc
sc.pp.combat(adata_combined, key='batch')

6. Validate results

# Check dimensions
print(adata_combined.shape)

# Check batch distribution
print(adata_combined.obs['batch'].value_counts())

# Verify metadata integrity
print(adata_combined.var.head())
print(adata_combined.obs.head())

AnnData Object Structure

The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.

Core Components

X (Data Matrix)

The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.

import anndata as ad
import numpy as np

# Create with dense array
adata = ad.AnnData(X=np.random.rand(100, 2000))

# Create with sparse matrix (recommended for large, sparse data)
from scipy.sparse import csr_matrix
sparse_data = csr_matrix(np.random.rand(100, 2000))
adata = ad.AnnData(X=sparse_data)

Access data:

# Full matrix (caution with large datasets)
full_data = adata.X

# Single observation
obs_data = adata.X[0, :]

# Single variable across all observations
var_data = adata.X[:, 0]

obs (Observation Annotations)

DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.

import pandas as pd

# Create AnnData with observation metadata
obs_df = pd.DataFrame({
    'cell_type': ['T cell', 'B cell', 'Monocyte'],
    'treatment': ['control', 'treated', 'control'],
    'timepoint': [0, 24, 24]
}, index=['cell_1', 'cell_2', 'cell_3'])

adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)

# Access observation metadata
print(adata.obs['cell_type'])
print(adata.obs.loc['cell_1'])

var (Variable Annotations)

DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.

# Create AnnData with variable metadata
var_df = pd.DataFrame({
    'gene_name': ['ACTB', 'GAPDH', 'TP53'],
    'chromosome': ['7', '12', '17'],
    'highly_variable': [True, False, True]
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])

adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)

# Access variable metadata
print(adata.var['gene_name'])
print(adata.var.loc['ENSG00001'])

layers (Alternative Data Representations)

Dictionary storing alternative matrices with the same dimensions as X.

# Store raw counts, normalized data, and scaled data
adata = ad.AnnData(X=np.random.rand(100, 2000))
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()

# Access layers
raw_data = adata.layers['raw_counts']
normalized_data = adata.layers['normalized']

Common layer uses:

raw_counts: Original count data before normalization
normalized: Log-normalized or TPM values
scaled: Z-scored values for analysis
imputed: Data after imputation

obsm (Multi-dimensional Observation Annotations)

Dictionary storing multi-dimensional arrays aligned to observations.

# Store PCA coordinates and UMAP embeddings
adata.obsm['X_pca'] = np.random.rand(100, 50)  # 50 principal components
adata.obsm['X_umap'] = np.random.rand(100, 2)  # 2D UMAP coordinates
adata.obsm['X_tsne'] = np.random.rand(100, 2)  # 2D t-SNE coordinates

# Access embeddings
pca_coords = adata.obsm['X_pca']
umap_coords = adata.obsm['X_umap']

Common obsm uses:

X_pca: Principal component coordinates
X_umap: UMAP embedding coordinates
X_tsne: t-SNE embedding coordinates
X_diffmap: Diffusion map coordinates
protein_expression: Protein abundance measurements (CITE-seq)

varm (Multi-dimensional Variable Annotations)

Dictionary storing multi-dimensional arrays aligned to variables.

# Store PCA loadings
adata.varm['PCs'] = np.random.rand(2000, 50)  # Loadings for 50 components
adata.varm['gene_modules'] = np.random.rand(2000, 10)  # Gene module scores

# Access loadings
pc_loadings = adata.varm['PCs']

Common varm uses:

PCs: Principal component loadings
gene_modules: Gene co-expression module assignments

obsp (Pairwise Observation Relationships)

Dictionary storing sparse matrices representing relationships between observations.

from scipy.sparse import csr_matrix

# Store k-nearest neighbor graph
n_obs = 100
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))

# Access graphs
knn_connections = adata.obsp['connectivities']
distances = adata.obsp['distances']

Common obsp uses:

connectivities: Cell-cell neighborhood graph
distances: Pairwise distances between cells

varp (Pairwise Variable Relationships)

Dictionary storing sparse matrices representing relationships between variables.

# Store gene-gene correlation matrix
n_vars = 2000
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
adata.varp['correlations'] = gene_corr

# Access correlations
gene_correlations = adata.varp['correlations']

uns (Unstructured Annotations)

Dictionary storing arbitrary unstructured metadata.

# Store analysis parameters and results
adata.uns['experiment_date'] = '2025-11-03'
adata.uns['pca'] = {
    'variance_ratio': [0.15, 0.10, 0.08],
    'params': {'n_comps': 50}
}
adata.uns['neighbors'] = {
    'params': {'n_neighbors': 15, 'method': 'umap'},
    'connectivities_key': 'connectivities'
}

# Access unstructured data
exp_date = adata.uns['experiment_date']
pca_params = adata.uns['pca']['params']

Common uns uses:

Analysis parameters and settings
Color palettes for plotting
Cluster information
Tool-specific metadata

raw (Original Data Snapshot)

Optional attribute preserving the original data matrix and variable annotations before filtering.

# Create AnnData and store raw state
adata = ad.AnnData(X=np.random.rand(100, 5000))
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]

# Store raw state before filtering
adata.raw = adata.copy()

# Filter to highly variable genes
highly_variable_mask = np.random.rand(5000) > 0.5
adata = adata[:, highly_variable_mask]

# Access original data
original_matrix = adata.raw.X
original_var = adata.raw.var

Object Properties

# Dimensions
n_observations = adata.n_obs
n_variables = adata.n_vars
shape = adata.shape  # (n_obs, n_vars)

# Index information
obs_names = adata.obs_names  # Observation identifiers
var_names = adata.var_names  # Variable identifiers

# Storage mode
is_view = adata.is_view  # True if this is a view of another object
is_backed = adata.isbacked  # True if backed by on-disk storage
filename = adata.filename  # Path to backing file (if backed)

Creating AnnData Objects

From arrays and DataFrames

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)

# With all components
adata = ad.AnnData(
    X=X,
    obs=obs,
    var=var,
    layers={'raw': np.random.randint(0, 100, (100, 2000))},
    obsm={'X_pca': np.random.rand(100, 50)},
    uns={'experiment': 'test'}
)

From DataFrame

# Create from pandas DataFrame (genes as columns, cells as rows)
df = pd.DataFrame(
    np.random.rand(100, 50),
    columns=[f'Gene_{i}' for i in range(50)],
    index=[f'Cell_{i}' for i in range(100)]
)
adata = ad.AnnData(df)

Data Access Patterns

Vector extraction

# Get observation annotation as array
cell_types = adata.obs_vector('cell_type')

# Get variable values across observations
gene_expression = adata.obs_vector('ACTB')  # If ACTB is in var_names

# Get variable annotation as array
gene_names = adata.var_vector('gene_name')

Subsetting

# By index
subset = adata[0:10, 0:100]  # First 10 obs, first 100 vars

# By name
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]

# By boolean mask
high_count_cells = adata.obs['total_counts'] > 1000
subset = adata[high_count_cells, :]

# By observation metadata
t_cells = adata[adata.obs['cell_type'] == 'T cell']

Memory Considerations

The AnnData structure is designed for memory efficiency:

Sparse matrices reduce memory for sparse data
Views avoid copying data when possible
Backed mode enables working with data larger than RAM
Categorical annotations reduce memory for discrete values

# Convert strings to categoricals (more memory efficient)
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
adata.strings_to_categoricals()

# Check if object is a view (doesn't own data)
if adata.is_view:
    adata = adata.copy()  # Create independent copy

Input/Output Operations

AnnData provides comprehensive I/O functionality for reading and writing data in various formats.

Since anndata 0.11, most read_* and write_* functions live in anndata.io. Top-level read_h5ad and read_zarr remain at anndata without deprecation warnings; other top-level imports still work but emit FutureWarning.

import anndata as ad
from anndata.io import read_csv, read_mtx, read_loom, read_elem, write_elem

Avoid deprecated I/O aliases such as ad.read; use ad.read_h5ad or anndata.io.read_h5ad explicitly.

Native Formats

H5AD (HDF5-based)

The recommended native format for AnnData objects, providing efficient storage and fast access.

Writing H5AD files

import anndata as ad

# Write to file
adata.write_h5ad('data.h5ad')

# Write with compression
adata.write_h5ad('data.h5ad', compression='gzip')

# Write with specific compression level (0-9, higher = more compression)
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)

Reading H5AD files

# Read entire file into memory
adata = ad.read_h5ad('data.h5ad')

# Read in backed mode (lazy loading for large files)
adata = ad.read_h5ad('data.h5ad', backed='r')  # Read-only
adata = ad.read_h5ad('data.h5ad', backed='r+')  # Read-write for X

# Backed mode enables working with datasets larger than RAM
# Only accessed data is loaded into memory
# In backed mode, only X updates are persisted; write a new file for obs/var/uns changes.

Backed mode operations

# Open in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Access metadata without loading X into memory
print(adata.obs.head())
print(adata.var.head())

# Subset operations create views
subset = adata[:100, :500]  # View, no data loaded

# Load specific data into memory
X_subset = subset.X[:]  # Now loads this subset

# Convert entire backed object to memory
adata_memory = adata.to_memory()

Zarr

Hierarchical array storage format, optimized for cloud storage and parallel I/O.

Writing Zarr

# Write to Zarr store
adata.write_zarr('data.zarr')

# Write with specific chunks (important for performance)
adata.write_zarr('data.zarr', chunks=(100, 100))

Reading Zarr

# Read Zarr store
adata = ad.read_zarr('data.zarr')

Zarr v3 (anndata 0.12+)

import anndata

# Default writes Zarr v2; opt into v3 and optional auto-sharding
anndata.settings.zarr_write_format = 3
anndata.settings.auto_shard_zarr_v3 = True  # experimental; independent of zarr_write_format

adata.write_zarr('data.zarr', chunks=(1000, 1000))

Zarr v3 writing is available in anndata 0.12, with structured-array exceptions and evolving performance guidance. Consolidated metadata is recommended for remote Zarr stores.

Remote Zarr access

Only open remote stores from trusted, expected locations. Prefer allowlisted HTTPS/S3/GCS paths or signed URLs, and avoid asking an agent to fetch arbitrary user-supplied URLs.

import fsspec

# Access Zarr from an expected S3 location
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
adata = ad.read_zarr(store)

# Access Zarr from a trusted HTTPS location
store = fsspec.get_mapper('https://example.com/data.zarr')
adata = ad.read_zarr(store)

Alternative Input Formats

CSV/TSV

from anndata.io import read_csv

# Read CSV (genes as columns, cells as rows)
adata = read_csv('data.csv')

# Read with custom delimiter
adata = read_csv('data.tsv', delimiter='\t')

# Specify that first column is row names
adata = read_csv('data.csv', first_column_names=True)

Excel

from anndata.io import read_excel

# Read Excel file
adata = read_excel('data.xlsx')

# Read specific sheet
adata = read_excel('data.xlsx', sheet='Sheet1')

Matrix Market (MTX)

Common format for sparse matrices in genomics.

from anndata.io import read_mtx

# Read MTX with associated files
# Requires: matrix.mtx, genes.tsv, barcodes.tsv
adata = read_mtx('matrix.mtx')

# Read with custom gene and barcode files
adata = read_mtx(
    'matrix.mtx',
    var_names='genes.tsv',
    obs_names='barcodes.tsv'
)

# Transpose if needed (MTX often has genes as rows)
adata = adata.T

10X Genomics formats

10X readers are provided by scanpy, not anndata. After loading, the result is a standard AnnData object.

import scanpy as sc

# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Specify genome if multiple present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')

Loom

from anndata.io import read_loom

# Read Loom file
adata = read_loom('data.loom')

# Read with specific observation and variable annotations
adata = read_loom(
    'data.loom',
    obs_names='CellID',
    var_names='Gene'
)

Text files

from anndata.io import read_text

# Read generic text file
adata = read_text('data.txt', delimiter='\t')

# Read with custom parameters
adata = read_text(
    'data.txt',
    delimiter=',',
    first_column_names=True,
    dtype='float32'
)

UMI tools

from anndata.io import read_umi_tools

# Read UMI tools format
adata = read_umi_tools('counts.tsv')

HDF5 (generic)

from anndata.io import read_hdf

# Read from HDF5 file (not h5ad format)
adata = read_hdf('data.h5', key='dataset')

Alternative Output Formats

CSV

# Write to CSV files (creates multiple files)
adata.write_csvs('output_dir/')

# This creates:
# - output_dir/X.csv (expression matrix)
# - output_dir/obs.csv (observation annotations)
# - output_dir/var.csv (variable annotations)
# - output_dir/uns.csv (unstructured annotations, if possible)

# Skip certain components
adata.write_csvs('output_dir/', skip_data=True)  # Skip X matrix

Loom

# Write to Loom format
adata.write_loom('output.loom')

Reading Specific Elements

For fine-grained control, read specific elements from an open store:

import h5py
from anndata.io import read_elem

# Read just observation annotations from an h5ad file
with h5py.File('data.h5ad', 'r') as f:
    obs = read_elem(f['obs'])
    layer = read_elem(f['layers/normalized'])
    params = read_elem(f['uns/pca'])

Writing Specific Elements

from anndata.io import write_elem
import h5py

# Write element to existing file
with h5py.File('data.h5ad', 'a') as f:
    write_elem(f, 'new_layer', adata.X.copy())

Lazy Operations

For very large datasets, use lazy reading to avoid loading entire datasets. read_lazy is experimental and is designed for on-disk or in-cloud AnnData stores, including lazy obs and var access.

from anndata.experimental import read_lazy

adata = read_lazy('large_data.zarr')
print(adata.obs.head())  # Does not require loading X

For element-level control, use read_elem_lazy on an open store:

import h5py
from anndata.experimental import read_elem_lazy

# Lazy read from an open store (returns dask-backed array)
with h5py.File('large_data.h5ad', 'r') as f:
    X_lazy = read_elem_lazy(f['X'])
    subset = X_lazy[:100, :100].compute()

Common I/O Patterns

Convert between formats

from anndata.io import read_mtx, read_csv

# MTX to H5AD
adata = read_mtx('matrix.mtx').T
adata.write_h5ad('data.h5ad')

# CSV to H5AD
adata = read_csv('data.csv')
adata.write_h5ad('data.h5ad')

# H5AD to Zarr
adata = ad.read_h5ad('data.h5ad')
adata.write_zarr('data.zarr')

Load metadata without data

# Backed mode allows inspecting metadata without loading X
adata = ad.read_h5ad('large_file.h5ad', backed='r')
print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
print(adata.obs.columns)
print(adata.var.columns)
# X is not loaded into memory

Update backed data or write a new file

# Open in read-write mode for X updates
adata = ad.read_h5ad('data.h5ad', backed='r+')

# X updates can be persisted in backed mode
adata.X[0, 0] = 0

# Metadata changes are not persisted from backed mode; write a new file instead
adata_memory = adata.to_memory()
adata_memory.obs['new_column'] = values
adata_memory.write_h5ad('data_with_metadata.h5ad')

Download from a trusted URL

Validate remote sources before downloading. Prefer local files or vetted object-store paths over arbitrary URLs.

import anndata as ad
import urllib.request
from urllib.parse import urlparse

url = 'https://example.org/datasets/reference.h5ad'
parsed = urlparse(url)
trusted_hosts = {'example.org'}

if parsed.scheme != 'https' or parsed.netloc not in trusted_hosts:
    raise ValueError('Refusing to download from an untrusted host')

urllib.request.urlretrieve(url, 'reference.h5ad')
adata = ad.read_h5ad('reference.h5ad')

Performance Tips

Reading

Use backed='r' for large files you only need to query
Use backed='r+' only for X updates; write a new file for metadata changes
H5AD format is generally fastest for random access
Zarr is better for cloud storage and parallel access
Consider compression for storage, but note it may slow down reading

Writing

Use compression for long-term storage: compression='gzip' or compression='lzf'
LZF compression is faster but compresses less than GZIP
For Zarr, tune chunk sizes based on access patterns:
Larger chunks for sequential reads
Smaller chunks for random access
Convert string columns to categorical before writing (smaller files)

Memory management

# Convert strings to categoricals (reduces file size and memory)
adata.strings_to_categoricals()
adata.write_h5ad('data.h5ad')

# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
if isinstance(adata.X, np.ndarray):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:  # If more than 50% zeros
        adata.X = csr_matrix(adata.X)

Handling Large Datasets

Strategy 1: Backed mode

# Work with dataset larger than RAM
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')

# Filter based on metadata (fast, no data loading)
filtered = adata[adata.obs['quality_score'] > 0.8]

# Load filtered subset into memory
adata_memory = filtered.to_memory()

Strategy 2: Chunked processing

# Process data in chunks
adata = ad.read_h5ad('large_file.h5ad', backed='r')

chunk_size = 1000
results = []

for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    # Process chunk
    result = process(chunk)
    results.append(result)

Strategy 3: Use AnnCollection

import anndata as ad
from anndata.experimental import AnnCollection

# Create backed objects, then lazily concatenate along observations
adatas = [
    ad.read_h5ad(f'dataset_{i}.h5ad', backed='r')
    for i in range(10)
]
collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)

# Process collection lazily
# Data is loaded only when accessed

Common Issues and Solutions

Issue: Out of memory when reading

Solution: Use backed mode or read in chunks

adata = ad.read_h5ad('file.h5ad', backed='r')

Issue: Slow reading from cloud storage

Solution: Use Zarr format with appropriate chunking

adata.write_zarr('data.zarr', chunks=(1000, 1000))

Issue: Large file sizes

Solution: Use compression and convert to sparse/categorical

adata.strings_to_categoricals()
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
adata.write_h5ad('compressed.h5ad', compression='gzip')

Issue: Cannot modify backed metadata

Solution: Load to memory and write a new file. Backed mode only persists updates to X.

adata = adata.to_memory()
adata.obs['new_column'] = values
adata.write_h5ad('updated_file.h5ad')

Data Manipulation

Operations for transforming, subsetting, and manipulating AnnData objects.

Subsetting

By indices

import anndata as ad
import numpy as np

adata = ad.AnnData(X=np.random.rand(1000, 2000))

# Integer indices
subset = adata[0:100, 0:500]  # First 100 obs, first 500 vars

# List of indices
obs_indices = [0, 10, 20, 30, 40]
var_indices = [0, 1, 2, 3, 4]
subset = adata[obs_indices, var_indices]

# Single observation or variable
single_obs = adata[0, :]
single_var = adata[:, 0]

By names

import pandas as pd

# Create with named indices
obs_names = [f'cell_{i}' for i in range(1000)]
var_names = [f'gene_{i}' for i in range(2000)]
adata = ad.AnnData(
    X=np.random.rand(1000, 2000),
    obs=pd.DataFrame(index=obs_names),
    var=pd.DataFrame(index=var_names)
)

# Subset by observation names
subset = adata[['cell_0', 'cell_1', 'cell_2'], :]

# Subset by variable names
subset = adata[:, ['gene_0', 'gene_10', 'gene_20']]

# Both axes
subset = adata[['cell_0', 'cell_1'], ['gene_0', 'gene_1']]

By boolean masks

# Create boolean masks
high_count_obs = np.random.rand(1000) > 0.5
high_var_genes = np.random.rand(2000) > 0.7

# Subset using masks
subset = adata[high_count_obs, :]
subset = adata[:, high_var_genes]
subset = adata[high_count_obs, high_var_genes]

By metadata conditions

# Add metadata
adata.obs['cell_type'] = np.random.choice(['A', 'B', 'C'], 1000)
adata.obs['quality_score'] = np.random.rand(1000)
adata.var['highly_variable'] = np.random.rand(2000) > 0.8

# Filter by cell type
t_cells = adata[adata.obs['cell_type'] == 'A']

# Filter by multiple conditions
high_quality_a_cells = adata[
    (adata.obs['cell_type'] == 'A') &
    (adata.obs['quality_score'] > 0.7)
]

# Filter by variable metadata
hv_genes = adata[:, adata.var['highly_variable']]

# Complex conditions
filtered = adata[
    (adata.obs['quality_score'] > 0.5) &
    (adata.obs['cell_type'].isin(['A', 'B'])),
    adata.var['highly_variable']
]

Transposition

# Transpose AnnData object (swap observations and variables)
adata_T = adata.T

# Shape changes
print(adata.shape)    # (1000, 2000)
print(adata_T.shape)  # (2000, 1000)

# obs and var are swapped
print(adata.obs.head())   # Observation metadata
print(adata_T.var.head()) # Same data, now as variable metadata

# Useful when data is in opposite orientation
# Common with some file formats where genes are rows

Copying

Full copy

# Create independent copy
adata_copy = adata.copy()

# Modifications to copy don't affect original
adata_copy.obs['new_column'] = 1
print('new_column' in adata.obs.columns)  # False

Shallow copy

# View (doesn't copy data, modifications affect original)
adata_view = adata[0:100, :]

# Check if object is a view
print(adata_view.is_view)  # True

# Convert view to independent copy
adata_independent = adata_view.copy()
print(adata_independent.is_view)  # False

Renaming

Rename observations and variables

# Rename all observations
adata.obs_names = [f'new_cell_{i}' for i in range(adata.n_obs)]

# Rename all variables
adata.var_names = [f'new_gene_{i}' for i in range(adata.n_vars)]

# Make names unique (add suffix to duplicates)
adata.obs_names_make_unique()
adata.var_names_make_unique()

Rename categories

# Create categorical column
adata.obs['cell_type'] = pd.Categorical(['A', 'B', 'C'] * 333 + ['A'])

# Rename categories
adata.rename_categories('cell_type', ['Type_A', 'Type_B', 'Type_C'])

# Or using dictionary
adata.rename_categories('cell_type', {
    'Type_A': 'T_cell',
    'Type_B': 'B_cell',
    'Type_C': 'Monocyte'
})

Type Conversions

Strings to categoricals

# Convert string columns to categorical (more memory efficient)
adata.obs['cell_type'] = ['TypeA', 'TypeB'] * 500
adata.obs['tissue'] = ['brain', 'liver'] * 500

# Convert all string columns to categorical
adata.strings_to_categoricals()

print(adata.obs['cell_type'].dtype)  # category
print(adata.obs['tissue'].dtype)     # category

Sparse to dense and vice versa

from scipy.sparse import csr_matrix, issparse

# Dense to sparse
if not issparse(adata.X):
    adata.X = csr_matrix(adata.X)

# Sparse to dense
if issparse(adata.X):
    adata.X = adata.X.toarray()

# Convert layer
adata.layers['normalized'] = csr_matrix(adata.layers['normalized'])

Chunked Operations

Process large datasets in chunks:

# Iterate through data in chunks
chunk_size = 100
for chunk in adata.chunked_X(chunk_size):
    # Process chunk
    result = process_chunk(chunk)

Extracting Vectors

Get observation vectors

# Get observation metadata as array
cell_types = adata.obs_vector('cell_type')

# Get gene expression across observations
actb_expression = adata.obs_vector('ACTB')  # If ACTB in var_names

Get variable vectors

# Get variable metadata as array
gene_names = adata.var_vector('gene_name')

Adding/Modifying Data

Add observations

# Create new observations
new_obs = ad.AnnData(X=np.random.rand(100, adata.n_vars))
new_obs.var_names = adata.var_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_obs], axis=0)

Add variables

# Create new variables
new_vars = ad.AnnData(X=np.random.rand(adata.n_obs, 100))
new_vars.obs_names = adata.obs_names

# Concatenate with existing
adata_extended = ad.concat([adata, new_vars], axis=1)

Add metadata columns

# Add observation annotation
adata.obs['new_score'] = np.random.rand(adata.n_obs)

# Add variable annotation
adata.var['new_label'] = ['label'] * adata.n_vars

# Add from external data
external_data = pd.read_csv('metadata.csv', index_col=0)
adata.obs['external_info'] = external_data.loc[adata.obs_names, 'column']

Add layers

# Add new layer
adata.layers['raw_counts'] = np.random.randint(0, 100, adata.shape)
adata.layers['log_transformed'] = np.log1p(adata.X)

# Replace layer
adata.layers['normalized'] = new_normalized_data

Add embeddings

# Add PCA
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 50)

# Add UMAP
adata.obsm['X_umap'] = np.random.rand(adata.n_obs, 2)

# Add multiple embeddings
adata.obsm['X_tsne'] = np.random.rand(adata.n_obs, 2)
adata.obsm['X_diffmap'] = np.random.rand(adata.n_obs, 10)

Add pairwise relationships

from scipy.sparse import csr_matrix

# Add nearest neighbor graph
n_obs = adata.n_obs
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph

# Add distance matrix
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))

Add unstructured data

# Add analysis parameters
adata.uns['pca'] = {
    'variance': [0.2, 0.15, 0.1],
    'variance_ratio': [0.4, 0.3, 0.2],
    'params': {'n_comps': 50}
}

# Add color schemes
adata.uns['cell_type_colors'] = ['#FF0000', '#00FF00', '#0000FF']

Removing Data

Remove observations or variables

# Keep only specific observations
keep_obs = adata.obs['quality_score'] > 0.5
adata = adata[keep_obs, :]

# Remove specific variables
remove_vars = adata.var['low_count']
adata = adata[:, ~remove_vars]

Remove metadata columns

# Remove observation column
adata.obs.drop('unwanted_column', axis=1, inplace=True)

# Remove variable column
adata.var.drop('unwanted_column', axis=1, inplace=True)

Remove layers

# Remove specific layer
del adata.layers['unwanted_layer']

# Remove all layers
adata.layers = {}

Remove embeddings

# Remove specific embedding
del adata.obsm['X_tsne']

# Remove all embeddings
adata.obsm = {}

Remove unstructured data

# Remove specific key
del adata.uns['unwanted_key']

# Remove all unstructured data
adata.uns = {}

Reordering

Sort observations

# Sort by observation metadata
adata = adata[adata.obs.sort_values('quality_score').index, :]

# Sort by observation names
adata = adata[sorted(adata.obs_names), :]

Sort variables

# Sort by variable metadata
adata = adata[:, adata.var.sort_values('gene_name').index]

# Sort by variable names
adata = adata[:, sorted(adata.var_names)]

Reorder to match external list

# Reorder observations to match external list
desired_order = ['cell_10', 'cell_5', 'cell_20', ...]
adata = adata[desired_order, :]

# Reorder variables
desired_genes = ['TP53', 'ACTB', 'GAPDH', ...]
adata = adata[:, desired_genes]

Data Transformations

Normalize

# Total count normalization (CPM/TPM-like)
total_counts = adata.X.sum(axis=1)
adata.layers['normalized'] = adata.X / total_counts[:, np.newaxis] * 1e6

# Log transformation
adata.layers['log1p'] = np.log1p(adata.X)

# Z-score normalization
mean = adata.X.mean(axis=0)
std = adata.X.std(axis=0)
adata.layers['scaled'] = (adata.X - mean) / std

Filter

# Filter cells by total counts
total_counts = np.array(adata.X.sum(axis=1)).flatten()
adata.obs['total_counts'] = total_counts
adata = adata[adata.obs['total_counts'] > 1000, :]

# Filter genes by detection rate
detection_rate = (adata.X > 0).sum(axis=0) / adata.n_obs
adata.var['detection_rate'] = np.array(detection_rate).flatten()
adata = adata[:, adata.var['detection_rate'] > 0.01]

Working with Views

Views are lightweight references to subsets of data that don't copy the underlying matrix:

# Create view
view = adata[0:100, 0:500]
print(view.is_view)  # True

# Views allow read access
data = view.X

# Modifying view data affects original
# (Be careful!)

# Convert view to independent copy
independent = view.copy()

# Force AnnData to be a copy, not a view
adata = adata.copy()

Merging Metadata

# Merge external metadata
external_metadata = pd.read_csv('additional_metadata.csv', index_col=0)

# Join metadata (inner join on index)
adata.obs = adata.obs.join(external_metadata)

# Left join (keep all adata observations)
adata.obs = adata.obs.merge(
    external_metadata,
    left_index=True,
    right_index=True,
    how='left'
)

Common Manipulation Patterns

Quality control filtering

# Calculate QC metrics
adata.obs['n_genes'] = (adata.X > 0).sum(axis=1)
adata.obs['total_counts'] = adata.X.sum(axis=1)
adata.var['n_cells'] = (adata.X > 0).sum(axis=0)

# Filter low-quality cells
adata = adata[adata.obs['n_genes'] > 200, :]
adata = adata[adata.obs['total_counts'] < 50000, :]

# Filter rarely detected genes
adata = adata[:, adata.var['n_cells'] >= 3]

Select highly variable genes

# Mark highly variable genes
gene_variance = np.var(adata.X, axis=0)
adata.var['variance'] = np.array(gene_variance).flatten()
adata.var['highly_variable'] = adata.var['variance'] > np.percentile(gene_variance, 90)

# Subset to highly variable genes
adata_hvg = adata[:, adata.var['highly_variable']].copy()

Downsample

# Random sampling of observations
np.random.seed(42)
n_sample = 500
sample_indices = np.random.choice(adata.n_obs, n_sample, replace=False)
adata_downsampled = adata[sample_indices, :].copy()

# Stratified sampling by cell type
from sklearn.model_selection import train_test_split
train_idx, test_idx = train_test_split(
    range(adata.n_obs),
    test_size=0.2,
    stratify=adata.obs['cell_type']
)
adata_train = adata[train_idx, :].copy()
adata_test = adata[test_idx, :].copy()

Split train/test

# Random train/test split
np.random.seed(42)
n_obs = adata.n_obs
train_size = int(0.8 * n_obs)
indices = np.random.permutation(n_obs)
train_indices = indices[:train_size]
test_indices = indices[train_size:]

adata_train = adata[train_indices, :].copy()
adata_test = adata[test_indices, :].copy()

Related skills

Microsoft FoundryDeploy, evaluate, and continuously improve Microsoft Foundry agents from a single agent interface.478k1.3k

Ai Research ReproductionOrchestrate trustworthy, auditable reproduction of deep learning repositories directly from their READMEs.164k507

Run TrainSafely execute selected deep learning training commands with standardized evidence capture.164k507

Explore RunSafely run isolated exploratory experiments with clear recording and conservative selection before committing changes.164k507

Paper Context ResolverFetch precise reproduction-critical details like dataset splits, preprocessing steps, or evaluation protocols from the original academic paper when the repo README leav141k507

Repo Intake And PlanScan unfamiliar AI research repositories and receive a minimal, trustworthy reproduction target before investing significant time.140k507

How it compares

Use anndata for AnnData-specific genomics memory patterns; use generic pandas skills for non-genomic tabular data without single-cell matrix structure.

FAQ

When should AnnData use sparse matrices according to the anndata skill?

The anndata skill recommends converting to scipy csr_matrix or csc_matrix when more than 50% of values are zero, yielding 10-100x memory savings on sparse genomics data. AnnData 0.12+ expects compatible sparse dtypes for the X matrix.

How does the anndata skill reduce metadata memory?

The anndata skill advises converting string columns in obs and var to pandas categoricals instead of plain strings. That pattern lowers memory footprint when agents load large single-cell metadata alongside expression matrices.

Is Anndata safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Data Science & MLagentsautomation