
Anndata
Teach your coding agent AnnData patterns so single-cell and matrix-backed Python pipelines use less RAM and avoid silent copy bugs.
Overview
AnnData is an agent skill for the Build phase that guides efficient Python use of AnnData objects—sparse matrices, categoricals, backed H5AD I/O, and safe views versus copies.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill anndata
What is this skill?
- Chooses csr/csc sparse X when sparsity exceeds ~50% for 10–100× memory wins on genomics-scale matrices
- Converts repeated obs/var strings to categoricals via strings_to_categoricals for 10–50× label storage savings
- Uses backed='r' h5ad reads and to_memory() only on filtered subsets to work beyond RAM
- Explains views vs copies and when mutations require .copy() to prevent accidental parent updates
- Sparse X when sparsity >50% targets 10–100× memory reduction versus dense storage
- Categorical obs/var strings target 10–50× memory reduction for repeated labels
- Backed read mode supports working with H5AD larger than available RAM
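As a rough check on the memory figures above, the sketch below estimates dense versus CSR storage from first principles. It assumes 4-byte values (float32) and 4-byte indices (int32); the helper names are illustrative and are not part of anndata or SciPy.

```python
def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Bytes needed to store every entry of a dense matrix."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, n_cols, sparsity, value_bytes=4, index_bytes=4):
    """Approximate bytes for a CSR matrix: data + column indices + row pointers."""
    nnz = round(n_rows * n_cols * (1.0 - sparsity))
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def memory_ratio(n_rows, n_cols, sparsity):
    """How many times smaller the CSR layout is than the dense layout."""
    return dense_bytes(n_rows, n_cols) / csr_bytes(n_rows, n_cols, sparsity)


# Assumed example shape: 10,000 cells x 20,000 genes
for sparsity in (0.5, 0.9, 0.95, 0.995):
    ratio = memory_ratio(10_000, 20_000, sparsity)
    print(f"sparsity {sparsity:.1%}: dense/CSR ratio ~ {ratio:.1f}x")
```

Under these assumptions the break-even point sits right at 50% sparsity, and savings of 10× or more only appear at roughly 95% sparsity and above, a range that single-cell count matrices often reach. Wider value types or narrower index types shift the numbers.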
Adoption & trust: 525 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are building single-cell or matrix-heavy Python pipelines but your agent loads huge dense arrays, wastes memory on string metadata, or mutates unintended AnnData views.
Who is it for?
Indie builders maintaining Python scRNA-seq or tabular-matrix tooling who want agents to write production-shaped AnnData code without a bioinformatics copilot on call.
Skip if: you only need high-level biology interpretation with no AnnData/Python implementation, or your project does not use the anndata library at all.
When should I use this skill?
Implementing or refactoring Python pipelines that create, subset, or persist AnnData objects for matrix-backed scientific data.
What do I get? / Deliverables
After the skill runs, generated code follows AnnData memory and I/O patterns so subsets stay correct and large datasets remain workable on typical solo-builder machines.
- AnnData construction and subsetting code using sparse, categorical, and backed patterns
- Refactored pipelines with documented view/copy safety
- Memory-conscious read paths for large on-disk matrices
Journey fit
AnnData workflows happen while you implement data-heavy scientific code, not during distribution or ops dashboards. Memory layout, sparse matrices, and backed H5AD reads are backend data-layer concerns for Python analysis stacks.
How it compares
Library-focused procedural guidance for AnnData—not a generic pandas cheat sheet or an MCP data connector.
Common Questions / FAQ
Who is anndata for?
Solo and indie developers shipping Python scientific or genomics pipelines who want coding agents to respect AnnData memory, sparse, and backed-file conventions.
When should I use anndata?
During Build backend work when implementing QC, filtering, feature matrices, or H5AD read/write paths—especially before scaling to datasets larger than available RAM.
Is anndata safe to install?
Review the Security Audits panel on this Prism page and the upstream skill repo before enabling it in agents with filesystem or shell access to sensitive datasets.
SKILL.md
# Best Practices

Guidelines for efficient and effective use of AnnData.

## Memory Management

### Use sparse matrices for sparse data

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix
import anndata as ad

# Check data sparsity
data = np.random.rand(1000, 2000)
sparsity = 1 - np.count_nonzero(data) / data.size
print(f"Sparsity: {sparsity:.2%}")

# Convert to sparse if >50% zeros (anndata 0.12+ requires csr or csc)
if sparsity > 0.5:
    adata = ad.AnnData(X=csr_matrix(data))
else:
    adata = ad.AnnData(X=data)

# Benefits: 10-100x memory reduction for sparse genomics data
```

### Convert strings to categoricals

```python
# Inefficient: string columns use lots of memory
adata.obs['cell_type'] = ['Type_A', 'Type_B', 'Type_C'] * 333 + ['Type_A']

# Efficient: convert to categorical
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')

# Convert all string columns
adata.strings_to_categoricals()

# Benefits: 10-50x memory reduction for repeated strings
```

### Use backed mode for large datasets

```python
# Don't load entire dataset into memory
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Work with metadata
filtered = adata[adata.obs['quality'] > 0.8]

# Load only filtered subset
adata_subset = filtered.to_memory()

# Benefits: Work with datasets larger than RAM
```

## Views vs Copies

### Understanding views

```python
# Subsetting creates a view by default
subset = adata[0:100, :]
print(subset.is_view)  # True

# Views don't copy data (memory efficient)
# But modifications can affect original

# Check if object is a view
if adata.is_view:
    adata = adata.copy()  # Make independent
```

### When to use views

```python
# Good: Read-only operations on subsets
mean_expr = adata[adata.obs['cell_type'] == 'T cell'].X.mean()

# Good: Temporary analysis (analyze() is a placeholder for your own function)
temp_subset = adata[:100, :]
result = analyze(temp_subset.X)
```

### When to use copies

```python
# Create independent copy for modifications
# (keep_cells and values are placeholders for your own mask and data)
adata_filtered = adata[keep_cells, :].copy()

# Safe to modify without affecting original
adata_filtered.obs['new_column'] = values

# Always copy when:
# - Storing subset for later use
# - Modifying subset data
# - Passing to function that modifies data
```

## Data Storage Best Practices

### Choose the right format

**H5AD (HDF5) - Default choice**

```python
adata.write_h5ad('data.h5ad', compression='gzip')
```

- Fast random access
- Supports backed mode
- Good compression
- Best for: Most use cases

**Zarr - Cloud and parallel access**

```python
import anndata

# Default is Zarr v2; opt into v3 for cloud workflows (anndata 0.12+)
anndata.settings.zarr_write_format = 3
anndata.settings.auto_shard_zarr_v3 = True
adata.write_zarr('data.zarr', chunks=(100, 100))
```

- Excellent for cloud storage (S3, GCS)
- Supports parallel I/O and Zarr v3 sharding (0.12+)
- Good compression
- Best for: Large datasets, cloud workflows, parallel processing

**CSV - Interoperability**

```python
adata.write_csvs('output_dir/')
```

- Human readable
- Compatible with all tools
- Large file sizes, slow
- Best for: Sharing with non-Python tools, small datasets

### Optimize file size

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse

# Before saving, optimize:

# 1. Convert to sparse if appropriate
if not issparse(adata.X):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:
        adata.X = csr_matrix(adata.X)

# 2. Convert strings to categoricals
adata.strings_to_categoricals()

# 3. Use compression
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)

# Typical results: 5-20x file size reduction
```

## Backed Mode Strategies

### Read-only analysis

```python
# Open in read-only backed mode
adata = ad.read_h5ad('data.h5ad', backed='r')

# Perform filtering without loading data
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only filtered data
adata_filtered = high_quality.to_memory()
```

### Read-write modifications

```python
# Open in read-writ