
Scanpy
Execute an end-to-end single-cell RNA-seq workflow in Python with scanpy from QC through clustering.
Overview
Scanpy is an agent skill for the Build phase that runs a configurable single-cell RNA-seq analysis workflow using scanpy from data loading through clustering.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill scanpy
What is this skill?
- End-to-end template: load h5ad, QC filters, HVG selection, PCA, neighbors, Leiden clustering.
- Configurable QC gates (min genes/cells, mitochondrial threshold) and analysis knobs (PCs, resolution).
- Sets scanpy verbosity, figure DPI, autosave, and dedicated results/figures directories.
- Uses standard stack: scanpy, pandas, numpy, matplotlib.
- Structured sections from configuration through load-data for copy-paste customization.
- Default N_TOP_GENES = 2000, N_PCS = 40, N_NEIGHBORS = 10, LEIDEN_RESOLUTION = 0.5 in the template configuration block.
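One way to keep these knobs explicit and rerunnable is to collect them in a single validated object. The sketch below is an illustration only: `PipelineConfig` and `validate` are hypothetical names, not part of the skill or of scanpy, and the validation rules are assumptions. The default values mirror the template's configuration block.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical container for the template's tunable parameters."""
    min_genes: int = 200            # minimum genes per cell
    min_cells: int = 3              # minimum cells per gene
    mt_threshold: float = 5.0       # maximum mitochondrial percentage
    n_top_genes: int = 2000         # number of highly variable genes
    n_pcs: int = 40                 # number of principal components
    n_neighbors: int = 10           # neighbors for the graph
    leiden_resolution: float = 0.5  # clustering resolution

    def validate(self) -> None:
        """Raise ValueError on settings that cannot be meaningful."""
        if not 0 < self.mt_threshold <= 100:
            raise ValueError("mt_threshold is a percentage in (0, 100]")
        for name in ("min_genes", "min_cells", "n_top_genes", "n_pcs", "n_neighbors"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be a positive integer")
        if self.leiden_resolution <= 0:
            raise ValueError("leiden_resolution must be positive")


default_config = PipelineConfig()
default_config.validate()

# Override only what differs for a given dataset; everything else keeps its default.
strict_config = PipelineConfig(mt_threshold=2.5, leiden_resolution=1.0)
strict_config.validate()
print(asdict(strict_config))
```

Freezing the dataclass means a run's parameters cannot be changed halfway through, which makes it easier to record exactly what produced a given set of clusters.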
Adoption & trust: 521 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have single-cell count data but lack a standardized, parameter-driven scanpy pipeline you can rerun and hand to an agent to extend.
Who is it for?
Indie scientists and ML builders prototyping scRNA-seq pipelines in Python with h5ad inputs.
Skip if: you only need bulk RNA-seq, want production orchestration (Nextflow/Snakemake) without editing this template, or prefer GUI-only exploration.
When should I use this skill?
You need to load single-cell data and run QC through clustering with scanpy using explicit, editable parameters.
What do I get? / Deliverables
You get an executed workflow with QC-filtered cells, HVG-based embedding, neighbor graph, and Leiden clusters saved under your results and figures paths.
- Processed AnnData checkpoint through clustering steps
- Saved figures under configured FIGURES_DIR
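The template shown on this page is cut off before the clustering and saving steps, so the following is only a sketch of how they might look, not the skill's actual code. `prepare_output_dirs`, `checkpoint_path`, `cluster_and_save`, and the file name `clustered.h5ad` are hypothetical. The scanpy calls (`sc.pp.pca`, `sc.pp.neighbors`, `sc.tl.leiden`) and `AnnData.write_h5ad` are standard, and scanpy is imported inside the function so the path helpers work without it.

```python
from pathlib import Path


def prepare_output_dirs(output_dir="results/", figures_dir="figures/"):
    """Create the results and figures directories if they do not exist."""
    paths = (Path(output_dir), Path(figures_dir))
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths


def checkpoint_path(output_dir, name="clustered.h5ad"):
    """Build the path of the processed AnnData checkpoint (hypothetical file name)."""
    return Path(output_dir) / name


def cluster_and_save(adata, output_dir="results/", n_pcs=40, n_neighbors=10, resolution=0.5):
    """Run PCA, build the neighbor graph, cluster with Leiden, then write to disk.

    Assumes `adata` is already QC-filtered, normalized, and log-transformed.
    """
    import scanpy as sc  # lazy import; needs scanpy plus a Leiden backend installed

    sc.pp.pca(adata, n_comps=n_pcs)
    sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=n_pcs)
    sc.tl.leiden(adata, resolution=resolution)  # labels are stored in adata.obs['leiden']

    out = checkpoint_path(output_dir)
    adata.write_h5ad(out)
    return out
```

With the template's names, a call might read `cluster_and_save(adata, OUTPUT_DIR, N_PCS, N_NEIGHBORS, LEIDEN_RESOLUTION)` after `prepare_output_dirs(OUTPUT_DIR, FIGURES_DIR)`.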
How it compares
A procedural analysis template skill, not a hosted notebook service or auto-annotation MCP server.
Common Questions / FAQ
Who is scanpy for?
Solo builders and small labs doing single-cell genomics in Python who want an agent-guided scanpy workflow template.
When should I use scanpy?
During Build when implementing or rerunning a scRNA-seq pipeline on h5ad data, tuning QC and clustering parameters before downstream cell-type annotation.
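To see what tuning the QC gates means in practice, here is a toy illustration in plain Python. It mimics the two cell-level thresholds (minimum genes per cell and maximum mitochondrial percentage) on a hand-made table. It is not how scanpy implements filtering, and the cell names, counts, and the `passes_qc` and `kept_cells` helpers are made up for the example.

```python
# Toy per-cell summaries: (genes detected, total counts, mitochondrial counts).
cells = {
    "cell_a": (1500, 6000, 120),  # 2.0% mitochondrial
    "cell_b": (150, 400, 8),      # too few genes detected
    "cell_c": (900, 3000, 450),   # 15.0% mitochondrial
    "cell_d": (2200, 9000, 405),  # 4.5% mitochondrial
}


def passes_qc(n_genes, total_counts, mt_counts, min_genes=200, mt_threshold=5):
    """Return True if a cell clears the min-genes and mitochondrial gates."""
    pct_mt = 100.0 * mt_counts / total_counts
    return n_genes >= min_genes and pct_mt < mt_threshold


def kept_cells(min_genes=200, mt_threshold=5):
    """Names of the toy cells that survive the given thresholds."""
    return sorted(
        name
        for name, stats in cells.items()
        if passes_qc(*stats, min_genes=min_genes, mt_threshold=mt_threshold)
    )


print(kept_cells())                 # ['cell_a', 'cell_d']
print(kept_cells(mt_threshold=20))  # ['cell_a', 'cell_c', 'cell_d']
print(kept_cells(mt_threshold=3))   # ['cell_a']
```

Loosening or tightening a single threshold changes which cells reach clustering, which is why it helps to settle these values before moving on to annotation.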
Is scanpy safe to install?
It drives local scientific Python execution on your files. Check the Security Audits panel on this page and run it in an isolated environment with only data you are allowed to process.
SKILL.md
#!/usr/bin/env python3
"""
Complete Single-Cell Analysis Template

This template provides a complete workflow for single-cell RNA-seq analysis
using scanpy, from data loading through clustering and cell type annotation.
Customize the parameters and sections as needed for your specific dataset.
"""

import scanpy as sc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ============================================================================
# CONFIGURATION
# ============================================================================

# File paths
INPUT_FILE = 'data/raw_counts.h5ad'  # Change to your input file
OUTPUT_DIR = 'results/'
FIGURES_DIR = 'figures/'

# QC parameters
MIN_GENES = 200  # Minimum genes per cell
MIN_CELLS = 3  # Minimum cells per gene
MT_THRESHOLD = 5  # Maximum mitochondrial percentage

# Analysis parameters
N_TOP_GENES = 2000  # Number of highly variable genes
N_PCS = 40  # Number of principal components
N_NEIGHBORS = 10  # Number of neighbors for graph
LEIDEN_RESOLUTION = 0.5  # Clustering resolution

# Scanpy settings
sc.settings.verbosity = 3
sc.settings.set_figure_params(dpi=80, facecolor='white')
sc.settings.figdir = FIGURES_DIR
sc.settings.autosave = True

# ============================================================================
# 1. LOAD DATA
# ============================================================================

print("=" * 80)
print("LOADING DATA")
print("=" * 80)

# Load data (adjust based on your file format)
adata = sc.read_h5ad(INPUT_FILE)
# adata = sc.read_10x_mtx('data/filtered_gene_bc_matrices/')  # For 10X data
# adata = sc.read_csv('data/counts.csv')  # For CSV data

print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")

# ============================================================================
# 2. QUALITY CONTROL
# ============================================================================

print("\n" + "=" * 80)
print("QUALITY CONTROL")
print("=" * 80)

# Identify mitochondrial genes
adata.var['mt'] = adata.var_names.str.startswith('MT-')

# Calculate QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None,
                           log1p=False, inplace=True)

# Visualize QC metrics before filtering
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True, save='_qc_before_filtering')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', save='_qc_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', save='_qc_genes')

# Filter cells and genes
print(f"\nBefore filtering: {adata.n_obs} cells, {adata.n_vars} genes")

sc.pp.filter_cells(adata, min_genes=MIN_GENES)
sc.pp.filter_genes(adata, min_cells=MIN_CELLS)
# .copy() turns the filtered view into a real AnnData before later in-place steps
adata = adata[adata.obs.pct_counts_mt < MT_THRESHOLD, :].copy()

print(f"After filtering: {adata.n_obs} cells, {adata.n_vars} genes")

# Optional: doublet detection (uncomment; run before normalization)
# sc.pp.scrublet(adata)
# adata = adata[~adata.obs['predicted_doublet'], :].copy()
# print(f"After doublet removal: {adata.n_obs} cells")

# ============================================================================
# 3. NORMALIZATION
# ============================================================================

print("\n" + "=" * 80)
print("NORMALIZATION")
print("=" * 80)

# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)

# Log-transform
sc.pp.log1p(adata)

# Store normalized data
adata.raw = adata

# ============================================================================
# 4. FEATURE SELECTION
# ============================================================================

print("\n" + "=" * 80)
print("FEATURE SELECTION")
print("=" * 80)

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=N_TOP_GENES)

# Visualize
sc.pl.highly_variable_genes(adata, save='_hvg')

print(f"Selected {sum(adata.var.highly_variable)} high