
Pydeseq2
Run a DESeq2-style RNA-seq differential expression workflow in Python with correct DeseqDataSet APIs, design formulas, and pipeline steps.
Overview
pydeseq2 is an agent skill for the Build phase that documents PyDESeq2 DeseqDataSet APIs and the full deseq2() normalization and dispersion pipeline for RNA-seq counts.
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill pydeseq2
What is this skill?
- DeseqDataSet initialization with counts matrix, sample metadata, and Wilkinson design strings
- Documents the 8-step, in-place deseq2() pipeline, from size factors through Cook's distances and the optional refit
- Parameters for refit_cooks, n_cpus parallelism, and quiet logging during long runs
- to_picklable_anndata() conversion path for interoperable single-cell style workflows
- API-oriented reference for genewise dispersions, MAP dispersions, and log fold-change fitting
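To make the first pipeline step concrete, here is a minimal pure-Python sketch of median-of-ratios size factors, the normalization scheme commonly associated with DESeq2. It assumes that this is the method in use (the list above only says "size factors"). The function name is hypothetical, and this is an illustration rather than PyDESeq2's own implementation.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of samples, each a list of per-gene non-negative integers
    (samples x genes, the same orientation the skill describes).
    """
    n_samples = len(counts)
    n_genes = len(counts[0])

    # Only genes with a non-zero count in every sample are usable,
    # because the log of zero is undefined.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    if not usable:
        raise ValueError("every gene has a zero count in at least one sample")

    # Log of the geometric mean of each usable gene across samples.
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in counts) / n_samples for g in usable
    }

    # Size factor = exp(median over genes of log(count / geometric mean)).
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in counts
    ]
```

Dividing each sample's counts by its size factor gives normalized counts on a common scale. For example, a sample sequenced at twice the depth of another should receive a size factor twice as large.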
Adoption & trust: 519 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have RNA-seq counts and sample metadata but need correct PyDESeq2 class usage, design formulas, and pipeline step order without silent statistical mistakes.
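One cheap guard against a mis-specified design is to check, before building the dataset, that every variable named in the formula exists in the sample metadata. The sketch below is a hypothetical pre-check, not part of PyDESeq2. It assumes simple formulas made of plain column names joined by `+`, `:` or `*` (such as "~group + condition") and does not understand function-style terms.

```python
import re

# Plain identifiers only; intercept terms such as "1" or "0" are ignored.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def design_variables(design):
    """Return the variable names used in a simple Wilkinson formula string."""
    # Keep only the right-hand side of the formula.
    rhs = design.split("~", 1)[-1]
    names = []
    for name in _IDENTIFIER.findall(rhs):
        if name not in names:  # preserve order, drop repeats
            names.append(name)
    return names


def missing_design_variables(design, metadata_columns):
    """List design variables that are absent from the metadata columns."""
    available = set(metadata_columns)
    return [v for v in design_variables(design) if v not in available]
```

With a pandas metadata table, `missing_design_variables("~group + condition", metadata.columns)` returning a non-empty list would signal a typo or a missing annotation before any fitting starts.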
Who is it for?
Solo builders shipping bioinformatics notebooks, internal analysis tools, or reproducible DE pipelines who want agent code to match PyDESeq2’s documented flow.
Skip if: you have no count data or experimental metadata, or you want a high-level biology tutorial rather than API-level DESeq2 implementation details.
When should I use this skill?
Implementing or debugging PyDESeq2 differential expression analysis with DeseqDataSet counts, metadata, and statistical design formulas.
What do I get? / Deliverables
After using the skill, your agent produces DeseqDataSet setup and deseq2() calls with documented parameters and optional AnnData export for downstream visualization or reporting.
- DeseqDataSet construction code with a design formula and compute options (refit_cooks, n_cpus, quiet)
- Executed deseq2() workflow notes and optional AnnData export snippets
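As a rough picture of such a deliverable, the sketch below builds a synthetic samples × genes count table and passes it to `DeseqDataSet` using the parameter names from the skill's reference (`counts`, `metadata`, `design`, `refit_cooks`, `quiet`). The toy data, the helper names, and the `pydeseq2.dds` import path are assumptions for illustration. The `design` keyword in particular may not exist in older PyDESeq2 releases, so check it against the installed version.

```python
import random


def make_toy_inputs(n_per_group=3, n_genes=50, seed=0):
    """Build synthetic counts (samples x genes) and one condition per sample."""
    rng = random.Random(seed)
    sample_ids = [f"sample{i + 1}" for i in range(2 * n_per_group)]
    gene_ids = [f"gene{j + 1}" for j in range(n_genes)]
    conditions = ["control"] * n_per_group + ["treated"] * n_per_group

    rows = []
    for cond in conditions:
        row = []
        for j in range(n_genes):
            base = 50 + 10 * j
            # Toy effect: the first five genes are doubled in treated samples.
            if cond == "treated" and j < 5:
                base *= 2
            row.append(rng.randint(base // 2, base * 3 // 2))
        rows.append(row)
    return sample_ids, gene_ids, rows, conditions


def run_deseq2(sample_ids, gene_ids, rows, conditions):
    """Construct a DeseqDataSet and run the pipeline in place."""
    # Imported lazily so the toy-data helper works without these packages.
    import pandas as pd
    from pydeseq2.dds import DeseqDataSet  # assumed import path

    counts = pd.DataFrame(rows, index=sample_ids, columns=gene_ids)
    metadata = pd.DataFrame({"condition": conditions}, index=sample_ids)

    dds = DeseqDataSet(
        counts=counts,
        metadata=metadata,
        design="~condition",
        refit_cooks=True,
        quiet=True,
    )
    dds.deseq2()  # returns None; results are stored on dds
    return dds


# Usage (requires pandas and pydeseq2):
#   dds = run_deseq2(*make_toy_inputs())
```

Keeping the counts and metadata indexed by the same sample IDs, in the same order, is the main thing this layout is meant to show.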
Journey fit
RNA-seq analysis code belongs in the build phase when you are implementing computational pipelines, not when you are only validating a market idea. Backend fits because the skill centers on batch statistical computation, normalization, and in-place fitting—not UI, docs site, or agent tooling shells.
How it compares
Skill reference for the PyDESeq2 library—not a substitute for wet-lab QC, raw aligner choice, or GUI-centric stats packages.
Common Questions / FAQ
Who is pydeseq2 for?
It is for developers and scientist-builders implementing RNA-seq differential expression in Python who want agents to use DeseqDataSet and deseq2() correctly.
When should I use pydeseq2?
Use it during Build when wiring backend analysis scripts, batch jobs, or reproducible pipelines after counts and sample annotations exist; not for early market validation without data.
Is pydeseq2 safe to install?
Check the Security Audits panel on this Prism page before installing; the skill describes scientific computing code that may run local Python with filesystem access to count matrices.
SKILL.md
# PyDESeq2 API Reference

This document provides comprehensive API reference for PyDESeq2 classes, methods, and utilities.

## Core Classes

### DeseqDataSet

The main class for differential expression analysis that handles data processing from normalization through log-fold change fitting.

**Purpose:** Implements dispersion and log fold-change (LFC) estimation for RNA-seq count data.

**Initialization Parameters:**
- `counts`: pandas DataFrame of shape (samples × genes) containing non-negative integer read counts
- `metadata`: pandas DataFrame of shape (samples × variables) with sample annotations
- `design`: str, Wilkinson formula specifying the statistical model (e.g., "~condition", "~group + condition")
- `refit_cooks`: bool, whether to refit parameters after removing Cook's distance outliers (default: True)
- `n_cpus`: int, number of CPUs to use for parallel processing (optional)
- `quiet`: bool, suppress progress messages (default: False)

**Key Methods:**

#### `deseq2()`

Run the complete DESeq2 pipeline for normalization and dispersion/LFC fitting.

**Steps performed:**
1. Compute normalization factors (size factors)
2. Fit genewise dispersions
3. Fit dispersion trend curve
4. Calculate dispersion priors
5. Fit MAP (maximum a posteriori) dispersions
6. Fit log fold changes
7. Calculate Cook's distances for outlier detection
8. Optionally refit if `refit_cooks=True`

**Returns:** None (modifies object in-place)

#### `to_picklable_anndata()`

Convert the DeseqDataSet to an AnnData object that can be saved with pickle.

**Returns:** AnnData object with:
- `X`: count data matrix
- `obs`: sample-level metadata (1D)
- `var`: gene-level metadata (1D)
- `varm`: gene-level multi-dimensional data (e.g., LFC estimates)

**Usage:**
```python
import pickle

with open("result_adata.pkl", "wb") as f:
    pickle.dump(dds.to_picklable_anndata(), f)
```

**Attributes (after running deseq2()):**
- `layers`: dict containing various matrices (normalized counts, etc.)
- `varm`: dict containing gene-level results (log fold changes, dispersions, etc.)
- `obsm`: dict containing sample-level information
- `uns`: dict containing global parameters

---

### DeseqStats

Class for performing statistical tests and computing p-values for differential expression.

**Purpose:** Facilitates PyDESeq2 statistical tests using Wald tests and optional LFC shrinkage.

**Initialization Parameters:**
- `dds`: DeseqDataSet object that has been processed with `deseq2()`
- `contrast`: list or None, specifies the contrast for testing
  - Format: `[variable, test_level, reference_level]`
  - Example: `["condition", "treated", "control"]` tests treated vs control
  - If None, uses the last coefficient in the design formula
- `alpha`: float, significance threshold for independent filtering (default: 0.05)
- `cooks_filter`: bool, whether to filter outliers based on Cook's distance (default: True)
- `independent_filter`: bool, whether to perform independent filtering (default: True)
- `n_cpus`: int, number of CPUs for parallel processing (optional)
- `quiet`: bool, suppress progress messages (default: False)

**Key Methods:**

#### `summary()`

Run Wald tests and compute p-values and adjusted p-values.

**Steps performed:**
1. Run Wald statistical tests for specified contrast
2. Optional Cook's distance filtering
3. Optional independent filtering to remove low-power tests
4. Multiple testing correction (Benjamini-Hochberg procedure)

**Returns:** None (results stored in `results_df` attribute)

**Result DataFrame columns:**
- `baseMean`: mean normalized count across all samples
- `log2FoldChange`: log2 fold change between conditions
- `lfcSE`: standard error of the log2 fold change
- `stat`: Wald test statistic
- `pvalue`: raw p-value
- `padj`: adjusted p-value (FDR-corrected)

#### `lfc_shrink(coeff=None)`

Apply shrinkage to log fold changes using the apeGLM method.

**Purpose:** Reduces noise in LFC estimates for better visualization and ranking, especially for genes w