
Scikit Bio
Gives your coding agent accurate scikit-bio patterns for sequences, alignments, trees, diversity, and file I/O when you build bioinformatics or microbiome tooling.
Overview
scikit-bio is an agent skill for the Build phase that supplies API reference patterns for the scikit-bio Python library (sequences, alignments, phylogeny, diversity, and I/O).
Install
npx skills add https://github.com/k-dense-ai/scientific-agent-skills --skill scikit-bio
What is this skill?
- Covers DNA, RNA, and Protein classes with reverse complement, transcription, and genetic-code translation
- Documents alignment methods, phylogenetic trees, diversity metrics, ordination, and statistical tests
- Includes distance matrices, file I/O patterns, and a dedicated troubleshooting section
- Shows regex motif search, k-mer frequencies, and sequence metadata handling
- Structured as a nine-topic API reference table of contents for agent lookup
Adoption & trust: 541 installs on skills.sh; 27.6k GitHub stars; 3/3 security scanners passed (skills.sh audits).
What problem does it solve?
You are implementing bioinformatics features and need reliable scikit-bio class usage, operations, and file formats without guessing the API.
Who is it for?
Indie builders or researchers whose coding agents generate Python bioinformatics pipelines, notebooks, or CLI tools on top of scikit-bio.
Skip if: you are building non-scientific web or mobile apps with no sequence or omics data, or your team only needs high-level biology explanations without library calls.
When should I use this skill?
Implementing or debugging Python code that uses scikit-bio sequences, alignments, trees, diversity, or file I/O.
What do I get? / Deliverables
Your agent produces runnable skbio code with correct sequence types, analyses, and I/O aligned to the reference sections.
- Correct skbio API usage in Python modules or notebooks
- Sequence and analysis code aligned to reference patterns
Journey fit
Bioinformatics code lands in the build phase when you wire Python libraries into pipelines, notebooks, or backend analysis services. Integrations is the canonical shelf for library API reference skills that connect agent output to domain packages like skbio.
How it compares
Use as a domain API skill package, not a generic Python tutorial or an MCP server.
Common Questions / FAQ
Who is scikit-bio for?
Solo builders and small teams whose agents write Python for sequences, alignments, trees, or diversity metrics using scikit-bio.
When should I use scikit-bio?
During Build when integrating omics analysis into backends or CLIs, and when debugging agent-generated skbio scripts against the documented API sections.
Is scikit-bio safe to install?
Review the Security Audits panel on this Prism page and inspect the skill bundle in your repo before granting filesystem or network access to your agent.
SKILL.md
# scikit-bio API Reference

This document provides detailed API information, advanced examples, and troubleshooting guidance for working with scikit-bio.

## Table of Contents

1. [Sequence Classes](#sequence-classes)
2. [Alignment Methods](#alignment-methods)
3. [Phylogenetic Trees](#phylogenetic-trees)
4. [Diversity Metrics](#diversity-metrics)
5. [Ordination](#ordination)
6. [Statistical Tests](#statistical-tests)
7. [Distance Matrices](#distance-matrices)
8. [File I/O](#file-io)
9. [Troubleshooting](#troubleshooting)

## Sequence Classes

### DNA, RNA, and Protein Classes

```python
from skbio import DNA, RNA, Protein, Sequence

# Creating sequences
dna = DNA('ATCGATCG', metadata={'id': 'seq1', 'description': 'Example'})
rna = RNA('AUCGAUCG')
protein = Protein('ACDEFGHIKLMNPQRSTVWY')

# Sequence operations
dna_rc = dna.reverse_complement()  # Reverse complement
rna = dna.transcribe()             # DNA -> RNA
protein = rna.translate()          # RNA -> Protein

# Using genetic code tables
protein = rna.translate(genetic_code=11)  # Bacterial code
```

### Sequence Searching and Pattern Matching

```python
import re

from skbio import DNA

# Find motifs using regex (yields slice objects for each match)
dna = DNA('ATGCGATCGATGCATCG')
motif_locs = dna.find_with_regex('ATG.{3}')  # ATG plus the next three bases

# Find all positions
for match in re.finditer('ATG', str(dna)):
    print(f"ATG found at position {match.start()}")

# k-mer counting
kmers = dna.kmer_frequencies(k=3)
```

### Handling Sequence Metadata

```python
from skbio import DNA

# Sequence-level metadata
dna = DNA('ATCGATCGATCGATCGATCG', metadata={'id': 'seq1', 'source': 'E. coli'})
print(dna.metadata['id'])

# Positional metadata (per-base quality scores from FASTQ)
# DNA.read returns a single record from the file
first_read = DNA.read('reads.fastq', format='fastq', phred_offset=33)
quality_scores = first_read.positional_metadata['quality']

# Interval metadata (features/annotations)
# Bounds must fall within the sequence length
dna.interval_metadata.add([(5, 15)], metadata={'type': 'gene', 'name': 'geneA'})
```

### Distance Calculations

```python
from functools import partial

from skbio import DNA
from skbio.sequence.distance import kmer_distance

seq1 = DNA('ATCGATCG')
seq2 = DNA('ATCG--CG')

# Hamming distance (default)
dist = seq1.distance(seq2)

# Custom distance function; kmer_distance requires k, so bind it with partial
dist = seq1.distance(seq2, metric=partial(kmer_distance, k=3))
```

## Alignment Methods

### Pairwise Alignment

scikit-bio 0.7.0 introduced `pair_align`, a single fast engine for global, local, and semi-global alignment. The convenience wrappers `pair_align_nucl` and `pair_align_prot` ship with BLASTN/BLASTP-like scoring. The old SSW wrapper (`local_pairwise_align_ssw`, `StripedSmithWaterman`) was removed, and the pure-Python `global_pairwise_align`/`local_pairwise_align_*` functions are deprecated.
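
Because `pair_align` is only available from scikit-bio 0.7.0 onward, a generated script may want to check the installed version before importing it. The following is a minimal sketch using only the standard library. The helper names `release_tuple`, `supports_pair_align`, and `installed_skbio_version` are hypothetical and not part of scikit-bio, and pre-release suffixes such as `rc1` are ignored for simplicity.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_text):
    """Parse the leading numeric release, e.g. '0.7.0' -> (0, 7, 0).

    Hypothetical helper: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", version_text)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_text!r}")
    # Missing minor or patch components are treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())


def supports_pair_align(version_text):
    """True when the version is at least 0.7.0, the release named above."""
    return release_tuple(version_text) >= (0, 7, 0)


def installed_skbio_version():
    """Return the installed scikit-bio version string, or None if absent."""
    try:
        return version("scikit-bio")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_skbio_version()
    if found is None:
        print("scikit-bio is not installed")
    elif supports_pair_align(found):
        print(f"scikit-bio {found}: pair_align should be available")
    else:
        print(f"scikit-bio {found}: upgrade to 0.7.0 or later for pair_align")
```

Keeping the version parsing in a pure function makes the check testable without scikit-bio installed. A project that already depends on `packaging` could compare `packaging.version.Version` objects instead.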
```python
from skbio import DNA, Protein
from skbio.alignment import pair_align_nucl, pair_align_prot, pair_align, align_score

# Nucleotide alignment with BLASTN-like defaults
seq1 = DNA('ATCGATCGATCG')
seq2 = DNA('ATCGGGGATCG')
aln = pair_align_nucl(seq1, seq2)

# Inspect the result (PairAlignResult: score + paths [+ matrices])
print(f"Score: {aln.score}")
path = aln.paths[0]  # PairAlignPath; repr shows the CIGAR
aligned_seqs = path.to_aligned((seq1, seq2))  # list of gapped strings

# Global alignment with custom affine scoring via pair_align
aln = pair_align(
    seq1, seq2,
    mode='global',      # 'global' (default), 'local', or semi-global via free_ends
    sub_score=(2, -3),  # (match, mismatch)
    gap_cost=(5, 2),    # (open, extend) -> affine; a single number -> linear
)

# Use a named substitution matrix instead of match/mismatch scores
aln = pair_align(seq1, seq2, mode='global', sub_score='NUC.4.4', gap_cost=3)

# Protein alignment with BLASTP-like defaults (BLOSUM62)
protein_query = Protein('ACDEFGHIKLMNPQRSTVWY')
protein_target = Protein('ACDEFMNPQRSTVWY')
aln = pair_align_prot(protein_query, protein_target)

# Re-score an existing alignment with the same parameters
score = align_score((aln.paths[0], (protein_query, protein_target)), sub_score='