
Protein Sequence Similarity Search
Run a quick protein homologue search via ColabFold MMseqs2 and export a sorted Markdown table of hits for literature or pipeline follow-up.
Overview
Protein Sequence Similarity Search is an agent skill for the Idea phase. It runs a ColabFold MMseqs2 homologue search and saves the hits as a Markdown table sorted by E-value.
Install
npx skills add https://github.com/google-deepmind/science-skills --skill protein-sequence-similarity-search
What is this skill?
- Submits sequences to the ColabFold MMseqs2 API and polls until results are ready (15-minute timeout)
- Downloads MSA archive, parses A3M headers, and sorts homologues by E-value
- Writes a Markdown table to a required --output path
- Caps alignment reporting at 300 hits (MAX_ALIGNMENT_HITS)
- Python ≥3.10 script with scienceskillscommon HTTP client utilities
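The submit-and-poll step in the list above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `poll_until_complete`, its parameters, and the status names `PENDING`, `RUNNING`, `COMPLETE`, and `ERROR` are assumptions made for the example, and the status source is faked so the sketch runs without network access. Only the 15-minute `POLLING_TIMEOUT` value comes from the skill itself.

```python
import time
from typing import Callable

POLLING_TIMEOUT = 15 * 60  # seconds; the value documented by the skill


def poll_until_complete(
    get_status: Callable[[], str],
    timeout_s: float = POLLING_TIMEOUT,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call get_status() until it reports a terminal state or the deadline passes.

    Hypothetical helper: the status strings below are assumed, not confirmed
    by the skill's source.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in ("COMPLETE", "ERROR"):  # assumed terminal status names
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout_s} s")
        sleep(interval_s)


# Usage with a fake status source instead of a real HTTP call.
_fake_statuses = iter(["PENDING", "RUNNING", "COMPLETE"])
result = poll_until_complete(lambda: next(_fake_statuses), interval_s=0.0)
print(result)
```

Injecting `clock` and `sleep` keeps the loop testable: a test can simulate the timeout path without actually waiting 15 minutes.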
Adoption & trust: 548 installs on skills.sh; 1.7k GitHub stars; 2/3 security scanners passed (skills.sh audits).
What problem does it solve?
You have a raw protein sequence but no fast, repeatable way to pull an MSA-based homologue list with scores into your repo.
Who is it for?
Bioinformatics or ML builders prototyping on protein targets who want API-driven MMseqs2 search with file-based output.
Skip if: you work on non-protein workflows, run in offline-only environments, or need interactive BLAST UI exploration without scripting.
When should I use this skill?
You need a protein homologue / MSA hit list from ColabFold MMseqs2 written to a Markdown output file.
What do I get? / Deliverables
You get a Markdown file listing up to 300 alignment hits with bit score, identity, and E-value for downstream structural or evolutionary analysis.
- Markdown table of homologues sorted by E-value
- Downloaded MSA archive from ColabFold
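To make the deliverable concrete, here is a small sketch of how annotated header lines could be turned into an E-value-sorted Markdown table. It assumes the ten-column header order used by the script (`target`, `bit_score`, `identity`, `e_value`, then query and target coordinates). The helper names `parse_header` and `to_markdown`, the table column layout, and the sample headers with their IDs and scores are all made up for illustration and may differ from what the skill actually writes.

```python
from typing import Optional

MAX_ALIGNMENT_HITS = 300  # cap documented by the skill

COLUMNS = [
    "target", "bit_score", "identity", "e_value",
    "q_start", "q_end", "q_len", "t_start", "t_end", "t_len",
]


def parse_header(line: str) -> Optional[dict]:
    """Parse one annotated header line; return None if it lacks the stat columns."""
    parts = line.strip().split()
    if len(parts) != len(COLUMNS):
        return None  # e.g. the query header, which carries no statistics
    hit = dict(zip(COLUMNS, parts))
    hit["target"] = hit["target"].lstrip(">")
    for col in ("bit_score", "identity", "e_value"):
        hit[col] = float(hit[col])
    return hit


def to_markdown(hits: list, limit: int = MAX_ALIGNMENT_HITS) -> str:
    """Sort hits by ascending E-value, keep at most `limit`, render a table."""
    ranked = sorted(hits, key=lambda h: h["e_value"])[:limit]
    rows = [
        "| Target | Bit score | Identity | E-value |",
        "| --- | --- | --- | --- |",
    ]
    for h in ranked:
        rows.append(
            f"| {h['target']} | {h['bit_score']:.0f} "
            f"| {h['identity']:.2f} | {h['e_value']:.1e} |"
        )
    return "\n".join(rows)


# Made-up header lines for illustration only (not real search results).
headers = [
    ">hypothetical_B 120 0.45 3e-20 1 90 100 5 94 200",
    ">hypothetical_A 310 0.92 1e-80 1 100 100 1 100 100",
    ">query_only",
]
hits = [h for h in map(parse_header, headers) if h is not None]
table = to_markdown(hits)
print(table)
```

Sorting ascending puts the most significant hits (smallest E-value) first, so truncating to the first 300 rows keeps the strongest matches.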
Journey fit
Homologue discovery is an early research step before designing assays, models, or bioinformatics pipelines. Sequence similarity search fits the research subphase where you gather external evidence and related sequences before committing to an approach.
How it compares
Use it to automate ColabFold MMseqs2 API searches instead of hand-running Colab notebooks in the browser for every sequence.
Common Questions / FAQ
Who is protein-sequence-similarity-search for?
Researchers, indie bioinformatics builders, and agent users in computational biology who need scripted protein homologue tables from ColabFold.
When should I use protein-sequence-similarity-search?
During Idea research when validating a target protein, comparing orthologs, or seeding structural prediction inputs before you scope a build or prototype.
Is protein-sequence-similarity-search safe to install?
Review the Security Audits panel on this Prism page before running; the skill calls external ColabFold APIs and writes local output files.
SKILL.md - Protein Sequence Similarity Search
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Quick protein homologue search via ColabFold MMseqs2 API.

Submits a protein sequence to the ColabFold MMseqs2 server, downloads the
resulting MSA archive, parses the A3M alignment headers, and writes a
Markdown-formatted table of sequence homologues sorted by E-value to an
output file specified via the required --output flag.
"""

# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "scienceskillscommon",
# ]
# [tool.uv.sources]
# scienceskillscommon = { path = "../../scienceskillscommon" }
# ///

import argparse
import json
import os
import shutil
import sys
import tarfile
import tempfile
import time
import urllib.error
import urllib.parse

from science_skills.scienceskillscommon import http_client

MAX_ALIGNMENT_HITS = 300
POLLING_TIMEOUT = 15 * 60  # 15 minutes.
COLABFOLD_HOST = "https://api.colabfold.com"
FASTA_COLUMNS = [
    "target",
    "bit_score",
    "identity",
    "e_value",
    "q_start",
    "q_end",
    "q_len",
    "t_start",
    "t_end",
    "t_len",
]
CLIENT = http_client.HttpClient(COLABFOLD_HOST, qps=2)


def read_sequence(query_input, tty=None):
    """Read sequence from a file path or raw string."""
    if os.path.isfile(query_input):
        print(f"[*] Reading sequence from file: {query_input}")
        if tty:
            print(f"[*] Reading sequence from file: {query_input}", file=tty)
        with open(query_input, "r") as f:
            sequence = f.read().strip()
        if sequence.startswith(">"):
            print("[*] Sequence is in FASTA format")
            if tty:
                print("[*] Sequence is in FASTA format", file=tty)
            sequence = "".join(sequence.split("\n")[1:])  # Remove FASTA header
        else:
            print("[*] Sequence is in raw format")  # No further processing needed
            if tty:
                print("[*] Sequence is in raw format", file=tty)
    else:
        print("[*] Using raw sequence string provided via command line.")
        sequence = query_input.strip()

    if not sequence:
        print("[!] Error: Empty sequence provided.")
        if tty:
            print("[!] Error: Empty sequence provided.", file=tty)
        sys.exit(1)
    return sequence


def parse_a3m(file_path, q_len):
    """Parse ColabFold-annotated A3M headers into hit dictionaries."""
    homologues = []
    with open(file_path, "r") as f:
        for line in f:
            if not line.startswith(">"):
                continue
            parts = line.strip().split()
            # Skip query header (no stat columns) or malformed lines
            if len(parts) < 10:
                continue
            try:
                hit = dict(zip(FASTA_COLUMNS, parts, strict=True))
                for col in ["q_start", "q_end", "q_len", "t_start", "t_end", "t_len"]:
                    hit[col] = int(hit[col])
                for col in ["bit_score", "identity", "e_value"]:
                    hit[col] = float(hit[col])

                # Query Coverage
                if q_len > 0 and hit["q_end"] > hit["q_start"]:
                    aligned_residues = hit["q_end"] - hit["q_start"] + 1
                    cov = min((aligned_residues / q_len) * 100, 100.0)
                else:
                    cov = 0.0

                # Alignment length (target span)
                if hit["t_end"] > hit["t_start"]:
                    aln_len = hit["t_end"] - hit["t_start"] + 1
                else:
                    aln_len = 0

                hit |= {
                    "target_id": hit["target"][1:],  # Strip leading '>'
                    "q_cov": cov,
                    "aln_len": aln_len,
                }
                homologues.append(hit)
            except (ValueError, IndexError):
                print(
                    f"[!] Warning: S