Protein Sequence Similarity Search

Name: Protein Sequence Similarity Search
Author: google-deepmind

google-deepmind/science-skills

Run a quick protein homologue search via ColabFold MMseqs2 and export a sorted Markdown table of hits for literature or pipeline follow-up.

Overview

Protein Sequence Similarity Search is an agent skill for the Idea phase. It runs a ColabFold MMseqs2 homologue search and saves the hits as a Markdown table sorted by E-value.

Install

npx skills add https://github.com/google-deepmind/science-skills --skill protein-sequence-similarity-search

What is this skill?

Submits sequences to the ColabFold MMseqs2 API and polls until results are ready, up to a 15-minute timeout (POLLING_TIMEOUT)
Downloads the MSA archive, parses A3M headers, and sorts homologues by E-value
Writes a Markdown table to a required --output path
Reports up to 300 alignment hits (MAX_ALIGNMENT_HITS)
Python ≥3.10 script with scienceskillscommon HTTP client utilities
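
The submit-and-poll step can be sketched as a deadline loop. This is a minimal illustration, not the skill's actual implementation: poll_until_done, the fetch_status callable, the polling interval, and the "COMPLETE" and "ERROR" status labels are all assumptions made for the example. Only the 15-minute POLLING_TIMEOUT value comes from the script.

```python
import time

POLLING_TIMEOUT = 15 * 60  # Seconds; same value as the constant in the script.


def poll_until_done(fetch_status, timeout=POLLING_TIMEOUT, interval=5.0,
                    clock=time.monotonic, sleep=time.sleep):
  """Call fetch_status() until it reports a terminal state or time runs out.

  fetch_status is a hypothetical callable that returns a status string.
  The "COMPLETE" and "ERROR" labels are assumed for illustration.
  """
  deadline = clock() + timeout
  while True:
    status = fetch_status()
    if status == "COMPLETE":
      return status
    if status == "ERROR":
      raise RuntimeError("search job failed")
    # Give up if the next wait would run past the deadline.
    if clock() + interval > deadline:
      raise TimeoutError(f"no result after {timeout} seconds")
    sleep(interval)
```

Passing the clock and sleep functions as parameters keeps the loop testable without real waiting; in normal use the defaults apply and only fetch_status needs to be supplied.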

Compatible agents: Claude Code, Codex, Cursor, any compatible agent

Adoption & trust: 548 installs on skills.sh; 1.7k GitHub stars; 2/3 security scanners passed (skills.sh audits).

What problem does it solve?

You have a raw protein sequence but no fast, repeatable way to pull an MSA-based homologue list with scores into your repo.

Who is it for?

Bioinformatics or ML builders prototyping on protein targets who want API-driven MMseqs2 search with file-based output.

Skip if: you work on non-protein data, run in an offline-only environment, or need interactive BLAST UI exploration without scripting.

When should I use this skill?

You need a protein homologue / MSA hit list from ColabFold MMseqs2 written to a Markdown output file.

What do I get? / Deliverables

You get a Markdown file listing up to 300 alignment hits with bit score, identity, and E-value for downstream structural or evolutionary analysis.

Markdown table of homologues sorted by E-value
Downloaded MSA archive from ColabFold
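
As a rough picture of the first deliverable, the sketch below sorts hit dictionaries by E-value, applies the 300-hit cap, and renders a Markdown table. The hits_to_markdown name, the column selection, and the number formatting are assumptions; the skill's real table layout may differ. The example hits are made-up values.

```python
MAX_ALIGNMENT_HITS = 300  # Same cap as the constant in the script.


def hits_to_markdown(hits, limit=MAX_ALIGNMENT_HITS):
  """Render hit dicts as a Markdown table, lowest E-value first."""
  ranked = sorted(hits, key=lambda h: h["e_value"])[:limit]
  lines = [
      "| Target | Bit score | Identity | E-value |",
      "| --- | --- | --- | --- |",
  ]
  for h in ranked:
    lines.append(
        f"| {h['target_id']} | {h['bit_score']:.1f} | "
        f"{h['identity']:.2f} | {h['e_value']:.2e} |"
    )
  return "\n".join(lines) + "\n"


# Hypothetical hits; IDs and scores are invented for this example.
example_hits = [
    {"target_id": "hit_B", "bit_score": 88.0, "identity": 0.41,
     "e_value": 3e-20},
    {"target_id": "hit_A", "bit_score": 210.5, "identity": 0.93,
     "e_value": 1e-60},
]
table = hits_to_markdown(example_hits)
print(table)
```

Sorting ascending puts the most significant hit (hit_A here) in the first data row, and slicing after the sort means the cap drops the weakest hits rather than arbitrary ones.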

Journey fit

Primary fit

Idea: Opportunity & market research

Homologue discovery is an early research step before designing assays, models, or bioinformatics pipelines. Sequence similarity search fits the research subphase where you gather external evidence and related sequences before committing to an approach.

Also useful

Validate: Prototype & spike

How it compares

Use it to automate ColabFold MMseqs2 API searches instead of hand-running Colab notebooks in the browser for every sequence.

Common Questions / FAQ

Who is protein-sequence-similarity-search for?

Researchers, indie bioinformatics builders, and agent users in computational biology who need scripted protein homologue tables from ColabFold.

When should I use protein-sequence-similarity-search?

During Idea research when validating a target protein, comparing orthologs, or seeding structural prediction inputs before you scope a build or prototype.

Is protein-sequence-similarity-search safe to install?

Review the Security Audits panel on this Prism page before running; the skill calls external ColabFold APIs and writes local output files.

SKILL.md

SKILL.md - Protein Sequence Similarity Search

# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Quick protein homologue search via ColabFold MMseqs2 API.

Submits a protein sequence to the ColabFold MMseqs2 server, downloads the
resulting MSA archive, parses the A3M alignment headers, and writes a
Markdown-formatted table of sequence homologues sorted by E-value to an
output file specified via the required --output flag.
"""

# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "scienceskillscommon",
# ]
# [tool.uv.sources]
# scienceskillscommon = { path = "../../scienceskillscommon" }
# ///

import argparse
import json
import os
import shutil
import sys
import tarfile
import tempfile
import time
import urllib.error
import urllib.parse

from science_skills.scienceskillscommon import http_client

MAX_ALIGNMENT_HITS = 300
POLLING_TIMEOUT = 15 * 60  # 15 minutes.
COLABFOLD_HOST = "https://api.colabfold.com"
# Whitespace-separated fields of each annotated A3M header line, in order.
FASTA_COLUMNS = [
    "target",
    "bit_score",
    "identity",
    "e_value",
    "q_start",
    "q_end",
    "q_len",
    "t_start",
    "t_end",
    "t_len",
]
CLIENT = http_client.HttpClient(COLABFOLD_HOST, qps=2)


def read_sequence(query_input, tty=None):
  """Read sequence from a file path or raw string."""
  if os.path.isfile(query_input):
    print(f"[*] Reading sequence from file: {query_input}")
    if tty:
      print(f"[*] Reading sequence from file: {query_input}", file=tty)
    with open(query_input, "r") as f:
      sequence = f.read().strip()
      if sequence.startswith(">"):
        print("[*] Sequence is in FASTA format")
        if tty:
          print("[*] Sequence is in FASTA format", file=tty)
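        # Drop the FASTA header line and join the wrapped sequence lines;
        # splitlines() plus strip() also handles Windows line endings.
        sequence = "".join(
            line.strip() for line in sequence.splitlines()[1:]
        )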
      else:
        print("[*] Sequence is in raw format")  # No further processing needed
        if tty:
          print("[*] Sequence is in raw format", file=tty)
  else:
    print("[*] Using raw sequence string provided via command line.")
    if tty:
      print(
          "[*] Using raw sequence string provided via command line.",
          file=tty,
      )
    sequence = query_input.strip()

  if not sequence:
    print("[!] Error: Empty sequence provided.")
    if tty:
      print("[!] Error: Empty sequence provided.", file=tty)
    sys.exit(1)

  return sequence


def parse_a3m(file_path, q_len):
  """Parse ColabFold-annotated A3M headers into hit dictionaries."""
  homologues = []

  with open(file_path, "r") as f:
    for line in f:
      if not line.startswith(">"):
        continue

      parts = line.strip().split()

      # Skip query header (no stat columns) or malformed lines
      if len(parts) < 10:
        continue

      try:
        hit = dict(zip(FASTA_COLUMNS, parts, strict=True))
        for col in ["q_start", "q_end", "q_len", "t_start", "t_end", "t_len"]:
          hit[col] = int(hit[col])
        for col in ["bit_score", "identity", "e_value"]:
          hit[col] = float(hit[col])

        # Query Coverage
        if q_len > 0 and hit["q_end"] > hit["q_start"]:
          aligned_residues = hit["q_end"] - hit["q_start"] + 1
          cov = min((aligned_residues / q_len) * 100, 100.0)
        else:
          cov = 0.0

        # Alignment length (target span)
        if hit["t_end"] > hit["t_start"]:
          aln_len = hit["t_end"] - hit["t_start"] + 1
        else:
          aln_len = 0

        hit |= {
            "target_id": hit["target"][1:],  # Strip leading '>'
            "q_cov": cov,
            "aln_len": aln_len,
        }
        homologues.append(hit)
      except (ValueError, IndexError):
        print(
            f"[!] Warning: S
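
To make the arithmetic in parse_a3m concrete, here is a minimal sketch that parses a single annotated header held in a string rather than a file. The parse_header name and the header values are invented for illustration; only the ten-column order and the coverage formula are taken from the script.

```python
# Column order copied from FASTA_COLUMNS in the script.
COLUMNS = [
    "target", "bit_score", "identity", "e_value",
    "q_start", "q_end", "q_len", "t_start", "t_end", "t_len",
]


def parse_header(line, q_len):
  """Parse one annotated A3M header; return None if stat columns are absent."""
  parts = line.strip().split()
  if not line.startswith(">") or len(parts) != len(COLUMNS):
    return None
  hit = dict(zip(COLUMNS, parts))
  for col in ("q_start", "q_end", "q_len", "t_start", "t_end", "t_len"):
    hit[col] = int(hit[col])
  for col in ("bit_score", "identity", "e_value"):
    hit[col] = float(hit[col])

  # Query coverage: aligned query span as a percentage of the query length.
  if q_len > 0 and hit["q_end"] > hit["q_start"]:
    span = hit["q_end"] - hit["q_start"] + 1
    hit["q_cov"] = min(span / q_len * 100, 100.0)
  else:
    hit["q_cov"] = 0.0

  hit["target_id"] = hit["target"][1:]  # Strip the leading '>'.
  return hit


# Hypothetical header: the ID and every number are made up for this example.
header = ">EXAMPLE_HIT 250 0.85 1e-70 11 110 200 5 104 150"
hit = parse_header(header, q_len=200)
print(hit["target_id"], hit["q_cov"])  # Residues 11..110 of 200 give 50.0.
```

A header without the stat columns, such as the query's own entry, has fewer than ten fields and is skipped, which matches the length check in the script.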
