
Citation Management
Automatically find missing citations in LaTeX manuscripts by scanning claims and querying Semantic Scholar for BibTeX candidates.
Overview
citation-management is an agent skill, most often used in Build (also Idea research and Grow content), that finds missing citations in LaTeX manuscripts by querying Semantic Scholar and writes the matches as candidate BibTeX entries.
Install
npx skills add https://github.com/lingzhi227/agent-research-skills --skill citation-management
What is this skill?
- Scans LaTeX for under-cited factual sentences using claim heuristics
- Calls Semantic Scholar search API and outputs candidate BibTeX entries
- Stdlib-only Python harvest script with --dry-run, --max-rounds, and --verbose
- CLI: python harvest_citations.py --tex main.tex --bib references.bib --output candidates.bib
- Semantic Scholar graph v1 paper search API
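The claim-scanning step in the list above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the two patterns are a small sample of the kind of heuristic the skill describes, `find_uncited` is a hypothetical helper name, and the citation key `placeholder2020` is made up for the sample text.

```python
import re

# Illustrative subset of claim heuristics (assumption: the real list is longer).
CLAIM_PATTERNS = [
    r"(?:has been shown|have been shown|is known|are known)",
    r"(?:widely used|commonly used|well-known)",
]


def find_uncited(text: str) -> list[str]:
    """Return sentences that match a claim pattern and contain no \\cite."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hits = []
    for sent in sentences:
        sent = sent.strip()
        # Skip empty fragments and sentences that already carry a citation.
        if not sent or "\\cite" in sent:
            continue
        if any(re.search(p, sent, re.IGNORECASE) for p in CLAIM_PATTERNS):
            hits.append(sent)
    return hits


sample = (
    "This method has been shown to reduce error. "
    "The optimizer is widely used for training \\cite{placeholder2020}. "
    "We describe our setup next."
)
print(find_uncited(sample))
```

Only the first sample sentence is reported: the second already has a citation and the third matches no pattern. Pattern matching of this kind is a heuristic, so both false positives and missed claims should be expected.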
Adoption & trust: 864 installs on skills.sh; 114 GitHub stars; 0/3 security scanners passed (skills.sh audits).
What problem does it solve?
Your LaTeX draft makes factual claims without \cite commands, and manual literature search is slowing publication.
Who is it for?
Solo academics and indie researchers maintaining main.tex plus references.bib who want automated citation gap detection.
Skip if: Writers not using LaTeX/BibTeX or anyone needing guaranteed peer-review–grade citation verification without human review.
When should I use this skill?
You have a .tex manuscript and references.bib and need to find citations for factual sentences lacking \cite.
What do I get? / Deliverables
You get a candidates.bib (or dry-run report) of Semantic Scholar–matched references mapped to under-cited sentences in your .tex file.
- candidates.bib with proposed BibTeX entries
- Dry-run or verbose logs of under-cited claim matches
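To make the candidates.bib deliverable concrete, here is a hedged sketch of how one search result might be turned into a BibTeX entry. The helper `to_bibtex`, the key scheme (surname + year + first title word), and the choice of `@article` with a `journal` field are all assumptions for illustration; the skill's own formatting may differ. The record mimics the `title`, `authors`, `year`, and `venue` fields the script requests, with made-up values.

```python
import re


def to_bibtex(paper: dict) -> str:
    """Format one paper record as a BibTeX @article entry (hypothetical helper)."""
    authors = [a["name"] for a in paper.get("authors", [])]
    first_surname = authors[0].split()[-1] if authors else "unknown"
    year = paper.get("year") or "nd"
    title = paper.get("title", "")
    title_words = re.findall(r"[A-Za-z]+", title)
    # Assumed key scheme: surname + year + first title word, lowercased.
    key = f"{first_surname}{year}{title_words[0] if title_words else ''}".lower()
    lines = [
        f"@article{{{key},",
        f"  title = {{{title}}},",
        f"  author = {{{' and '.join(authors)}}},",
        f"  year = {{{year}}},",
    ]
    if paper.get("venue"):
        lines.append(f"  journal = {{{paper['venue']}}},")
    lines.append("}")
    return "\n".join(lines)


# Made-up record, not a real paper.
record = {
    "title": "Example Paper Title",
    "authors": [{"name": "Ada Example"}, {"name": "Bob Sample"}],
    "year": 2020,
    "venue": "Example Venue",
}
entry = to_bibtex(record)
print(entry)
```

Entries produced this way are candidates only; author lists, venues, and entry types should be checked by hand before merging into references.bib.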
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Academic and technical writing most often happens while documenting research outputs during build, even though citation hygiene also supports validate and launch content. Docs is the primary shelf because the workflow centers on .tex manuscripts, references.bib, and BibTeX candidate output, not app runtime code.
Where it fits
Map which claims in an outline need sources before you commit to a full experiment write-up.
Run harvest_citations.py on main.tex to populate candidates.bib before submitting to arXiv.
Refresh citations on a technical blog post exported to LaTeX for SEO-heavy evergreen content.
How it compares
A focused LaTeX citation harvester, not a full Zotero integration or generic web research MCP.
Common Questions / FAQ
Who is citation-management for?
Solo builders and researchers drafting LaTeX papers who want agent-assisted bibliography expansion from Semantic Scholar.
When should I use citation-management?
During Build → docs while drafting papers; in Idea → research when surveying literature gaps; in Grow → content when refreshing cited long-form posts.
Is citation-management safe to install?
The script uses network calls to Semantic Scholar; review the Security Audits panel on this page and inspect harvest_citations.py before running on confidential drafts.
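When judging whether a draft is too sensitive, it helps to see what a search request would contain. The sketch below builds a query URL in the same spirit as harvest_citations.py (content words from the sentence, sent as the `query` parameter) without making any network call. `build_search_url` and the shortened stopword set and field list are assumptions for illustration.

```python
import re
import urllib.parse

S2_API = "https://api.semanticscholar.org/graph/v1/paper/search"

# Shortened stopword set for illustration (assumption).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "is", "are", "we", "our"}


def build_search_url(sentence: str, limit: int = 3) -> str:
    """Build the search URL for a sentence without sending it (hypothetical helper)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    content = [w for w in words if w.lower() not in STOPWORDS and len(w) > 2]
    query = " ".join(content[:8])  # up to eight content words form the query
    params = urllib.parse.urlencode(
        {"query": query, "limit": limit, "fields": "title,authors,year"}
    )
    return f"{S2_API}?{params}"


url = build_search_url("Our secret method is known to outperform the baseline.")
print(url)
```

The printed URL contains the words "secret method known outperform baseline", which shows that distinctive terms from an unpublished sentence would be transmitted to the API.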
SKILL.md - Citation Management
#!/usr/bin/env python3
"""Harvest missing citations for a LaTeX paper.

Scans .tex for under-cited claims (sentences with factual assertions but no \\cite),
generates search queries, calls Semantic Scholar API, and outputs candidate BibTeX entries.

Self-contained: uses only stdlib.

Usage:
    python harvest_citations.py --tex main.tex --bib references.bib --output candidates.bib
    python harvest_citations.py --tex main.tex --bib references.bib --max-rounds 10 --dry-run
    python harvest_citations.py --tex main.tex --bib references.bib --output candidates.bib --verbose
"""

import argparse
import json
import os
import re
import sys
import time
import urllib.error
import urllib.parse
import urllib.request

S2_API = "https://api.semanticscholar.org/graph/v1/paper/search"

CLAIM_PATTERNS = [
    r"(?:has been shown|have been shown|was shown|were shown|is known|are known)",
    r"(?:recent(?:ly)?|prior|previous) (?:work|studies?|research|methods?|approaches?)",
    r"(?:state[- ]of[- ]the[- ]art|SOTA|benchmark)",
    r"(?:outperform|surpass|exceed|achieve|obtain|report|demonstrate|propose|introduce)",
    r"(?:widely used|commonly used|popular|well-known|established)",
    r"(?:inspired by|motivated by|based on|building on|following)",
    r"(?:\d+\.?\d*)\s*%",  # Numbers that likely need citation
]

COMMON_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "and", "or",
    "is", "are", "was", "were", "be", "been", "with", "from", "by", "as",
    "we", "our", "this", "that", "these", "those", "it", "its",
}


def extract_existing_keys(bib_content: str) -> set[str]:
    """Extract all BibTeX keys from .bib file."""
    return set(re.findall(r"@\w+\{([^,]+),", bib_content))


def extract_cited_keys(tex_content: str) -> set[str]:
    """Extract all cited keys from .tex file."""
    keys = set()
    for match in re.findall(r"\\cite[a-z]*\{([^}]+)\}", tex_content):
        for key in match.split(","):
            keys.add(key.strip())
    return keys


def find_uncited_claims(tex_content: str) -> list[dict]:
    """Find sentences with factual claims that lack citations."""
    # Remove comments
    text = re.sub(r"%.*$", "", tex_content, flags=re.MULTILINE)
    # Remove math environments
    text = re.sub(r"\$\$.*?\$\$", "", text, flags=re.DOTALL)
    text = re.sub(r"\$.*?\$", "", text)
    # Remove commands but keep text
    text = re.sub(r"\\(?:begin|end)\{[^}]+\}", "", text)

    sentences = re.split(r"(?<=[.!?])\s+", text)
    claims = []
    for sent in sentences:
        sent = sent.strip()
        if not sent or len(sent) < 30:
            continue
        # Skip if already has a citation
        if re.search(r"\\cite", sent):
            continue
        # Check for claim patterns
        for pattern in CLAIM_PATTERNS:
            if re.search(pattern, sent, re.IGNORECASE):
                # Extract key terms for search query
                words = re.findall(r"[A-Za-z]+", sent)
                content_words = [w for w in words if w.lower() not in COMMON_WORDS and len(w) > 2]
                query = " ".join(content_words[:8])
                claims.append({
                    "sentence": sent[:200],
                    "pattern": pattern,
                    "query": query,
                })
                break
    return claims


def search_semantic_scholar(query: str, limit: int = 3, api_key: str = "") -> list[dict]:
    """Search Semantic Scholar for papers matching the query."""
    params = urllib.parse.urlencode({
        "query": query,
        "limit": limit,
        "fields": "title,authors,year,venue,externalIds,citationCount,abstract",
    })
    url = f"{S2_API}?{params}"
    headers = {"User-Agent": "SkillScript/1.0"}
    if api_key:
        headers["x-api-key"] = api_key
    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req, timeout=15) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        return data.get("da