
Deep Research
Query arXiv with correct field syntax, categories, and pagination when validating technical ideas or surveying papers for an agent or ML product.
Overview
Deep Research is an agent skill, most often used in the Idea phase (also in Build backend and Operate iterate), that queries the arXiv API with correct syntax, categories, and rate-aware pagination.
Install
npx skills add https://github.com/lingzhi227/agent-research-skills --skill deep-research
What is this skill?
- Documents arXiv export API base URL and Atom XML response format
- Field prefixes: ti, au, abs, all, cat with AND/OR/ANDNOT and grouping
- Maps common CS and q-bio categories (cs.AI, cs.CL, cs.LG, cs.CV, etc.)
- Pagination via start and max_results with a 100-results-per-request cap
- Rate limit guidance: about one request every three seconds
Adoption & trust: 779 installs on skills.sh; 114 GitHub stars; 1/3 security scanners passed (skills.sh audits).
What problem does it solve?
You need academic or preprint evidence for an AI feature but ad-hoc web search misses papers or misuses arXiv query syntax and rate limits.
Who is it for?
Solo builders and indie researchers scoping ML, NLP, or agent papers before writing specs or choosing baselines.
Skip if: you only need consumer web SEO or competitor landing pages without scholarly sources.
When should I use this skill?
User needs deep literature or arXiv-backed research with API parameters, categories, sorting, or pagination.
What do I get? / Deliverables
You obtain reproducible arXiv search URLs and parameter sets your agent can paginate through without exceeding documented request caps.
- Parameterized arXiv search_query strings
- Pagination plan using start and max_results
- Category and sort configuration for reproducible runs
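As a rough illustration of what such a pagination plan could look like, the sketch below turns one `search_query` into a list of reproducible page URLs. The function name `plan_pages` and its defaults are hypothetical and not part of the skill; the 100-result cap and the parameter names come from the reference further down.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"
PAGE_CAP = 100  # per-request cap stated in the skill reference


def plan_pages(search_query, total, page_size=PAGE_CAP,
               sort_by="submittedDate", sort_order="descending"):
    """Return one query URL per page needed to cover `total` results.

    Hypothetical helper: it only builds URLs and sends no requests.
    """
    page_size = max(1, min(page_size, PAGE_CAP))
    urls = []
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            # The last page may be shorter than a full page.
            "max_results": min(page_size, total - start),
            "sortBy": sort_by,
            "sortOrder": sort_order,
        }
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls


urls = plan_pages('cat:cs.AI AND all:"language model"', total=250)
```

Because sorting and offsets are fixed in the URL, the same list can be stored alongside results so a run can be repeated later.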
Journey fit
Spans multiple journey phases - primary shelf plus alternate fits below.
Idea phase is where founders explore whether a technique is published, feasible, or already saturated—arXiv is the primary shelf for that evidence gathering. Research subphase matches structured literature search, not prototyping code or landing copy.
Where it fits
Run cat:cs.AI and all:"language model" queries to see if your agent idea is already crowded in recent preprints.
Scan cs.MA listings to discover multi-agent system trends before writing a positioning doc.
Embed arXiv fetch parameters in a small research sidecar service with 100-result pages.
Re-query lastUpdatedDate sorts monthly to watch new papers affecting your model roadmap.
How it compares
Structured arXiv integration reference—not a general Perplexity-style deep web crawl skill.
Common Questions / FAQ
Who is deep-research for?
Indie builders and agent authors who need programmatic arXiv literature search with field prefixes, categories, and pagination discipline.
When should I use deep-research?
In Idea research when validating novelty; during Build when picking model families from recent cs.CL or cs.LG papers; in Operate iterate when monitoring new preprints in your niche.
Is deep-research safe to install?
The skill describes public HTTP queries only; review the Security Audits panel on this Prism page and avoid piping untrusted XML into unsafe parsers in your own scripts.
SKILL.md
# API Reference Guide

## arXiv API

### Base URL

```
http://export.arxiv.org/api/query
```

### Query Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `search_query` | Search terms with field prefixes | `all:transformer+AND+cat:cs.AI` |
| `start` | Offset for pagination | `0` |
| `max_results` | Results per page (max 100) | `50` |
| `sortBy` | Sort field | `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | Sort direction | `descending`, `ascending` |

### Query Syntax

- **Field prefixes**: `ti:` (title), `au:` (author), `abs:` (abstract), `all:` (all fields), `cat:` (category)
- **Boolean operators**: `AND`, `OR`, `ANDNOT`
- **Grouping**: parentheses `()`
- **Examples**:
  - `all:transformer AND cat:cs.CL`: transformers in CL
  - `au:vaswani AND ti:attention`: Vaswani papers about attention
  - `(cat:cs.AI OR cat:cs.CL) AND all:"language model"`: LM papers in AI or CL

### Common Categories

| Category | Field |
|----------|-------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.LG` | Machine Learning |
| `cs.CV` | Computer Vision |
| `cs.MA` | Multiagent Systems |
| `cs.SE` | Software Engineering |
| `q-bio.BM` | Biomolecules |
| `q-bio.GN` | Genomics |
| `q-bio.QM` | Quantitative Methods |
| `stat.ML` | Machine Learning (Statistics) |

### Rate Limits

- **1 request per 3 seconds** (be conservative)
- Results are Atom XML format
- Max 100 results per request, paginate for more

### Script Usage

```bash
python /Users/lingzhi/.claude/skills/deep-research/scripts/search_arxiv.py \
  --query "long context reasoning LLM" \
  --max-results 50 \
  --categories cs.AI cs.CL \
  --sort-by relevance \
  --start-date 2023-01-01 \
  -o results.jsonl
```

### WebFetch Usage

```
WebFetch http://export.arxiv.org/api/query?search_query=all:transformer+AND+cat:cs.AI&max_results=10&sortBy=relevance
```

Parse the Atom XML response to extract paper entries.
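
The skill's own `search_arxiv.py` is not reproduced here, so the following is only a sketch of how the Atom response could be parsed and how several page URLs could be fetched at the 3-second pace stated above. `parse_entries`, `fetch_pages`, and `SAMPLE_FEED` are hypothetical names, and the sample entry is placeholder data, not a real paper.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}  # Atom namespace


def _text(node, path):
    """Return whitespace-normalized text of a child element, or ""."""
    return " ".join(node.findtext(path, default="", namespaces=ATOM).split())


def parse_entries(atom_xml):
    """Extract a few fields from each <entry> of an Atom feed (str or bytes)."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        papers.append({
            "id": _text(entry, "atom:id"),
            "title": _text(entry, "atom:title"),
            "published": _text(entry, "atom:published"),
            "authors": [_text(a, "atom:name")
                        for a in entry.findall("atom:author", ATOM)],
            "categories": [c.get("term")
                           for c in entry.findall("atom:category", ATOM)],
        })
    return papers


def fetch_pages(urls, delay_s=3.0):
    """Fetch each URL in order, sleeping between requests (not run here)."""
    papers = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay_s)  # stay at or below 1 request per 3 seconds
        with urllib.request.urlopen(url, timeout=30) as resp:
            papers.extend(parse_entries(resp.read()))
    return papers


# Placeholder feed for illustration only; not a real arXiv record.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2024-01-01T00:00:00Z</published>
    <title>An Example
      Title</title>
    <author><name>A. Author</name></author>
    <author><name>B. Author</name></author>
    <category term="cs.CL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.AI" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>"""

papers = parse_entries(SAMPLE_FEED)
```

Titles in Atom feeds can wrap across lines, which is why the helper collapses whitespace. For feeds from sources you do not trust, a hardened parser such as the third-party `defusedxml` package may be a safer choice than the standard library parser used in this sketch.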

---

## Semantic Scholar Graph API

### Base URL

```
https://api.semanticscholar.org/graph/v1
```

### Authentication

- API key from `/Users/lingzhi/Code/keys.md` (field `S2_API_Key`)
- Header: `x-api-key: <key>`
- Without key: 100 requests/5 min. With key: 1 request/second sustained.

### Endpoints

#### Paper Search

```
GET /paper/search?query=...&fields=...&offset=0&limit=100
```

| Parameter | Description |
|-----------|-------------|
| `query` | Search string |
| `fields` | Comma-separated field list |
| `offset` | Pagination offset |
| `limit` | Results per page (max 100) |
| `year` | Year range filter (e.g., `2020-2026`, `2024-`, `-2020`) |
| `fieldsOfStudy` | Filter by field (e.g., `Computer Science`) |
| `venue` | Filter by venue |

#### Paper Details

```
GET /paper/{paper_id}?fields=...
```

`paper_id` can be: Semantic Scholar paperId, `arxiv:2401.12345`, `DOI:10.xxx`, `PMID:xxx`

#### Citations

```
GET /paper/{paper_id}/citations?fields=...&limit=1000
```

Returns papers that cite the given paper.

#### References

```
GET /paper/{paper_id}/references?fields=...&limit=1000
```

Returns papers referenced by the given paper.

#### Batch Paper Details

```
POST /paper/batch?fields=...
Body: {"ids": ["paper_id_1", "arxiv:2401.12345", ...]}
```

Get details for up to 500 papers at once.
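
One way the batch endpoint might be driven from Python is sketched below: it splits a long ID list into chunks of at most 500 and prepares one POST request per chunk. `build_batch_requests` is a hypothetical helper, not part of the skill's scripts, and it only constructs the requests without sending them.

```python
import json
import urllib.request
from urllib.parse import urlencode

S2_BASE = "https://api.semanticscholar.org/graph/v1"
BATCH_CAP = 500  # per-call limit stated above


def build_batch_requests(paper_ids, fields, api_key=None):
    """Prepare POST /paper/batch requests, one per chunk of <= 500 IDs."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["x-api-key"] = api_key  # header name from the section above
    url = f"{S2_BASE}/paper/batch?{urlencode({'fields': ','.join(fields)})}"
    requests = []
    for i in range(0, len(paper_ids), BATCH_CAP):
        chunk = paper_ids[i:i + BATCH_CAP]
        body = json.dumps({"ids": chunk}).encode("utf-8")
        requests.append(
            urllib.request.Request(url, data=body, headers=headers, method="POST")
        )
    return requests


reqs = build_batch_requests(["arxiv:2401.12345"], ["title", "year", "citationCount"])
```

Each prepared request could then be passed to `urllib.request.urlopen`, with a pause between calls that respects whichever rate limit applies to your key.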

### Useful Fields

```
title,authors,abstract,year,venue,citationCount,referenceCount,
externalIds,url,publicationDate,tldr,isOpenAccess,openAccessPdf
```

### Rate Limits

- **Public**: 100 requests per 5 minutes (burst)
- **Authenticated**: 1 request/second sustained, 10/second burst
- On 429: exponential backoff (2s, 4s, 8s)

### Script Usage

```bash
python /Users/lingzhi/.claude/skills/deep-research/scripts/search_semantic_scholar.py \
  --query "long horizon reasoning LLM agent" \
  --max-results 100 \
  --min-citations 10 \
  --year-range 2022-2026 \
  --api-key <key> \
  -o results.jsonl
```

### WebFetch Usage

```
WebFetch https://api.semanticscholar.org/graph/v1/paper/search?query=long+horizon+reasoning&