Github Research

The canonical shelf is Idea research because the intake phase explicitly seeds discovery from paper_db.jsonl and the synthesis artifacts produced upstream. The research subphase fits the phased intake → discovery → evaluation workflow, which aims at mapping the open-source landscape, not shipping code.

Where it fits

Example use

After deep-research emits paper_db.jsonl, run intake to harvest repo URLs and tiered keywords before picking a problem to build.

Example use

Use the discovery matrix to compare star-sorted landscape repos versus paper-title best-match hits for a niche method.

Example use

Narrow prototype scope to two GitHub baselines that papers explicitly reference instead of guessing framework popularity.

Example use

Choose fork-or-wrap targets from mapped repos when wiring an agent tool to an existing OSS implementation.

How it compares

Use as a structured research workflow after deep-research, not as a generic GitHub CLI or star-ranking bookmark tool.

Common Questions / FAQ

Who is github-research for?

Solo builders and small teams building research agents who need repeatable GitHub discovery seeded from academic or synthesis outputs.

When should I use github-research?

In Idea research after deep-research runs, again in Validate when scoping which OSS baselines to prototype, and in Build integrations when picking reference implementations.

Is github-research safe to install?

Check the Security Audits panel on this page. The skill implies GitHub API access and file reads on research directories, so review tokens and local data paths before automating.

SKILL.md

# GitHub Research — Phase Guide

Detailed methodology reference for the github-research skill.

## Phase 1: Intake — Detailed Guide

### Purpose
Extract structured information from deep-research output to seed GitHub discovery.

### Input Requirements
- Deep-research output directory containing:
  - `paper_db.jsonl` (required)
  - `phase4_code/code_repos.md` (optional but valuable)
  - `phase5_synthesis/synthesis.md` (optional)
  - `phase6_report/report.md` (optional)

### Keyword Extraction Strategy
- **Primary keywords**: From paper titles — extract 2-3 word technical phrases
- **Secondary keywords**: From paper tags in paper_db.jsonl
- **Tertiary keywords**: Method names, algorithm names, architecture names from synthesis
- **Author-based**: Search for prolific authors' GitHub profiles

### Expected Output
- 5-20 GitHub URLs directly from papers
- 10-30 search keywords of varying specificity
- Clear mapping: which papers mention which repos
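
The intake step above could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the field names `id`, `title`, and `tags` are assumptions about the `paper_db.jsonl` schema, and `extract_repo_urls` and `intake` are hypothetical helper names.

```python
import json
import re

# Matches github.com/<owner>/<repo>; stops at characters that cannot be in a repo path.
GITHUB_URL = re.compile(r"https?://github\.com/([\w.-]+)/([\w.-]+)", re.IGNORECASE)


def extract_repo_urls(text):
    """Return de-duplicated 'owner/name' ids found in free text, in first-seen order."""
    seen, repos = set(), []
    for owner, name in GITHUB_URL.findall(text):
        name = name.rstrip(".")          # drop a sentence-ending period
        if name.endswith(".git"):
            name = name[:-4]
        key = f"{owner}/{name}".lower()  # repo ids are case-insensitive
        if key not in seen:
            seen.add(key)
            repos.append(f"{owner}/{name}")
    return repos


def intake(jsonl_lines):
    """Map paper id -> repos and collect tag keywords (field names are assumptions)."""
    paper_repos, keywords = {}, []
    for line in jsonl_lines:
        if not line.strip():
            continue
        paper = json.loads(line)
        text = " ".join(str(v) for v in paper.values())
        repos = extract_repo_urls(text)
        if repos:
            paper_repos[paper.get("id", paper.get("title", "?"))] = repos
        for tag in paper.get("tags", []):
            if tag not in keywords:
                keywords.append(tag)
    return paper_repos, keywords
```

Calling `intake(open("paper_db.jsonl"))` would return the paper-to-repo mapping plus the secondary (tag) keywords. Primary and tertiary keywords still need phrase extraction from titles and the synthesis file, which this sketch does not attempt.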

### Edge Cases
- No code_repos.md: rely entirely on paper_db.jsonl keywords
- No paper_db.jsonl: ask user for manual topic keywords
- Non-English papers: extract English technical terms only

---

## Phase 2: Discovery — Detailed Guide

### Search Strategy Matrix
| Strategy | Query Pattern | Sort | When to Use |
|----------|--------------|------|-------------|
| Broad topic | "multi-agent LLM framework" | stars | Always — establishes landscape |
| Paper title | "{exact paper title}" | best-match | For each key paper |
| Method name | "{algorithm name} implementation" | stars | For specific techniques |
| Author search | "{author name}" + topic | updated | For prolific researchers |
| Code pattern | "class {ClassName}" | - | For specific implementations |
| Language-specific | topic + language:python | stars | When language matters |
| Awesome list | "awesome-{topic}" | stars | To find curated lists |
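
A few rows of the matrix can be turned into concrete requests along these lines. The sketch assumes GitHub's REST repository search endpoint and its `q`, `sort`, `order`, and `per_page` parameters. The helper names `repo_search_url` and `plan_queries` are hypothetical, and omitting `sort` is what yields the default best-match ordering.

```python
from urllib.parse import urlencode

SEARCH_REPOS = "https://api.github.com/search/repositories"


def repo_search_url(query, sort=None, language=None, per_page=50):
    """Build a repository-search URL; sort=None keeps the default best-match order."""
    q = query if language is None else f"{query} language:{language}"
    params = {"q": q, "per_page": per_page}
    if sort is not None:
        params["sort"] = sort      # e.g. "stars" or "updated"
        params["order"] = "desc"
    return f"{SEARCH_REPOS}?{urlencode(params)}"


def plan_queries(topic, paper_titles, methods):
    """Expand part of the strategy matrix into concrete search URLs."""
    urls = [repo_search_url(topic, sort="stars")]                  # broad topic
    urls += [repo_search_url(f'"{t}"') for t in paper_titles]      # paper title, best-match
    urls += [repo_search_url(f"{m} implementation", sort="stars") for m in methods]
    urls.append(repo_search_url(f"awesome-{topic.replace(' ', '-')}", sort="stars"))
    return urls
```

The code-pattern row is left out because it targets the separate code search endpoint instead of repository search.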

### Rate Limiting
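- GitHub search API: 30 requests/minute when authenticated, 10 requests/minute unauthenticated; code search allows 10 requests/minute and requires authentication
- Papers With Code API: ~60 requests/minute
- Always set GITHUB_TOKEN: it raises the search limit to 30 requests/minute and the core REST API limit to 5000 req/hr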
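
One simple way to stay under a per-minute budget such as the search limits listed above is to space calls evenly. The class below is a hypothetical sketch, not part of the skill. The `clock` and `sleep` parameters exist only so the behavior can be checked without real waiting.

```python
import time


class MinIntervalThrottle:
    """Block so that calls are spaced at least 60/per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None   # no call made yet

    def wait(self):
        """Sleep if needed, then reserve the next slot. Returns seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return delay
```

Typical use would be one throttle per API (for example `MinIntervalThrottle(30)` for repository search and `MinIntervalThrottle(10)` for code search), calling `wait()` immediately before each request.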

### Deduplication
- Primary key: `repo_id` (owner/name, case-insensitive)
- When merging duplicates: keep record with more populated fields; merge paper_ids lists
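
The merge rule above might look like this in practice. It is a sketch that assumes each record is a dict with a `repo_id` string and an optional `paper_ids` list. The helper names `populated` and `dedupe` are illustrative, and keeping the first-seen record on a tie is an arbitrary choice.

```python
def populated(record):
    """Count fields holding a non-empty value."""
    return sum(1 for v in record.values() if v not in (None, "", [], {}))


def dedupe(records):
    """Merge records sharing a case-insensitive repo_id, unioning paper_ids."""
    merged = {}
    for rec in records:
        key = rec["repo_id"].lower()
        if key not in merged:
            merged[key] = dict(rec, paper_ids=list(rec.get("paper_ids") or []))
            continue
        kept = merged[key]
        # Decide the winner before paper_ids are merged, so the merge cannot bias the count.
        winner = dict(rec) if populated(rec) > populated(kept) else kept
        ids = kept["paper_ids"]
        for pid in rec.get("paper_ids") or []:
            if pid not in ids:
                ids.append(pid)
        winner["paper_ids"] = ids
        merged[key] = winner
    return list(merged.values())
```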

### Target Numbers
- Aim for 50-200 unique repos before filtering
- Use at least 5 different search queries
- Check Papers With Code for all papers with arxiv_ids

---

## Phase 3: Filtering — Detailed Guide

### Scoring Deep Dive

**Activity Score** (0-1):
- Days since last push: <30d -> 0.9-1.0, 30-90d -> 0.6-0.8, 90-365d -> 0.3-0.5, >365d -> 0.0-0.2
- Frequency weight: pushed_at recency matters most

**Quality Score** (0-1):
- Stars (log-scaled, 30% weight): log(stars+1) normalized across set
- Forks (log-scaled, 20%): log(forks+1) normalized
- Has license (15%): any recognized license = 1.0
- Not archived (20%): archived repos get 0
- Has README (15%): non-empty readme_excerpt = 1.0

**Relevance Score** (0-1, manually assigned):
- 0.9-1.0: Direct implementation of a paper in the literature review
- 0.7-0.89: Closely related technique or framework
- 0.5-0.69: Related but tangential (e.g., general ML framework used by papers)
- 0.3-0.49: Loosely related (e.g., same domain, different approach)
- 0.0-0.29: Unlikely useful

**Composite**: relevance x 0.4 + quality x 0.35 + activity x 0.25
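
As a worked illustration of the weights above, the scores could be computed as follows. Two details are assumptions and not stated in the guide: each activity band is collapsed to its midpoint, and "normalized across set" is read as dividing by the log of the largest value in the candidate set. The record keys `stars`, `forks`, `license`, and `archived` are likewise assumed; `readme_excerpt` comes from the guide.

```python
import math


def activity_score(days_since_push):
    """Band midpoints for the ranges listed above (using midpoints is an assumption)."""
    if days_since_push < 30:
        return 0.95
    if days_since_push <= 90:
        return 0.7
    if days_since_push <= 365:
        return 0.4
    return 0.1


def log_norm(value, max_value):
    """log(x+1) scaled to 0-1 by the largest value in the candidate set."""
    return math.log1p(value) / math.log1p(max_value) if max_value > 0 else 0.0


def quality_score(repo, max_stars, max_forks):
    """Weighted sum of the five quality components (weights total 1.0)."""
    return (
        0.30 * log_norm(repo["stars"], max_stars)
        + 0.20 * log_norm(repo["forks"], max_forks)
        + 0.15 * (1.0 if repo.get("license") else 0.0)
        + 0.20 * (0.0 if repo.get("archived") else 1.0)
        + 0.15 * (1.0 if repo.get("readme_excerpt") else 0.0)
    )


def composite(relevance, quality, activity):
    """relevance x 0.4 + quality x 0.35 + activity x 0.25."""
    return 0.40 * relevance + 0.35 * quality + 0.25 * activity
```

For example, a repo with relevance 0.9, quality 0.8, and activity 0.7 gets 0.36 + 0.28 + 0.175 = 0.815.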

### Selection Criteria
- Always include: repos directly linked to papers
- Prefer: repos with tests, documentation, active maintenance
- Diversity: ensure mix of approaches, not just top-starred
- Minimum: 15 repos; Maximum: 30 repos
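
A rough sketch of the mechanical part of these criteria is shown below. It assumes each record carries a precomputed `composite` score and a `paper_ids` list, and `select_repos` is a hypothetical name. The "prefer" and "diversity" criteria need human judgment and are deliberately not modeled here.

```python
def select_repos(repos, minimum=15, maximum=30):
    """Keep paper-linked repos first, then fill by composite score up to `maximum`.

    Returns (chosen, shortfall), where shortfall is how many repos are missing
    to reach `minimum`.
    """
    ranked = sorted(repos, key=lambda r: r["composite"], reverse=True)
    linked = [r for r in ranked if r.get("paper_ids")]
    others = [r for r in ranked if not r.get("paper_ids")]
    # Paper-linked repos are always included, even if they alone exceed `maximum`.
    chosen = linked + others[: max(0, maximum - len(linked))]
    shortfall = max(0, minimum - len(chosen))
    return chosen, shortfall
```

A non-zero shortfall suggests going back to discovery with more queries before moving on.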

---

## Phase 4: Deep Dive — Detailed Guide

### What "Deep Dive" Means
This is NOT a README scan. You must:
1. Clone the repo (shallow)
2. Read the directory structure
3. Open and read key source files (model definitions, training loops, core algorithms)
4. Trace the execution flow from entry point to core logic
5. Evaluate code quality, documentation, test coverage
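
Steps 1 and 2 can be scripted. The sketch below shells out to `git clone --depth 1` and then lists the top of the directory tree. The function names and the depth cutoff of 2 are illustrative choices, and steps 3 to 5 still require actually reading the code.

```python
import subprocess
from pathlib import Path


def shallow_clone(repo_id, dest):
    """Clone only the latest commit of github.com/<repo_id> into dest."""
    url = f"https://github.com/{repo_id}.git"
    subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)


def tree(root, max_depth=2):
    """Return sorted relative paths up to max_depth, skipping the .git directory."""
    root = Path(root)
    paths = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts or len(rel.parts) > max_depth:
            continue
        # Trailing slash marks directories.
        paths.append(rel.as_posix() + ("/" if path.is_dir() else ""))
    return sorted(paths)
```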

### Per-Repo Analysis Template
What is this skill?

Intake phase extracts 5–20 repo URLs and 10–30 keywords from paper_db.jsonl and optional code_repos.md

Search strategy matrix: broad topic by stars, paper title best-match, method names, and author profiles

Keyword tiers: title phrases, paper tags, synthesis method names, and prolific author GitHub hunts

Edge-case handling when code_repos.md or paper_db.jsonl is missing

Maps which papers mention which repositories for traceable discovery

5–20 GitHub URLs expected from intake

10–30 search keywords of varying specificity

7-row search strategy matrix in discovery phase

Compatible agents: Claude Code, Cursor, Codex, Windsurf

Adoption & trust: 725 installs on skills.sh; 114 GitHub stars; 2/3 security scanners passed (skills.sh audits).
