Math Olympiad

Name: Math Olympiad
Author: anthropics

anthropics/claude-plugins-official

1.9k installs
32.9k repo stars
Updated July 31, 2026
anthropics/claude-plugins-official

A verified-correct solution to a competition math problem (IMO, Putnam, USAMO, AIME) with explicit confidence calibration: 'solved' (4-verifier consensus or majority vote), 'partial result' (some lemmas verified, others

About

Math Olympiad Solver is a reasoning-first agent that solves competition math problems using pure thinking, then submits cleaned solutions to pattern-armed adversarial verifiers in fresh context. It launches 8 - 12 parallel solver agents, each iterating internally (solve, self-improve, self-verify, correct), then strips thinking traces before verification to eliminate confirmation bias. Verifiers attack specific failure patterns (tautological reasoning, open-problem specializations, overgeneralized one-liners, hypothesis violations) across up to 5 independent instances. A 4-to-confirm/2-to-refute asymmetric vote gates confidence. If verification fails, a reviser agent fixes the proof without access to original reasoning; if abstention remains after 3 cycles, the skill returns honest partial results rather than guessing. For numeric AIME answers, majority-vote solvers skip proof machinery. All outputs pass a calibrated confidence check and, when correct, receive a presentation pass to optimize elegance.

8-12 parallel solver agents with internal iterate-refine loops; each independently self-verifies before returning
Adversarial verification in fresh context, blind to solver thinking traces and other verifiers' verdicts, using 12 speci
Asymmetric vote mechanism: 4 HOLDS to confirm, 2 HOLE FOUND to refute; early exit via pigeonhole principle to save ~30%
Interpretation-check phase before solving catches 50 of 63 "correct" answers that solved the wrong reading; verifier-fou
Calibrated abstention with partial-result honesty; deep-mode single-agent fallback (bounded local computation, no web) f

Math Olympiad by the numbers

1,897 all-time installs (skills.sh)
+287 installs in the week ending Jul 29, 2026 (Skillselion tracking)
Ranked #654 of 16,565 AI & Agent Building skills by installs in the Skillselion catalog
Security screen: LOW risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

At a glance

math-olympiad capabilities & compatibility

Per-problem cost ≈ (8 solvers × think time) + (up to 5 verifiers × reasoning) + (up to 3 revise agents × reasoning) + (1 presentation agent × reasoning); Haiku

Capabilities: parallel multi agent orchestration (8 12 solver · pattern armed adversarial verification (12 speci · asymmetric voting and early exit pigeonhole logi · context isolation enforcement (thinking trace st · revision workflows with gap specific fixing · deep mode fallback with bounded local computatio · interpretation checking phase · calibrated confidence reporting and honest abste
Platforms: macOS · Linux · Windows

npx skills add https://github.com/anthropics/claude-plugins-official --skill math-olympiad

Add your badge

Show developers this skill is listed on Skillselion. Paste this into your README.

$Listed on Skillselion$

[![Listed on Skillselion](https://skillselion.com/badge/skills/anthropics/claude-plugins-official/math-olympiad.svg)](https://skillselion.com/skills/anthropics/claude-plugins-official/math-olympiad)

Installs	1.9k
repo stars	★ 32.9k
Security audit	3 / 3 scanners passed
Last updated	July 31, 2026
Repository	anthropics/claude-plugins-official ↗

What it does

Solve and verify competition math proofs (IMO, Putnam, USAMO) with adversarial verification to catch reasoning errors.

Who is it for?

Solving individual competition math problems (IMO, Putnam, USAMO, AIME) where proof rigor and interpretation correctness are critical; batching full problem sets (6-problem competitions) in parallel; deep-dive analysis o

Skip if: Homework help or take-home exams; web-based problem lookup or literature search; exploratory mathematics research (distinct `math-research` workflow); real-time tutoring or student evaluation; problems requiring speciali

When should I use this skill?

Developer receives an IMO, Putnam, USAMO, or AIME problem and needs a verified solution with explicit confidence; or developer wants to batch-solve a full competition problem set; or developer has a proof and wants adver

What you get

Developers get a reasoning agent that autonomously solves competition math by (1) launching parallel solvers with internal refinement, (2) performing pattern-armed adversarial verification in fresh context, (3) voting wi

Verified proof write-up
Counterexample or gap report
Polished olympiad solution

By the numbers

Interpretation check catches 50 of 63 (79%) errors in prior work (Aletheia case study)
Internal solver refinement (Yang-Huang structure) achieves 85.7% on IMO with 8-12 parallel attempts
Asymmetric vote + pigeonhole exit saves ~30% verification cost on clear cases

Files

SKILL.mdMarkdownGitHub ↗

Math Olympiad Solver

The five things that change outcomes

1. Strip thinking before verifying — a verifier that sees the reasoning is biased toward agreement. Fresh context, cleaned proof only. 2. "Does this prove RH?" — if your theorem's specialization to ζ is a famous open problem, you have a gap. Most reliable red flag. 3. Short proof → extract the general lemma — try 2×2 counterexamples. If general form is false, find what's special about THIS instance. 4. Same gap twice → step back — the case split may be obscuring a unified argument. Three lines sometimes does what twelve pages couldn't. 5. Say "no confident solution" — wrong-and-confident is worse than honest abstain.

---

Tool policy: Solvers and verifiers use THINKING ONLY in the tight-budget workflow. Competition math is reasoning. Computation is for deep mode (§6c), and even then bounded — a recurrence that's doubly-exponential can't be computed past n~30, work mod 2^m instead.

---

When to use which approach

Problem	Approach	Verification
AIME numeric answer	Best-of-N → majority vote	Answer check only
Olympiad proof (IMO/Putnam/USAMO)	Full workflow below	5-pass adversarial
"Is this proof correct?"	Skip to verification (step 4)	Adversarial + spec-gaming
Full problem set (e.g. all 6 from a competition)	Sequential: one full workflow per problem, collect results, compile single PDF	Per-problem adversarial

Batch in one Workflow: Set opts.label on every agent() call to include the problem ID (e.g., label: "P3:solver:2"). Without labels, 36 results come back with no problem association. Run problems in parallel — the label is what matters, not ordering.

For a full problem set

Launch one solver workflow per problem (same VERBATIM prompt, different statement). Run them in parallel. When all return, run adversarial verification per problem. Problems that pass get their proof in the PDF; problems that abstain get "No confident solution" with partial notes.

Don't try to solve all N problems in one agent's context — each problem needs its own thinking budget and its own fresh-context verifier. The composition is mechanical: collect the per-problem outputs, fill in LaTeX sections, compile once. | "Simplify this proof" | Skip to presentation (step 8) | — |

---

The Workflow

1. Interpretation check (30 seconds, catches 50/63 of one class of errors)

Before solving anything, identify the interpretation.

Read the problem statement. List 2-3 ways it could be interpreted. For each:

is this reading TRIVIAL? If one reading makes the problem easy and another

makes it hard, the hard one is almost certainly intended. State which

interpretation you're solving and WHY you believe it's the intended one.

The Aletheia case study found 50 of 63 "technically correct" solutions were for the wrong interpretation. Olympiad problems often have a trap easy reading.

2. Generate candidates with internal refinement (parallel, thinking only)

Launch 8-12 attempt agents in parallel. Each agent internally iterates — solve → self-improve → self-verify → correct → repeat. This is the Yang-Huang structure that achieves 85.7% on IMO: one-shot solving isn't enough; per-attempt refinement matters.

The Agent tool cannot enforce tool restriction. Subagents get the full tool set. The only mechanism is the prompt. Use this prompt VERBATIM — do not summarize, do not synthesize your own:

NO COMPUTATION. Do not use Bash, Python, WebSearch, Read, Write, or any tool that runs code or fetches data. Numerical verification is not a proof step. "I computed n=1..10 and the pattern holds" is not a proof.

(If your agent harness requires a StructuredOutput or similar return-mechanism tool call, that is NOT a computation tool — call it to return your answer. The restriction is on tools that DO work, not tools that REPORT work.)

Your internal process (iterate until done):
- Solve: Complete rigorous solution.
- Self-improve: Reread. Fix gaps before a grader sees it.
- Self-verify: Strict grader mode. Every step justified?
- Correct: Fix and re-verify. Up to 5 rounds.
- Stop: Self-verify passes twice clean, OR 5 rounds, OR approach fundamentally wrong.

A correct answer from flawed reasoning is a failure. If incomplete, say so honestly. Never hide gaps.

PROBLEM: <insert the problem statement here>
ANGLE: <insert one starting angle here>

The first two paragraphs are load-bearing. A session that writes its own prompt and omits them will produce subagents that grind Python for 30 iterations and confidently get wrong answers — a pattern that fits n≤10 but fails at n=100 is not a proof.

Starting angles (vary across agents — see references/solver_heuristics.md):

Work out small cases (test past n=3)
Look for an invariant or monovariant
Consider the extremal case
Try induction
What symmetries?
Work backwards
Drop a condition — where does it become trivially false?
Generalize (inventor's paradox — more structure is sometimes easier)

Each returns its FINAL state (not intermediate rounds):

**Verdict**: complete solution | partial result | no progress
**Rounds**: [how many verify→correct cycles]
**Method**: [key idea, one paragraph]
**Detailed Solution**: [full step-by-step, every step justified]
**Answer**: [if applicable]
**Self-verification notes**: [what you caught and fixed; remaining concerns]

Retry policy: If an agent fails or times out, retry once. Transient failures happen.

3. Clean the solution (context isolation — the #1 lever)

The thinking trace biases the verifier toward agreement — a long chain of reasoning reads as supporting evidence even when the conclusion is wrong. Before any verification, strip:

All thinking-block content
All "Let me try..." / "Actually wait..." / "Hmm" prose
All false starts and backtracking

What remains: problem statement + clean final argument only.

Extract only the Method + Proof + Answer sections from each solver's output. The verifier never sees how the solver got there.

4. Adversarial verify (fresh context, pattern-armed)

For each cleaned solution, launch a fresh verifier agent. Fresh context: it sees only (problem statement + cleaned solution). No tools.

The verifier's job is to ATTACK, not grade. Load references/adversarial_prompts.md for the prompts. The key patterns it runs:

Pattern	The check
#4	Does this theorem specialize to a famous object (ζ, quadratic reciprocity, etc.) and prove something open about it? → gap
#18	Substitute the proof's own intermediate identities into any "remaining gap." Recover the original claim? → tautological
#40	Is any step a "one-line lemma"? Extract the GENERAL form. Find a 2×2 counterexample. If the general form is false, find what special structure saves THIS instance
#5	For each invoked theorem: re-check hypotheses FROM SCRATCH. "Continuous on [0,1]" ≠ "continuous on ℝ"
#6	Any infinite sum "bounded" via a regularized value? Check the boundary — if there's a pole there, the sum diverges

Full pattern list: references/verifier_patterns.md

Verifier returns:

**Verdict**: HOLDS | HOLE FOUND | UNCLEAR

**If HOLE FOUND**:
- Location: [quote the problematic step]
- Pattern: [which check fired, or "other"]
- Why it breaks: [specific]
- Fixable?: [yes with X / no, fundamental]

5. Rank and vote-verify (asymmetric + early exit)

Rank solutions by (verdict, verifier confidence). Take the top one. Run up to 5 fresh verifier agents.

Asymmetric thresholds: 4 HOLDS to confirm, 2 HOLE FOUND to refute. Why asymmetric: one flaky verifier shouldn't kill a correct proof; but two independent dissents is a real signal.

Pigeonhole early exit: stop launching verifiers once the outcome is decided.

2 say HOLE FOUND → refuted, stop (save the remaining 3 calls)
4 say HOLDS → confirmed, stop (save the 5th)
After 3 verifiers: if 2 HOLDS + 1 HOLE, launch 2 more (outcome undecided). If

3 HOLDS + 0 HOLE, launch 1 more (could still hit 4-1).

Dual context-isolation: each verifier is blind to (a) the solver's thinking trace — already stripped in step 3 — AND (b) other verifiers' verdicts. Each verifier thinks it's the first. No "3 agents already confirmed this" social proof.

A solver cannot verify its own solution. Different agent, fresh context.

5b. When one case won't close — step back before grinding

If a proof splits into cases and one case proves easily but the other resists: before grinding through the hard case, ask whether there's a route that makes the split disappear.

The pattern that saves you: the hard case's very hypothesis often implies something strong about an _intermediate object_ you haven't looked at. Use that implication directly instead of the original chain.

Concrete shape: proving f(n) ≤ cn for a constrained function f, with a case split on a prime p dividing f(n). One branch closes by index arguments in (ℤ/p^e)\. The other branch resists — same group structure, but the arithmetic doesn't contradict. The fix: the hypothesis "p | f(n)" plugged back into the governing equation implies f(p) = p itself*. Once you have that, a Fermat+Dirichlet argument kills both branches in three lines. The case split was a detour — it was splitting on a variable that, under the hypothesis, takes a known value.

Check when stuck on case B:

What does case B's hypothesis imply about f at _other_ inputs?
Is there a different pair (a,b) to plug into the governing equation?
Are you proving too much? (A cleaner contradiction needs less machinery.)

This is also a presentation-pass win: the split-free proof is shorter AND more general.

6. Revise (if needed)

If verification finds a hole: launch a reviser agent. It gets (cleaned solution + verifier's hole report). STILL no access to the original thinking — the reviser works from the hole, not by rereading how you got there.

A verifier found this issue in the proof:
[hole report]

Fix the proof. If the hole is fundamental (the approach doesn't work), say so and return **Verdict: no confident solution** with what partial progress remains.

For any step you cannot fully close, mark it inline: [GAP: specific description of what remains]. Gaps in the proof text, not in a separate list — they're greppable and the next reviser knows exactly where to look.

Up to 3 revise cycles. Then re-run the vote on the revised proof.

If pattern #40 fired (one-line-proof-too-clean), the reviser gets a stronger brief — the Adversarial Brief template from references/adversarial_prompts.md §7. It forces a binary: "the general lemma is obviously false (here's a 2×2 counterexample) — so either find what's special about THIS case, or find where the proof breaks." Can't return "looks fine."

6c. Deep mode (when tight-budget abstains)

The standard workflow is tight-budget: 8 solvers, ~15 min, pure reasoning. When it abstains, the problem may need more time, not more capability.

Deep mode is a single focused agent with:

Unlimited time — no wall-clock pressure
Targeted computation allowed — modular arithmetic checks, small-case

enumeration, symbolic verification of identities. NOT exploratory brute force or unbounded recursion.

The abstention reason as starting point — if verifiers found a specific

gap, start there. If solvers never claimed complete, start from what they partially proved.

The archetype: a focused agent that gets the proven-so-far state plus "one case of Lemma 5 is open" — and finds a 3-line argument the case split was obscuring. Often under 10 minutes with almost no computation. Deep mode is about giving the problem sustained attention, not throwing compute at it.

What deep mode is NOT: open-ended exploration, literature search, looking up solutions, multi-day investigation. That's a different workflow (math-research). Deep mode is still "solve THIS problem yourself" — just without the clock.

NO WEB. NO LOOKUP. Deep mode may use Bash/Python for bounded computation, but NEVER WebFetch, WebSearch, or any network access. Finding the solution on AoPS or a blog is not solving the problem — it's cheating on an olympiad, and it teaches us nothing about the skill's actual capability. Put this at the TOP of the deep-mode prompt:

NO WEB ACCESS. Do not use WebFetch, WebSearch, or any tool that touches the internet. Do not look up this problem, its solution, or related problems. You are solving this yourself — the only allowed computation is local (Bash/Python for mod-k arithmetic, small-case enumeration n≤10, symbolic identity checks). If you invoke a web tool, the proof is void.

Computation bounds in deep mode (bug #8 lesson): A6's b\_{n+1}=2b_n²+b_n+1 is doubly-exponential; b_99 has ~10^{2^98} digits. Never compute such objects exactly — work in ℤ/2^m, or track only v_p(·), or prove the recursion mod the quantity you care about. If a computation is running longer than 60 seconds, it's probably unbounded. Kill it and work symbolically.

Step 6d (not optional): After any ABSTAIN at the verify stage, automatically launch one deep-mode agent before writing the abstention into the output. Give it:

The problem statement
The best partial proof from tight-budget solvers
The verifier gap descriptions (what specifically didn't close)
The instruction: "NO WEB ACCESS — do not look up this problem or its solution.

Bounded local computation allowed (mod 2^k, small cases n≤10, symbolic identity checks via Bash/Python only). 60-second computation limit. If n≤10 brute force reveals a pattern the tight-budget solvers missed, that pattern IS the proof structure."

The deep agent may find the construction the pure-reasoning solvers couldn't see. If it also abstains, THEN write the abstention. Do not skip this step — problems with √n or log n answers are often invisible to pure reasoning because the optimal structure is the asymmetric one.

Orchestrator self-restraint: The orchestrator itself must not web-search the problem "to help" the deep agent. If you're tempted to Fetch an AoPS thread "just to check the answer," don't — that contaminates the skill's output and misrepresents its capability.

7. Calibrated abstention

If 3 revise cycles all fail: stop and admit it.

**Verdict**: no confident solution

**What was tried**: [approaches]
**What WAS proven**: [any lemma or partial result that survived verification]
**Where it breaks**: [the unfixed hole]

Do NOT guess. A wrong confident answer is worse than an honest "couldn't solve it." The metric that matters is CONDITIONAL accuracy — when you say "solved," are you right?

8. Presentation pass (after correctness is established)

A VERIFIED-CORRECT proof is often not a BEAUTIFUL proof. The order you discovered it is rarely the best order to present it. Launch a fresh presentation agent with the verified proof.

Load references/presentation_prompts.md. The agent asks:

What's the simplest way to say this?
Which lemmas should be inlined? Which deserve to stand alone?
Is anything OVERKILL? (constructing a double exponential when linear suffices)
Now that we know the answer, is there a 3-line hindsight proof?

Output: LaTeX-formatted proof. If pdflatex is available (scripts/check_latex.sh returns 0), also compile to PDF via scripts/compile_pdf.sh.

---

Model tier defaults

Read references/model_tier_defaults.md for full details. Summary:

Model	Solvers	Verify passes	Abstain after	Presentation
Haiku	8	3	2 revise fails	skip
Sonnet	4	5	3 revise fails	yes
Opus	3	5 + full pattern sweep	4 revise fails	2 drafts, pick cleaner

Weaker models: more parallel attempts, faster abstention. Stronger models: deeper verification, more presentation effort.

---

For numeric-answer problems (AIME-style)

Skip the proof machinery. Run 5-7 solvers with varied approaches, take majority vote on the numeric answer. If no majority: verify the top 2 candidates by substitution.

---

Key references

references/verifier_patterns.md — the 12 adversarial checks
references/adversarial_prompts.md — ready-to-use verifier prompts
references/presentation_prompts.md — beautification prompts + LaTeX template
references/model_tier_defaults.md — per-model configuration

---

What makes this different from generic verify-and-refine

1. Dual context isolation: verifier is blind to (a) the solver's thinking trace — which biases toward agreement — and (b) other verifiers' verdicts — social proof also biases. Each verifier thinks it's first. 2. Pattern-specific attacks: not "is this correct?" but "does this make the #40 mistake? the #4 mistake?" Specific beats generic. The 7-category refutation taxonomy gives the verifier a checklist. 3. Asymmetric vote + pigeonhole exit: 4-to-confirm, 2-to-refute. One flaky verifier doesn't kill a correct proof; two dissents does. Stop launching verifiers once the outcome is decided — saves ~30% of verification cost on clear cases. 4. Specification-gaming check first: explicitly asks "is this the intended interpretation?" before solving. The #1 failure mode in prior work (50/63 "correct" answers solved the wrong reading). 5. Calibrated abstention: will say "no confident solution" with partial results. Optimizes conditional accuracy, not coverage. 6. Presentation pass: correctness and elegance are separate steps. The presentation agent gets the VERIFIED proof and finds the cleanest way to say it.

[
  {"query": "Solve this IMO problem: Let n ≥ 2 be an integer. Prove that...", "should_trigger": true},
  {"query": "Is this Putnam proof correct? Here's my attempt at B3...", "should_trigger": true},
  {"query": "Find a counterexample to: every continuous function on [0,1] is uniformly continuous", "should_trigger": true},
  {"query": "Prove this olympiad inequality: for positive reals a,b,c with a+b+c=1...", "should_trigger": true},
  {"query": "Help me with this USAMO geometry problem", "should_trigger": true},
  {"query": "Verify my solution to AIME 2024 problem 12", "should_trigger": true},
  {"query": "I think there's a gap in this competition proof, can you find it?", "should_trigger": true},
  {"query": "Simplify this proof — it feels overly complicated", "should_trigger": true},
  {"query": "Here's a conjecture from a math competition. Is it true?", "should_trigger": true},
  {"query": "What's the cleanest way to present this olympiad solution?", "should_trigger": true},

  {"query": "Help me verify the time complexity of this sorting algorithm", "should_trigger": false},
  {"query": "Write a Python function that checks if a number is prime", "should_trigger": false},
  {"query": "I'm doing research on the Riemann Hypothesis, where should I start reading?", "should_trigger": false},
  {"query": "Debug this proof assistant code — my Lean tactic isn't working", "should_trigger": false},
  {"query": "Explain the proof of the fundamental theorem of calculus to a high schooler", "should_trigger": false},
  {"query": "What's a good textbook for learning competition math?", "should_trigger": false},
  {"query": "Generate 10 practice problems similar to AIME level", "should_trigger": false},
  {"query": "Compute the integral of x^2 sin(x) dx", "should_trigger": false},
  {"query": "Review my research paper draft on analytic number theory", "should_trigger": false},
  {"query": "What's the difference between IMO and Putnam in difficulty?", "should_trigger": false}
]

Adversarial Verifier Prompts — Math Olympiad

Prompt bank for the verifier subagent. Fresh context: problem statement + cleaned solution, NO thinking trace. Agent has NO tools — pure reasoning only.

Source: shared/verifier_patterns_source.md. Background: arXiv:2503.21934 showed self-verified 85.7% IMO success drops to <5% under human grading. These prompts are the human grader.

Verifier isolation: You do NOT know how other verifiers voted. You are not told if this proof has been confirmed or refuted by anyone else. Assume you're the first and only reviewer. (Social proof — "3 others confirmed" — biases toward agreement.)

---

Reasons to REFUTE (the taxonomy — look for ANY one of these)

Your goal is to find ANY reason to refute. These are the seven categories a hole falls into:

1. Step doesn't follow — The conclusion of some step is not implied by its premises. (Includes direction errors: A>B and C>D does NOT give A−C>B−D.) 2. Hypothesis not satisfied — An invoked theorem needs a condition the proof never verified. (Pattern #5: "entire" ≠ "analytic in a disk".) 3. Claim false in small case — A stated identity or bound fails at n=2, n=3, or the first nontrivial block. Mentally test it. 4. Tautological — The "gap" at the end is the original problem in disguise. (Pattern #18: substitute the proof's own identities back in.) 5. Proves too much — The argument's skeleton applies to a famous object and proves something open or false about it. (Pattern #4.) 6. Wrong interpretation — Solves an easier reading of the problem than the intended one. (Pattern #60.) 7. Hand-wave at the crux — "iterating and optimizing gives the result", "by standard methods", "the details are routine" — at exactly the step that ISN'T routine.

If none of these fire after a genuine attempt, CONFIRM. Do not confirm because the proof _sounds_ confident.

---

1. General Adversarial (default)

You are an adversarial verifier. Below is a problem and a proposed solution.

You are NOT grading this. You are trying to BREAK it. Assume the author is a strong student who made one subtle error that a sympathetic reader would gloss over. Your job is to find that error. If you cannot find one after genuinely trying, say so — but do not say so just because the solution is confidently written.

Attack each step:

Is the claimed inequality actually in the claimed direction? Reason through a

small case mentally.

Is every "clearly" / "obviously" / "it follows that" actually clear? These

words often mark the exact spot where the author convinced themselves of something false.

Does every cited theorem's hypothesis actually hold? Check quantifiers: "for

all" vs "there exists", pointwise vs average.

At each "WLOG": is generality actually preserved, or does the reduction

discard the hard case?

Does the argument use a property that's true for the _generic_ object but not

the _specific_ one in the problem?

You have no tools. Reason about small cases in your head — do not claim to have "computed" anything.

Output format:

VERDICT: CORRECT | INCORRECT | GAP
CONFIDENCE: high | medium | low
ISSUE: [if INCORRECT/GAP: one-sentence location, then one-paragraph explanation. If CORRECT: the step you tried hardest to break and why it held.]

---

2. Pattern #4 — Would It Prove Too Much?

You are an adversarial verifier running a single check: does this argument prove something famously open or famously false?

Read the proposed solution. Ignore whether the proof is locally valid. Instead:

1. Strip the argument down to its skeleton: what properties of the given objects does it _actually use_? 2. Find the most famous object that shares exactly those properties. (If it bounds a sum using only "positive decreasing terms" — does the harmonic series have positive decreasing terms? If it uses only "multiplicative and bounded by 1" — does the Möbius function qualify?) 3. Mentally rerun the argument on that substitute. What does it now prove?

If the substitute conclusion is a known open problem or a known falsehood, the original proof has a gap. The gap is at the step where the argument stops working for the substitute — find that step. That step is silently using a property the author never stated.

If the argument genuinely uses a property specific to the problem's object that the famous substitute lacks, say which property and where it's used.

Output format:

VERDICT: CORRECT | INCORRECT
CONFIDENCE: high | medium | low
SUBSTITUTE_TESTED: [what object you substituted]
ISSUE: [if it proves too much: which step fails for the substitute, and what unstated property is needed. If not: which step uses the specific property and why the substitute fails there.]

---

3. Pattern #40 — One-Line-Proof-Too-Clean

You are an adversarial verifier targeting short proofs. The solution below contains at least one step that is suspiciously brief — one line doing a lot of work.

For the shortest load-bearing step in the solution:

1. Extract the general lemma. Write down the most general claim the step is implicitly using. Not "for this sum" but "for any sum of this shape." Not "for the determinant" but "for any function of the matrix entries with this property." 2. Try to break the general lemma with a 2×2 case. Two elements, two terms, a 2×2 matrix — the smallest nontrivial instance. Reason it through in your head. Can you find values where the general lemma fails? 3. Judge:

If the general lemma survives your 2×2 attack: the step is probably fine.
If the general lemma FAILS at 2×2 but the specific instance in the proof

still seems to work: the step is INCORRECT as written. There is special structure in the problem that makes it true, and the proof does not invoke that structure. The author got the right answer for the wrong reason.

The classic failure: "rank depends only on support" — but [[1,1],[1,1]] has rank 1 and [[1,1],[1,−1]] has rank 2, same support. General lemma false; a specific instance was true because of a sign-factorization the proof never mentioned.

Output format:

VERDICT: CORRECT | INCORRECT | GAP
CONFIDENCE: high | medium | low
GENERAL_LEMMA: [the extracted general claim]
2x2_TEST: [the instance you tried, and what it showed]
ISSUE: [if the general lemma is false: what special structure the proof failed to invoke]

---

4. Pattern #18 — Tautological Reduction

You are an adversarial verifier checking one thing: did the solution argue itself in a circle?

The solution likely proceeds through a chain of reductions or equivalent reformulations, ending at a "final estimate" or "key inequality" that it then proves directly. Your task:

1. List every identity, equality, or substitution the solution establishes along the way. (Things like "A = B + C", "the sum splits as X + Y", "by the earlier lemma, P = Q".) 2. Take the FINAL claim — the one the solution presents as "and this is now easy" or "this follows from [standard fact]". 3. Substitute the chain's OWN identities (from step 1) back into that final claim. Expand. Simplify. 4. What do you get? If you recover the ORIGINAL problem — or something trivially equivalent to it — then the "reduction" is a tautology. The proof has done nothing; it renamed the problem and declared it solved.

The trap: long chains feel like progress. "We've reduced it to bounding X!" is only progress if X is actually different from what you started with. Sometimes X is just the original, wearing a hat.

Output format:

VERDICT: CORRECT | INCORRECT | GAP
CONFIDENCE: high | medium | low
FINAL_CLAIM: [the claim the solution treats as the easy endpoint]
SUBSTITUTED_BACK: [what it becomes after expanding the chain's own identities]
ISSUE: [is it the original problem? trivially equivalent? genuinely simpler? say which and why]

---

5. Pattern #60 — Specification-Gaming

You are an adversarial verifier checking one thing: did the solution answer the easiest interpretation of the question instead of the intended one?

Read the problem statement alone. Before looking at the solution in detail:

1. Write down 2–3 plausible readings of what the problem is asking. Pay attention to: scope of quantifiers ("find all" vs "find one"), what "determine" means (a formula? a characterization? an existence proof?), boundary cases (does n=0 or n=1 count? is the empty set allowed? are degenerate configurations included?). 2. Rank them by how hard they would be to solve. 3. Which reading did the solution actually address?

If the solution addresses the EASIEST reading — and especially if the problem under that reading would be trivially short for its stated source (an IMO problem that becomes a two-liner is a red flag) — then be suspicious. Olympiad problems are calibrated to their point values. A final-problem that falls in three lines means you're probably not solving the final problem.

Also check: did the solution prove something about _an_ object when the problem asked about _all_ such objects? Did it show _possibility_ when the problem wanted _necessity_?

Output format:

VERDICT: CORRECT | INCORRECT | GAP
CONFIDENCE: high | medium | low
READING_SOLVED: [which interpretation the solution addresses]
READING_INTENDED: [which interpretation you believe was intended, and why]
ISSUE: [if they differ: what the solution is missing. If they match: why the easy reading is genuinely the intended one.]

---

6. Consecutive-Verify (5-pass loop)

You are verifier pass {K} of 5. A solution passes only if all five independent verifiers agree.

Verify INDEPENDENTLY. You have not seen — and must not imagine — what any other verifier said. Do not reason "this probably already got checked." Your vote is the only vote you control. If you wave something through on the assumption that another pass will catch it, and the other four passes reason the same way, a wrong solution ships.

Read the problem. Read the solution. Trace every step yourself, from scratch.

One bias to actively resist: when a solution is well-written, confident, and uses standard machinery correctly in _most_ places, you will be inclined to trust the one place you can't quite follow. Invert this. Well-written and confident is exactly what a subtly wrong solution looks like — the author convinced themselves before they convinced the math. The place you can't quite follow is the place to press hardest.

You have no tools. Reason through small cases mentally; do not claim numerical verification.

Output format:

VERDICT: CORRECT | INCORRECT | GAP
CONFIDENCE: high | medium | low
PASS_NUMBER: {K}
ISSUE: [if INCORRECT/GAP: exact step and why. If CORRECT: the step you found hardest to verify, and the reasoning that convinced you it holds.]

---

7. Adversarial Brief (for the reviser when pattern #40 fires)

Use this instead of a general "fix the hole" prompt when a verifier flagged a one-line lemma whose general form is false. This framing forces a binary — the reviser cannot return "looks fine."

Adversarial brief: The principle "[extracted general lemma]" is obviously

false in general — [trivial counterexample, e.g., [[1,1],[1,1]] has rank 1 and

[[1,1],[1,−1]] has rank 2, same support].

So exactly one of these is true, and your job is to determine which:

(A) The conclusion holds for a DIFFERENT reason specific to this case.

Find that reason. What structure does [the specific object in the problem]

have that [the counterexample] lacks? That structure is the real proof.

(B) The proof is wrong and the conclusion fails at [concrete prediction of

where it diverges — e.g., "the first case where the block is ≥2×2, which is

m=4"].

Return (A) with the special structure identified, or (B) with the failure

point. "The original proof is actually fine" is not an available answer — the

general lemma is false, so either something saves this instance or nothing

does.

The best outcome is (A) — the thesis survives AND you learn why. The corrected proof is more informative than the false one.

Output format:

RESOLUTION: (A) SPECIAL_STRUCTURE | (B) CONCLUSION_FALSE
IF (A): The structure [specific object] has that [counterexample] lacks: [...]. Revised proof: [...]
IF (B): Fails at [parameter/case]. Reason: [...]

Solver-Refiner Agent Prompt

You are solving a competition math problem. You have NO tools — pure reasoning only.

Your process (iterate internally until done)

Round 1: Solve

Think deeply. Produce a complete solution.

Round 2: Self-improve

Reread your solution. Fix any errors or gaps you find. This is your chance to catch your own mistakes before a grader does.

Round 3: Self-verify

Switch roles. You are now a strict IMO grader. Check every step. Classify each issue as:

Critical Error: breaks the logical chain (e.g., claiming A>B and C>D

implies A-C>B-D)

Justification Gap: conclusion may be correct but argument incomplete

If you find issues: note them, go back to your solver role, correct the solution, verify again. Repeat up to 5 times.

Stop when: Either your self-verification passes cleanly 2 times in a row, OR you've done 5 correction rounds, OR you're certain the approach is fundamentally wrong.

Core principles (from Yang-Huang IMO25)

Rigor is paramount: A correct final answer from flawed reasoning is a

failure.

Honesty about completeness: If you cannot find a complete solution, say

so. Present significant partial results (key lemma proven, one case resolved, a bound without achievability). Do NOT guess or hide gaps.

Use TeX: All mathematics in $...$ or $$...$$.

Output format (ONLY your FINAL state after all rounds — not the intermediate iterations)

**Verdict**: complete solution | partial result | no progress

**Rounds**: [how many self-verify→correct cycles you ran]

**Method**: [one paragraph: the key idea]

**Detailed Solution**:
[Full step-by-step proof. Every step justified. No "clearly" or "obviously" — justify everything.]

**Answer**: [if the problem asks for a specific value/set/characterization]

**Self-verification notes**: [what you caught and fixed; any remaining concerns]

---

PROBLEM: {statement}

HINT: {angle}

Model Tier Defaults

Parameters scale with model capability. Budget is not the constraint — the constraints are diminishing returns (more voters stop helping past a point) and the asymmetric noise floor (Haiku verifiers are individually less reliable, so the right response is width not depth).

Haiku

Width compensates for per-sample noise. Scaffolding is where the leverage is.

Parallel solvers: 12 (wide fan — each individual solve is weaker, so cast

a wider net)

Vote budget: 7 verifiers, need 5-confirm / 3-refute (pigeonhole exit: stop

when outcome decided)

Abstain threshold: 3 consecutive revise cycles fail
Pattern sweep: all 12 patterns — Haiku can follow a checklist, the

patterns are the scaffold

Presentation pass: yes, 3 drafts, comparator picks cleanest. Haiku's raw

output is rougher, so this matters MORE not less.

Rationale: The skill's value is highest where the base model is weakest.

Give Haiku the full harness. The 3-refute threshold (higher than Sonnet's 2) accounts for Haiku verifiers being individually noisier — don't let 2 confused Haikus kill a correct proof.

Sonnet

Balanced.

Parallel solvers: 6
Vote budget: 5 verifiers, need 4-confirm / 2-refute
Abstain threshold: 3 consecutive revise cycles fail
Pattern sweep: all 12
Presentation pass: 2 drafts, comparator picks cleaner
Rationale: 4-of-5 tolerates one flake. 2 dissents is signal.

Opus

Depth. Each sample is strong, so invest in making the adversarial pass harder.

Parallel solvers: 4
Vote budget: 5 general verifiers (4-confirm / 2-refute) PLUS one dedicated

verifier per pattern in verifier_patterns.md (12 targeted attacks). Any pattern-specific HOLE FOUND counts toward refute.

Abstain threshold: 5 consecutive revise cycles fail (trust the model's

ability to eventually fix)

Pattern sweep: all 12, each with its own dedicated agent
Presentation pass: 3 drafts with different instructions ("most elegant,"

"most elementary," "shortest"), comparator picks the best. Strong models can genuinely produce different _styles_ of proof.

Rationale: Opus can execute the deep patterns (#19 base-vs-derived, #22

mean-first) that need real mathematical judgment. The 12 dedicated pattern passes are where the model's capability is best spent — it's the difference between "be skeptical" and "check THIS specific thing."

On the pigeonhole exit

Kept at all tiers — not because of cost, but because once inflight >= confirm_needed + refute_needed - 1, the remaining votes carry no information regardless of how they land. Launching them anyway is pure latency.

Identifying the tier

If the orchestrating session doesn't know which model it is, default to Sonnet configuration. A reasonable heuristic: ask the model to self-identify in its first response and match against haiku/sonnet/opus in the output.

Presentation Pass — Prompts and Templates

Premise: Aletheia's PDFs are beautiful; raw IMO output is not. The difference is a _presentation pass_: after a proof is verified correct, a fresh agent — one who didn't sweat through the discovery — finds the cleanest way to say it. The discoverer is too attached to the scaffolding.

The Erdős paper even criticizes Aletheia's _own_ output: _"somewhat overkill; any f whose inverse is at most [X] would suffice, no need to take the double exponential."_ The presentation pass is where overkill goes to die.

---

1. The Presentation Pass Prompt

Paste this to a fresh subagent along with the verified proof. The agent must not have discovery-context; that's the point.

You are given a verified, correct proof. Your job is not to check it — it

is correct. Your job is to find the cleanest presentation. The order it

was discovered in is almost never the order it should be read in.

Work through these questions in order:

Hindsight shortcuts. Now that you know the answer, is there a 3-line

argument? The discoverer built machinery to _find_ the key step; you already

_have_ the key step. Can the machinery be discarded? (Classic: a long

case-bash that, in hindsight, collapses once you spot the invariant.)

Overkill. Is any bound stronger than needed? Any construction more general

than the problem requires? If a double exponential works but a linear function

also works, use the linear one — the reader will wonder what the double

exponential is hiding. Match the strength of each tool to the strength of what

it's proving.

What to cut. Which steps _verify_ without _illuminating_? Discovery leaves

a debris field: sanity checks, dead ends backed out of, "note that X (we won't

use this)". Delete them. If a paragraph can be removed and the proof still

compiles in the reader's head, remove it.

Lemma granularity. Inline a lemma if it's used once and the proof is ≤3

lines. Keep it standalone if it's used twice, or if its _statement alone_

clarifies the structure (even with a 1-line proof). Name standalone lemmas

descriptively — "Combinatorial dimension bound", not "Lemma 2".

Order. Lead with the main statement. Then the one idea that makes it work.

Then the details. Isolate the one genuinely clever step — there's almost

always exactly one — and let everything else be obviously routine _by

contrast_.

Step names. Number steps _and_ name them: "**Step 3: Fourier inversion and

translation invariance.**" The name is a promise to the reader about what this

block accomplishes. Signpost reductions explicitly: "We are reduced to showing

that…"

Output clean LaTeX using the template below. Aim for: a strong grad student

could reconstruct every suppressed detail, a professor could skim the step

names alone and nod.

---

2. LaTeX Output Template

Minimal preamble — Aletheia's environments, none of its ornament. No tcolorbox, no custom colors.

\documentclass[11pt]{article}
\usepackage[margin=1.25in]{geometry}
\usepackage{amsmath, amssymb, amsthm, mathtools}
\usepackage[shortlabels]{enumitem}
\usepackage{hyperref}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{claim}{Claim}
\newtheorem{proposition}[theorem]{Proposition}

\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem*{remark}{Remark}

\begin{document}

\section*{Problem}
% Restate the problem exactly. No paraphrase.

\section*{Solution}

\begin{theorem}
% State what you will prove, in full. If the answer is "yes" or "no"
% or a specific value, state it here so the reader isn't kept in suspense.
\end{theorem}

% If a lemma is reused or structurally load-bearing, state it before
% the main proof. One-shot verifications get inlined below.
% \begin{lemma}\label{lem:key}
%   ...
% \end{lemma}
% \begin{proof} ... \end{proof}

\begin{proof}[Proof of Theorem]
\textbf{Step 1: [Descriptive name — what this step accomplishes].}
% e.g. "Reduction to the compact case." / "The key invariant."

% Display important equations; inline routine ones.
% End a reduction step with: "We are reduced to showing that ..."

\textbf{Step 2: [Name].}
% ...

\textbf{Step $n$: Conclusion.}
% One or two sentences. Make the contradiction / induction close / final
% computation land visibly.
\end{proof}

\end{document}

Style conventions lifted from the Aletheia samples:

Display math for the equation a step _produces_; inline math for the algebra

getting there.

Cite precisely when invoking a named result:

_(Jacquet–Piatetski-Shapiro–Shalika, 1981)_ — not "by a well-known theorem".

In contradiction proofs: state the false assumption plainly ("Suppose, for

contradiction, that…"), and flag the collision plainly ("We are led to the contradiction $0 > 0$.").

Integer bounds earn the ceiling: if $d \ge n/k$ and $d \in \mathbb{Z}$, write

$d \ge \lceil n/k \rceil$. Free sharpness.

---

3. Anti-Patterns to Catch

The presentation agent should flag and fix these:

Discovery-order exposition. "First I tried X, which led me to notice Y…" —

the reader doesn't care. State Y.

Overkill constructions. The tell: the bound you prove is parametrically

stronger than what the next line consumes. Weaken it until it's tight.

Proof by intimidation. _"It is trivial to see that…"_, _"Obviously…"_, _"A

standard argument shows…"_ — if it's trivial, one sentence suffices. Write the sentence.

Unnecessary generality. Proving it for all $n$ when the problem asks about

$n=3$ and the general case adds no insight, only indices.

Orphan lemmas. Stated, proved, cited once, three lines long. Inline it.
Unlabeled case splits. Five cases, no indication of why five or what

distinguishes them. Name the cases; say upfront which one carries the content.

Missing signposts. A page of computation with no "we are reduced to" / "it

suffices to show" markers. The reader shouldn't have to reverse-engineer your strategy.

Solver Heuristics (Pólya + Olympiad Practice)

For solver subagents. These are the moves to try when the direct approach stalls.

Pólya's core moves (from "How to Solve It")

Have you seen a related problem? Not the same problem — one with the same UNKNOWN, or the same STRUCTURE. A problem about covering points with lines has the same shape as one about covering lattice points with arithmetic progressions.

Specialize. If you can't solve the given problem, solve n=3, n=4, n=5 by hand. The pattern is often the proof. (But: test past the first nontrivial case — n≤3 may be degenerate.)

Generalize (inventor's paradox). The more ambitious problem sometimes has MORE structure and is easier. "Prove for all primes" might be harder than "prove for all integers" if the integer case has a clean induction.

Drop a condition. What happens if you relax one hypothesis? Does the result become trivially false? Where? That WHERE is often the key step — the point where the condition is load-bearing.

Work backwards. Start from what you want to prove. What would imply it? What would imply THAT? If this chain meets something you can prove directly, you have the proof (reversed).

Auxiliary element. Introduce something not in the problem — a new variable, a reflection, a well-chosen function. Olympiad geometry lives on this (auxiliary points, circles).

Olympiad-specific moves

Find the invariant. If there's a process (game, transformation, iteration), what quantity is preserved? Parity, sum, product modulo something.

Find the extremal. Take the LARGEST, or SMALLEST, or LEFTMOST object. Extremal choices often have extra properties that generic choices don't.

Double count. Count the same thing two ways. Incidences, edges, sums over pairs.

Coloring / parity. Can you 2-color the objects so the claim becomes a parity statement?

Smoothing / adjusting. For inequalities: if you perturb two variables closer together (or further apart), does the expression increase or decrease? Extremize.

Symmetry → WLOG. If the problem is symmetric in x,y,z, you can assume x≤y≤z. But only if the conclusion is ALSO symmetric.

Geometry-specific moves

Standard angles (induction, invariants, extremal) are often wrong-shaped for olympiad geometry. Use these instead:

Coordinate bash. Place the configuration in coordinates. Choose them to kill degrees of freedom (origin at a center, axis along a line). Grind out the algebra. Ugly but reliable.

Auxiliary point. Introduce a point not in the problem — a reflection, a second intersection, the point where two lines "should" meet. Often the key construction is finding the right extra point.

Power of a point. For any point P and circle ω, PA·PB is the same for every line through P meeting ω at A, B. Use it to turn ratios into equalities.

Spiral similarity / rotation. Two directly similar triangles are related by a spiral similarity (rotation + scaling about a fixed point). Find that point — it often lies on a circle you already have.

Inversion. When there are many circles or tangencies, invert about a well-chosen center. Circles through the center become lines; tangencies become simpler tangencies.

Angle chase. Cyclic quadrilaterals give equal angles. Tangent-chord gives an angle equal to the inscribed angle. Chase around the figure.

Geometry-specific moves (these are DIFFERENT)

The standard angles (invariant, extremal, induction) don't fit circles/circumcenters/orthocenters. Geometry needs:

Coordinate bash. Place one point at origin, another on the x-axis. Compute everything explicitly. The algebra is heavy but mechanical. For two circles with centers M, N and radii r, R: set M=(0,0), N=(d,0), then the intersection points have x-coordinate (r²+d²−R²)/2d and everything follows.

Auxiliary point. Introduce a point not in the problem — the reflection, the foot of a perpendicular, the second intersection. Olympiad geometry lives on finding the right extra point.

Power of a point. For point P and circle Γ: PA·PB is constant for any line through P meeting Γ at A,B. This converts circles to products.

Inversion. Circles through the center become lines. Sometimes the inverted problem is trivial.

Angle chasing / cyclic quads. Four points are concyclic iff opposite angles sum to π. Chase angles until enough equalities force concyclicity.

Recurrence-specific trap

For recurrences like b\_{n+1} = P(b_n) where P is polynomial degree ≥ 2: b_n grows doubly-exponentially. You cannot compute b_30 exactly — it has trillions of digits. Work in ℤ/2^m (or ℤ/p^m) from the start. Prove b_n ≡ r_n (mod 2^m) by induction on n, NOT by computing b_n.

When the answer involves √n or log n

These answers often come from a structure that is NOT the obvious/symmetric one. The diagonal, the identity, the "natural" choice frequently gives the WORST case, not the best — it clusters the constraint in a way that prevents large substructures.

For pure-reasoning solvers: Before claiming the symmetric choice is optimal, ask "what if I deliberately break the symmetry?" For grid/covering problems: what if the gaps are SPREAD OUT instead of clustered? For sequences: what if the extremal sequence is NOT constant or linear?

For deep-mode agents: Brute-force n=3..8 before theorizing. If the formula that fits is n+c√n instead of cn, the structure has √n-sized blocks.

The Look Back phase (after you have a proof)

Can you check it? Plug in small cases. Does n=3 give what your formula

says?

Can you prove it differently? A second proof is a verification. And often

shorter.

Is your bound tight? If you proved ≤ N and the answer is exactly N, find

the extremal case. If you can't, your bound might be loose.

What did you actually use? Sometimes you used less than all the hypotheses

— the real theorem is stronger.

Verifier Patterns — Olympiad Subset

For a verifier with no tools, only reasoning. Each pattern is a mental check you can run on a candidate proof. These are the specific ways proofs go wrong that self-verification misses. (Source: 59 patterns from real research sessions; these 13 need no grep/fetch/compute.)

Run #18 and #19 after any positive finding. Run #40 on any proof that feels too short.

---

Pattern 4: Would it prove a famous open problem?

The check: Specialize the claimed theorem to the most famous object in its class (ζ(s), the Ramsey number, the Collatz map). Does the specialization settle a known open problem?

What it catches: A bound "for all Dirichlet series with property P" that, applied to ζ, would prove Lindelöf — the proof treated arithmetic input as generic.

How to run it: Find the step where the argument uses a generic property. Ask: does ζ (or the canonical hard instance) actually have this property? The gap is always where it doesn't.

---

Pattern 5: Outside the hypothesis class

The check: For each example claimed to satisfy a theorem, re-derive the hypotheses from the definition — don't trust the label.

What it catches: "f is entire of order ≤1, so by Thm 3.1…" — but Thm 3.1 needs f analytic in a _full disk_ around 0; a natural boundary on the imaginary axis blocks it.

How to run it: Write out the theorem's hypothesis verbatim. For each claimed instance, check inclusion from scratch. Watch for near-synonyms ("bounded" vs "bounded on the line"; "entire" vs "analytic on a domain").

---

Pattern 6: Divergent sum behind analytic continuation

The check: When a divergent-looking sum is "bounded by ζ(s)" or similar, evaluate the bounding function at the boundary of the claimed range.

What it catches: "Σ 1/n ≤ ζ(1)" — but ζ(1) is a pole. The analytic continuation of a sum is not the sum.

How to run it: Mentally substitute the boundary value of the parameter into the bounding expression. A pole or ∞ there means the original sum diverges, regardless of what the continued function says elsewhere.

---

Pattern 10: Same keywords, different theorem

The check: When a cited theorem has the right _words_ but the fit feels off — check pointwise vs averaged, uniform vs a.e., finite vs asymptotic.

What it catches: Invoking "Fourier decay ⇒ bound" for a pointwise estimate, when the cited decay theorem needs curvature and you only have it on average.

How to run it: State precisely what the proof _needs_ (pointwise? for all x? with what uniformity?). State what the cited theorem _gives_. Sometimes the weaker version is enough and this _closes_ a gap; sometimes the gap is real.

---

Pattern 17: Test past the first nontrivial block

The check: Before accepting a pattern from small cases, identify where the structure first becomes nontrivial. Confirm the pattern holds _past_ that threshold.

What it catches: "Checked m = 1, 2, 3: all blocks have rank 1." But m ≤ 3 gives only 1×2 blocks — rank 1 is forced. First 2×2 appears at m = 4, and there the claim fails.

How to run it: Ask "what makes the small cases easy?" Find the parameter value where that degeneracy disappears. The claim must survive at least one case beyond it.

---

Pattern 18: Tautological reduction

The check: When a reduction chain ends at "estimate X would finish it," substitute the chain's own already-proven identities into X.

What it catches: "Suffices to show ∫|P|² ≤ C·H." But the chain itself proved ∫|P|² = H + 2Re(OD') _exactly_. So X is just the original conjecture plus a cosmetic shift — not a reduction.

How to run it: Take each identity the chain proved along the way and plug it into the "final gap." If you recover the starting conjecture (or something at least as strong), the chain went in a circle.

---

Pattern 19: Derived obstruction vs base obstruction

The check: When the same obstruction kills 3+ independent approaches, compute the disputed property on the _original_ object — before any reduction.

What it catches: "det(Hessian) = 0, ruled surface, decoupling fails" — for the phase log(2πm−θ). But the _base_ phase is nθ − t·log(n), and _its_ Hessian has det = −1. The obstruction lived in the proxy.

How to run it: Name the object the obstruction is _about_. Is it the thing you started with, or something a reduction produced? Go back to the start and check directly.

---

Pattern 22: Absolute-sum gives O(K); compute the mean first

The check: Before accepting that Σₖ Xₖ = O(1) is "too hard because |Xₖ| summed gives O(K)," compute the mean of Xₖ over the varying parameter.

What it catches: Weyl equidistribution gives mean(Xₖ) = 0 _exactly_. So Σ Xₖ is a fluctuation sum — the target is Var = O(1), and half the conjecture falls in one line.

How to run it: Separate Xₖ into mean + fluctuation. If orthogonality/equidistribution forces the mean to zero, you were never fighting K terms of size 1 — you were fighting √K terms (or better). Rewrite the target.

---

Pattern 23: Formula's scope never stated

The check: For any identity used in the proof, ask: was this proved for the general case, or for a special case that the author silently generalized?

What it catches: "κ₄ = 3d − 1" was derived for 2-piece Cantor sets. The proof applies it to an m-piece set, where the real formula involves additive energy and can differ by a constant factor.

How to run it: Trace the identity to where it was first introduced. What were the standing assumptions _there_? Check that those assumptions still hold at the point of use.

---

Pattern 35: Count quantifiers before diagonalizing

The check: Before "diagonalize against class C using property P," ask whether _certifying_ P is an ∃-statement or a ∀-statement.

What it catches: "Find an x not computed by any small circuit" — but verifying "no small circuit computes x" is a ∀ over circuits. Your diagonalizer is in Σ₂, not NP. (This is _why_ Kannan gives Σ₂ᴾ ⊄ SIZE, not NP ⊄ SIZE.)

How to run it: Write the diagonalization as a formula. Count alternations. If you need ∀∃ to describe the witness, you've jumped a level in the hierarchy.

---

Pattern 40: One-line-proof-too-clean

The check: Extract the proof's key step as a lemma in _full generality_ — not specialized to the objects at hand. Try a 2×2 counterexample to the general lemma.

What it catches: "rank depends only on monomial support" — but [[1,1],[1,1]] has rank 1 and [[1,1],[1,−1]] has rank 2 with the same support. The general lemma is false; the specific case holds because sgn(π) = f(S)·g(T) factors. _That's_ the real proof.

How to run it: If the general lemma dies but the specific conclusion survives numerically, there's hidden structure. Find it. The real proof goes through _that_, not the false lemma.

---

Pattern 58: Quantifier direction on domain size

The check: Before claiming one statement is "strictly stronger" than another because its domain is smaller — check whether the quantifier is ∀ or ∃.

What it catches: "∀ S ∈ D, φ(S)" over a _smaller_ D is _weaker_ (fewer obligations). "∃ S ∈ D, φ(S)" over smaller D is _stronger_ (fewer candidates). Backwards strength claims swap these.

How to run it: Say the statement out loud with the quantifier explicit. Shrinking the domain under ∀ drops requirements. Shrinking under ∃ drops witnesses. Only one direction is "harder."

---

Pattern 60: Easiest-interpretation trap

The check: Before solving, write down 2–3 readings of the problem statement. Flag whichever one makes the problem trivial.

What it catches: 63 "technically correct" solutions; only 13 "meaningfully correct." The gap: solving the easiest grammatically-valid reading instead of the intended one. Olympiad problems often _plant_ an easy misreading.

How to run it: Ask "under which reading is this a real problem?" If your interpretation makes it a one-liner and the problem is worth 7 points, you've probably chosen wrong. Solve the hard reading; note the easy one only as a remark.

#!/bin/bash
# Exit 0 if a LaTeX compiler is available, 1 otherwise.
# Used by SKILL.md to decide whether to offer PDF compilation.
command -v pdflatex >/dev/null 2>&1 || command -v xelatex >/dev/null 2>&1

#!/bin/bash
# Compile a LaTeX proof body into a standalone PDF.
# Usage: compile_pdf.sh <body.tex> <output_dir>
# The body.tex should contain just the \begin{document}...\end{document} contents
# (theorem, proof, lemmas). This script wraps it in a minimal preamble.

set -euo pipefail

BODY="$1"
OUTDIR="${2:-.}"
BASENAME=$(basename "$BODY" .tex)
FULL="$OUTDIR/${BASENAME}_full.tex"

cat > "$FULL" <<'PREAMBLE'
\documentclass[11pt]{article}
\usepackage[margin=1.25in]{geometry}
\usepackage{amsmath, amssymb, amsthm, mathtools}
\usepackage[shortlabels]{enumitem}
\usepackage{enumitem}
\usepackage[colorlinks=true, linkcolor=blue, citecolor=blue]{hyperref}

\theoremstyle{plain}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}[theorem]{Lemma}
\newtheorem{claim}[theorem]{Claim}
\newtheorem{proposition}[theorem]{Proposition}
\newtheorem{corollary}[theorem]{Corollary}

\theoremstyle{definition}
\newtheorem{definition}[theorem]{Definition}
\newtheorem{remark}[theorem]{Remark}

\begin{document}
PREAMBLE

cat "$BODY" >> "$FULL"

cat >> "$FULL" <<'CLOSE'
\end{document}
CLOSE

if command -v pdflatex >/dev/null 2>&1; then
    COMPILER=pdflatex
elif command -v xelatex >/dev/null 2>&1; then
    COMPILER=xelatex
else
    echo "No LaTeX compiler found" >&2
    exit 1
fi

cd "$OUTDIR"
$COMPILER -interaction=nonstopmode -halt-on-error "${BASENAME}_full.tex" >/dev/null
$COMPILER -interaction=nonstopmode -halt-on-error "${BASENAME}_full.tex" >/dev/null
echo "$OUTDIR/${BASENAME}_full.pdf"

Related skills

Setup Matt Pocock SkillsScaffold the per-repo configuration that Matt Pocock’s engineering agent skills rely on so they understand the issue tracker, triage labels, and domain documentation la462k185k

Lark Skill MakerQuickly turn any Lark/Feishu OpenAPI call or multi-step workflow into a reusable agent skill with its own SKILL.md.379k15.8k

CavemanSlash token usage by roughly 75% while keeping every technical detail intact when working with Claude Code, Cursor or similar agents.378k92.5k

Lark AppsConnect Claude, Cursor or custom agents directly to Lark (Feishu) for messaging, document automation, approval workflows and enterprise data access.375k

Running Claude Code Via Litellm CopilotRun Claude Code at a fraction of the cost by routing requests through LiteLLM to the GitHub Copilot Chat API.270k72

Codex PetGenerate a complete Codex Pet spritesheet and metadata from one reference image without needing an OpenAI key or Codex Pro.246k8

How it compares

Use math-olympiad for competition proof work; use general calculator or numeric libraries for engineering math and data analysis.

FAQ

Why does the verifier see only the cleaned proof, not the solver's thinking trace?

A verifier that sees a long chain of reasoning is biased toward agreement—the chain reads as supporting evidence even when the conclusion is wrong. Fresh context with only the final proof and problem statement forces adversarial thinking.

What does 'asymmetric vote' mean? Why 4 to confirm and 2 to refute?

One flaky verifier shouldn't kill a correct proof, but two independent dissents signal a real problem. 4 HOLDS confirms high confidence; 2 HOLE FOUND refutes (minority opinion matters when it agrees). Asymmetry optimizes conditional accuracy.

What is deep mode, and when should I use it?

Deep mode is a single focused agent with unlimited time and bounded local computation (no web, no lookup). Use it after tight-budget solvers or verifiers abstain. Often finds constructions pure reasoning missed (e.g., √n structure, mod-k patterns).

Is Math Olympiad safe to install?

skills.sh reports 3 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

AI & Agent Buildingagentsresearch

About

Math Olympiad by the numbers

math-olympiad capabilities & compatibility

Add your badge

What it does

Who is it for?

When should I use this skill?

What you get

By the numbers

Files

Math Olympiad Solver

The five things that change outcomes

When to use which approach

For a full problem set

The Workflow

1. Interpretation check (30 seconds, catches 50/63 of one class of errors)

2. Generate candidates with internal refinement (parallel, thinking only)

3. Clean the solution (context isolation — the #1 lever)

4. Adversarial verify (fresh context, pattern-armed)

5. Rank and vote-verify (asymmetric + early exit)

5b. When one case won't close — step back before grinding

6. Revise (if needed)

6c. Deep mode (when tight-budget abstains)

7. Calibrated abstention

8. Presentation pass (after correctness is established)

Model tier defaults

For numeric-answer problems (AIME-style)

Key references

What makes this different from generic verify-and-refine

Adversarial Verifier Prompts — Math Olympiad

Reasons to REFUTE (the taxonomy — look for ANY one of these)

1. General Adversarial (default)

2. Pattern #4 — Would It Prove Too Much?

3. Pattern #40 — One-Line-Proof-Too-Clean

4. Pattern #18 — Tautological Reduction

5. Pattern #60 — Specification-Gaming

6. Consecutive-Verify (5-pass loop)

7. Adversarial Brief (for the reviser when pattern #40 fires)

Solver-Refiner Agent Prompt

Your process (iterate internally until done)

Core principles (from Yang-Huang IMO25)

Output format (ONLY your FINAL state after all rounds — not the intermediate iterations)

Construction Patterns

Spread vs cluster

Moment curve for distinctness

When brute-force reveals √n

Avoid: storing specific answers here

Model Tier Defaults

Haiku

Sonnet

Opus

On the pigeonhole exit

Identifying the tier

Presentation Pass — Prompts and Templates

1. The Presentation Pass Prompt

2. LaTeX Output Template

3. Anti-Patterns to Catch

Solver Heuristics (Pólya + Olympiad Practice)

Pólya's core moves (from "How to Solve It")

Olympiad-specific moves

Geometry-specific moves

Geometry-specific moves (these are DIFFERENT)

Recurrence-specific trap

When the answer involves √n or log n

The Look Back phase (after you have a proof)

Verifier Patterns — Olympiad Subset

Pattern 4: Would it prove a famous open problem?

Pattern 5: Outside the hypothesis class

Pattern 6: Divergent sum behind analytic continuation

Pattern 10: Same keywords, different theorem

Pattern 17: Test past the first nontrivial block

Pattern 18: Tautological reduction

Pattern 19: Derived obstruction vs base obstruction

Pattern 22: Absolute-sum gives O(K); compute the mean first

Pattern 23: Formula's scope never stated

Pattern 35: Count quantifiers before diagonalizing

Pattern 40: One-line-proof-too-clean

Pattern 58: Quantifier direction on domain size

Pattern 60: Easiest-interpretation trap

Related skills