Prompt Engineer Toolkit

Name: Prompt Engineer Toolkit
Author: alirezarezvani

alirezarezvani/claude-skills

579 installs
23.5k repo stars
Updated July 17, 2026

prompt-engineer-toolkit is a Claude Code skill that analyzes, rewrites, and templates marketing prompts so developers running AI-assisted ad copy, email, and social workflows get consistent on-brand output after model changes.

About

prompt-engineer-toolkit is an MIT-licensed Claude Code skill (version 1.0.0) for prompt engineering on marketing use cases. It analyzes weak prompts, rewrites them for clearer AI output, and packages reusable templates for ad copy, email campaigns, and social posts. The skill also structures end-to-end AI content workflows so teams can version and retest prompts instead of rewriting from scratch after every model update. Developers reach for prompt-engineer-toolkit when a request mentions prompt engineering, prompt templates, AI writing quality, or optimizing AI content workflows inside Claude Code or compatible agents.

A/B prompt evaluation against structured marketing test cases
Quantitative scoring for adherence, relevance, and safety checks
Immutable prompt version history with diffs and changelog
Reusable templates for ad copy, email campaigns, and social media
End-to-end AI content workflow structuring with evidence-based rollout

Prompt Engineer Toolkit by the numbers

579 all-time installs (skills.sh)
Ranked #699 of 1,879 Marketing & SEO skills by installs in the Skillselion catalog
Security screen: HIGH risk (skills.sh audit)
Data as of Jul 31, 2026 (Skillselion catalog sync)

npx skills add https://github.com/alirezarezvani/claude-skills --skill prompt-engineer-toolkit


Installs	579
repo stars	★ 23.5k
Security audit	2 / 3 scanners passed
Last updated	July 17, 2026
Repository	alirezarezvani/claude-skills ↗

How do you version and test marketing AI prompts?

Turn messy marketing prompts into versioned, testable templates and workflows so AI-assisted ad copy, email, and social posts stay on-brand after model changes.

Who is it for?

Developers and growth engineers who ship AI-assisted marketing copy through Claude Code and need repeatable, on-brand prompt templates.

Skip if: your team only needs one-off creative writing without reusable templates, versioning, or marketing-channel structure.

When should I use this skill?

The user asks to improve prompts, build prompt templates, optimize AI content workflows, or mentions prompt engineering for marketing copy.

What you get

Versioned prompt templates, rewritten prompts, and structured AI content workflow definitions for marketing channels.

Rewritten prompts
Reusable prompt templates
AI content workflow outlines

By the numbers

Skill metadata version 1.0.0
MIT license
Updated 2026-03-06

Files

SKILL.md (Markdown)

Prompt Engineer Toolkit

Overview

Use this skill to move prompts from ad-hoc drafts to production assets with repeatable testing, versioning, and regression safety. It emphasizes measurable quality over intuition. Apply it when launching a new LLM feature that needs reliable outputs, when prompt quality degrades after model or instruction changes, when multiple team members edit prompts and need history/diffs, when you need evidence-based prompt choice for production rollout, or when you want consistent prompt governance across environments.

Core Capabilities

A/B prompt evaluation against structured test cases
Quantitative scoring for adherence, relevance, and safety checks
Prompt version tracking with immutable history and changelog
Prompt diffs to review behavior-impacting edits
Reusable prompt templates and selection guidance
Regression-friendly workflows for model/prompt updates

Key Workflows

1. Run Prompt A/B Test

Prepare JSON test cases and run:

python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --runner-cmd 'my-llm-cli --prompt {prompt} --input {input}' \
  --format text

Input can also be supplied as a single JSON payload, either on stdin or through --input.
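As a rough sketch of the payload form: the top-level keys below (prompt_a, prompt_b, cases) are the ones prompt_tester.py looks up, and the per-case fields are the ones its scorer reads. The prompt text, the case content, and the payload.json file name mentioned afterwards are illustrative placeholders, not part of the skill.

```python
import json

# Top-level keys mirror what prompt_tester.py reads from its payload.
# Prompt text and case content are made-up placeholders for illustration.
payload = {
    "prompt_a": "Summarize the ticket in one sentence: {{input}}",
    "prompt_b": "You are a support analyst. Summarize in one sentence: {{input}}",
    "cases": [
        {
            "id": "refund-1",
            "input": "Customer asks for a refund after a double charge.",
            "expected_contains": ["refund"],
            "forbidden_contains": ["guaranteed"],
            "expected_regex": [r"\.$"],
        }
    ],
}

payload_json = json.dumps(payload, indent=2)
print(payload_json)
```

Saved as payload.json, this text could be passed with --input payload.json or piped to the script on stdin.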

2. Choose Winner With Evidence

The tester scores outputs per case and aggregates:

expected content coverage
forbidden content violations
regex/format compliance
output length sanity

Use the higher-scoring prompt as the candidate baseline, then run the regression suite.
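To make the per-case arithmetic concrete, the sketch below mirrors the scoring steps found in the bundled scripts/prompt_tester.py (start at 100, subtract for missed and forbidden content, add for regex matches, apply length penalties, clamp to 0-100). The function name mechanical_score and the sample outputs are hypothetical, and the sketch is simplified: it does not skip invalid regex patterns the way the script does.

```python
import re
from typing import List


def mechanical_score(output: str, expected: List[str], forbidden: List[str], regexes: List[str]) -> float:
    """Simplified mirror of the per-case arithmetic in scripts/prompt_tester.py."""
    lowered = output.lower()
    missed = sum(1 for item in expected if item.lower() not in lowered)
    forbidden_hits = sum(1 for item in forbidden if item.lower() in lowered)
    regex_matches = sum(1 for pattern in regexes if re.search(pattern, output, flags=re.MULTILINE))

    score = 100.0 - missed * 15 - forbidden_hits * 25 + regex_matches * 8
    if len(output) > 4000:        # penalty for unbounded verbosity
        score -= 10
    if len(output.strip()) < 10:  # penalty for near-empty output
        score -= 10
    return max(0.0, min(100.0, score))


clean = mechanical_score("Get 20% off your first order today.", ["20% off"], ["guaranteed"], [r"\d+%"])
risky = mechanical_score("Guaranteed results today.", ["20% off"], ["guaranteed"], [r"\d+%"])
print(clean, risky)  # 100.0 60.0
```

Because the result is clamped at 100, regex matches in this arithmetic can only offset penalties; a missing regex match is not penalized on its own.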

3. Version Prompts

# Add version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/support_v3.txt \
  --author alice

# Diff versions
python3 scripts/prompt_versioner.py diff --name support_classifier --from-version 2 --to-version 3

# Changelog
python3 scripts/prompt_versioner.py changelog --name support_classifier

4. Regression Loop

1. Store baseline version.
2. Propose prompt edits.
3. Re-run A/B test.
4. Promote only if score and safety constraints improve.

Script Interfaces

python3 scripts/prompt_tester.py --help
  Reads prompts/cases from stdin or --input
  Optional external runner command
  Emits text or JSON metrics
python3 scripts/prompt_versioner.py --help
  Manages prompt history (add, list, diff, changelog)
  Stores metadata and content snapshots locally

Pitfalls, Best Practices & Review Checklist

Avoid these mistakes:

1. Picking prompts from single-case outputs — use a realistic, edge-case-rich test suite.
2. Changing prompt and model simultaneously — always isolate variables.
3. Missing forbidden_contains (forbidden-content) checks in evaluation criteria.
4. Editing prompts without version metadata, author, or change rationale.
5. Skipping semantic diffs before deploying a new prompt version.
6. Optimizing one benchmark while harming edge cases — track the full suite.
7. Swapping models without rerunning the baseline A/B suite.

Before promoting any prompt, confirm:

[ ] Task intent is explicit and unambiguous.
[ ] Output schema/format is explicit.
[ ] Safety and exclusion constraints are explicit.
[ ] No contradictory instructions.
[ ] No unnecessary verbosity tokens.
[ ] A/B score improves and violation count stays at zero.

References

references/prompt-templates.md — 6 production marketing templates (ad copy, email sequence, social repurposing, landing sections, SEO meta, brand-voice rewrite) plus generic building blocks; each written to be graded by prompt_tester.py
references/technique-guide.md — technique-selection table for marketing tasks + the LLM-governance stack for marketing teams (claim discipline, disclosure rules, data boundaries, human-review gates)
references/evaluation-rubric.md — mechanical scoring weights, acceptance gates, marketing quality dimensions, test-suite design, and eval anti-patterns
README.md

Evaluation Design

Each test case should define:

input: realistic production-like input
expected_contains: required markers/content
forbidden_contains: disallowed phrases or unsafe content
expected_regex: required structural patterns

This enables deterministic grading across prompt variants.
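Since grading is only deterministic when the cases themselves are well formed, a pre-flight check can help. The sketch below is one possible validator, not part of the skill: validate_case and LIST_FIELDS are hypothetical names, and the field names are the four listed above. It also flags regexes that do not compile, which matters because the bundled tester silently skips invalid patterns.

```python
import re
from typing import Any, Dict, List

# The three list-valued fields a test case may define (names from the section above).
LIST_FIELDS = ("expected_contains", "forbidden_contains", "expected_regex")


def validate_case(case: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one test case (empty list means OK)."""
    problems: List[str] = []
    if not str(case.get("input", "")).strip():
        problems.append("input is missing or empty")
    for field in LIST_FIELDS:
        if not isinstance(case.get(field, []), list):
            problems.append(f"{field} must be a list")
    regexes = case.get("expected_regex", [])
    if isinstance(regexes, list):
        for pattern in regexes:
            try:
                re.compile(str(pattern))
            except re.error as exc:
                problems.append(f"expected_regex {pattern!r} does not compile: {exc}")
    return problems


good_case = {
    "input": "Promote the spring sale to returning customers.",
    "expected_contains": ["spring sale"],
    "forbidden_contains": ["guaranteed"],
    "expected_regex": [r"\d+%"],
}
bad_case = {"input": "", "expected_regex": ["("]}

print(validate_case(good_case))  # []
print(validate_case(bad_case))
```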

Versioning Policy

Use semantic prompt identifiers per feature (support_classifier, ad_copy_shortform).
Record author + change note for every revision.
Never overwrite historical versions.
Diff before promoting a new prompt to production.
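The policy above can be sketched as an append-only JSONL store. This is a minimal illustration under stated assumptions, not the implementation in scripts/prompt_versioner.py: the record fields follow the PromptVersion dataclass in that script, while load_versions, add_version, and the sample prompts are hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import List


@dataclass
class PromptVersion:
    name: str
    version: int
    author: str
    timestamp: str
    change_note: str
    prompt: str


def load_versions(store: Path, name: str) -> List[PromptVersion]:
    """Read every stored version of one named prompt, oldest first."""
    if not store.exists():
        return []
    lines = store.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    return [PromptVersion(**row) for row in rows if row["name"] == name]


def add_version(store: Path, name: str, prompt: str, author: str, change_note: str) -> PromptVersion:
    """Append a new version; earlier lines in the file are never rewritten."""
    existing = load_versions(store, name)
    next_version = max((v.version for v in existing), default=0) + 1
    timestamp = datetime.now(timezone.utc).isoformat()
    record = PromptVersion(name, next_version, author, timestamp, change_note, prompt)
    with store.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")
    return record


store_path = Path(tempfile.mkdtemp()) / ".prompt_versions.jsonl"
first = add_version(store_path, "ad_copy_shortform", "Write 3 ad variants.", "alice", "initial")
second = add_version(store_path, "ad_copy_shortform", "Write 3 ad variants, one claim each.", "alice", "add claim rule")
print(first.version, second.version)  # 1 2
```

Opening the file in append mode is what keeps history immutable here: a new revision adds a line and leaves earlier lines untouched.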

Rollout Strategy

1. Create baseline prompt version.
2. Propose candidate prompt.
3. Run A/B suite against same cases.
4. Promote only if winner improves average and keeps violation count at zero.
5. Track post-release feedback and feed new failure cases back into test suite.
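The promotion rule in the rollout steps can be sketched as a small decision function. It assumes each row carries the score and forbidden_hits keys that the tester emits per case in JSON mode; should_promote and the sample rows are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def should_promote(baseline: List[Dict[str, float]], candidate: List[Dict[str, float]]) -> bool:
    """Promote only if the candidate has zero forbidden hits and a higher average."""
    candidate_violations = sum(row["forbidden_hits"] for row in candidate)
    if candidate_violations > 0:
        return False  # violations block promotion regardless of score
    return mean(row["score"] for row in candidate) > mean(row["score"] for row in baseline)


baseline_rows = [{"score": 80.0, "forbidden_hits": 0}, {"score": 70.0, "forbidden_hits": 0}]
better_rows = [{"score": 90.0, "forbidden_hits": 0}, {"score": 85.0, "forbidden_hits": 0}]
violating_rows = [{"score": 95.0, "forbidden_hits": 0}, {"score": 75.0, "forbidden_hits": 1}]

print(should_promote(baseline_rows, better_rows))     # True
print(should_promote(baseline_rows, violating_rows))  # False
```

Checking violations before comparing averages means a higher-scoring candidate with a single forbidden hit is still rejected.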

Prompt Engineer Toolkit

Production toolkit for evaluating and versioning prompts with measurable quality signals. Includes A/B testing automation and prompt history management with diffs.

Quick Start

# Run A/B prompt evaluation
python3 scripts/prompt_tester.py \
  --prompt-a-file prompts/a.txt \
  --prompt-b-file prompts/b.txt \
  --cases-file testcases.json \
  --format text

# Store a prompt version
python3 scripts/prompt_versioner.py add \
  --name support_classifier \
  --prompt-file prompts/a.txt \
  --author team

Included Tools

scripts/prompt_tester.py: A/B testing with per-case scoring and aggregate winner
scripts/prompt_versioner.py: prompt history (add, list, diff, changelog) in local JSONL store

References

references/prompt-templates.md
references/technique-guide.md
references/evaluation-rubric.md

Installation

Claude Code

cp -R marketing-skill/prompt-engineer-toolkit ~/.claude/skills/prompt-engineer-toolkit

OpenAI Codex

cp -R marketing-skill/prompt-engineer-toolkit ~/.codex/skills/prompt-engineer-toolkit

OpenClaw

cp -R marketing-skill/prompt-engineer-toolkit ~/.openclaw/skills/prompt-engineer-toolkit

Evaluation Rubric for Marketing Prompts

How to score prompt outputs deterministically with scripts/prompt_tester.py, and how to extend the mechanical score with marketing-specific quality dimensions a regex can't fully capture. The principle throughout: evidence over intuition — a prompt is "better" only if it scores better on a realistic, edge-case-rich suite (never a single cherry-picked output).

Layer 1 — Mechanical Score (what `prompt_tester.py` computes)

Score each test case 0-100 via weighted criteria:

Criterion	Direction	Typical weight	Test-case field
Expected content coverage	+	40%	`expected_contains`
Forbidden content violations	− (hard penalty)	30%	`forbidden_contains`
Regex/format compliance	+	20%	`expected_regex`
Output length sanity	±	10%	min/max length

Acceptance gates (promote a prompt only if all hold):

Average score ≥ 85 across the suite
No individual case below 70
Zero critical forbidden-content hits (brand-banned words, invented statistics markers, competitor names where disallowed, compliance terms — see governance guide)
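The three gates can be checked mechanically once per-case scores are available. The sketch below is an assumption-laden illustration: check_gates is a hypothetical helper, the thresholds are the ones stated above, and the count of critical hits is assumed to be computed elsewhere.

```python
from typing import List, Tuple


def check_gates(scores: List[float], critical_hits: int) -> Tuple[bool, List[str]]:
    """Apply the acceptance gates; return (passed, list of failed gates)."""
    if not scores:
        return False, ["suite is empty"]
    failures: List[str] = []
    average = sum(scores) / len(scores)
    if average < 85:
        failures.append(f"average {average:.2f} is below 85")
    if min(scores) < 70:
        failures.append(f"lowest case score {min(scores)} is below 70")
    if critical_hits > 0:
        failures.append(f"{critical_hits} critical forbidden-content hit(s)")
    return not failures, failures


print(check_gates([95.0, 90.0, 72.0], critical_hits=0))   # (True, [])
print(check_gates([100.0, 100.0, 65.0], critical_hits=0))
```

The second call fails even though its average is higher, because one case falls below the per-case floor.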

Layer 2 — Marketing Quality Dimensions

Encode as many of these as possible into Layer-1 fields; what remains needs human review on a sample (5-10 outputs per variant):

Dimension	Mechanical proxy	Human check
Specificity	`expected_regex` for digits/named entities	Is the specific claim true and sourced?
Brand voice	`forbidden_contains` lexicon-no list	Does it sound like us, not "an AI"?
Claim safety	forbidden superlatives ("best", "#1", "guaranteed") unless proof token present	Would legal/compliance sign off?
Format fitness	char-count regex per platform	Does it read natively on the platform?
CTA quality	required CTA token	Single clear action, value-phrased?
Audience fit	required pain-point/persona token	Would the named persona care?

Scoring scale for human review (per Hamel Husain's eval guidance, keep it binary where possible): pass/fail per dimension beats 1-5 ratings — raters agree more, and failures become new forbidden_contains/expected_regex entries, ratcheting the mechanical suite forward.

Building the Test Suite

A marketing prompt suite needs at minimum:

1. Happy-path cases (3-5) — typical inputs with complete variables
2. Sparse-input cases (2-3) — missing proof points, vague audience: the prompt must degrade safely (omit proof, ask, or flag) rather than fabricate
3. Adversarial cases (2-3) — inputs that bait policy violations: competitor disparagement requests, unverifiable claims supplied as "facts", off-brand tone requests
4. Edge-format cases (1-2) — very long inputs, non-English fragments, emoji-laden source content

Failure analysis loop: every production failure (rejected ad, spam-flagged email, off-brand post) becomes a new test case before the prompt is edited — the marketing equivalent of regression-test-first.
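One way to picture the failure analysis loop is a helper that turns an incident into a test case before anyone touches the prompt. The helper name failure_to_case, the case id, and the sample phrases are hypothetical; the field names are the ones the tester reads.

```python
from typing import Any, Dict, List


def failure_to_case(case_id: str, source_input: str, offending_phrase: str,
                    required_phrases: List[str]) -> Dict[str, Any]:
    """Build a regression case that fails whenever the offending phrase reappears."""
    return {
        "id": case_id,
        "input": source_input,
        "expected_contains": list(required_phrases),
        "forbidden_contains": [offending_phrase],
        "expected_regex": [],
    }


suite: List[Dict[str, Any]] = []
suite.append(
    failure_to_case(
        case_id="rejected-ad-001",
        source_input="Promote the spring sale to returning customers.",
        offending_phrase="risk-free",
        required_phrases=["spring sale"],
    )
)
print(suite[0]["forbidden_contains"])  # ['risk-free']
```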

Anti-Patterns

Single-output judgment — comparing one generation per prompt; sampling variance swamps prompt differences. Run every case ≥ 3 times or compare suite averages.
LLM-as-judge without calibration — if you add a model-graded criterion, calibrate it against human labels on 20+ examples first and re-check periodically (judges drift with model versions).
Score-only promotion — a +2 average that introduces one compliance violation is a regression, not an improvement. Violations gate, scores rank.
Frozen suite — a suite that never grows stops catching new failure modes; tie suite growth to the failure-analysis loop above.
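For the single-output anti-pattern, repeated sampling can be sketched as below. The score_once callable stands in for "run the model once on this case and score the output"; it and average_over_runs are hypothetical, and the fixed sample scores replace a real model call so the example stays deterministic.

```python
from statistics import mean
from typing import Callable


def average_over_runs(score_once: Callable[[], float], runs: int = 3) -> float:
    """Score the same case several times and return the mean, to damp sampling variance."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return mean(score_once() for _ in range(runs))


# Stand-in for three generations of the same case with different scores.
samples = iter([70.0, 100.0, 85.0])
print(average_over_runs(lambda: next(samples)))  # 85.0
```

A single draw from these three samples could have reported anything from 70 to 100, which is the variance the anti-pattern warns about.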

---

Citations (6 sources)

1. Anthropic — "Define your success criteria" + "Create strong empirical evaluations" (docs.anthropic.com/en/docs/build-with-claude/define-success, /develop-tests): measurable criteria and graded test suites before prompt iteration
2. OpenAI Evals — open-source eval framework and registry patterns for templated, deterministic graders (github.com/openai/evals)
3. Hamel Husain — "Your AI Product Needs Evals" (hamel.dev/blog/posts/evals): unit-test-style assertions, failure-driven suite growth, binary human labels
4. Zheng et al. — "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023): LLM-judge agreement rates and bias modes (position, verbosity)
5. Eugene Yan — "Patterns for Building LLM-based Systems & Products" (eugeneyan.com): eval-first development, guardrails as gates vs. scores as ranks
6. Liu et al. — "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (EMNLP 2023): criteria-decomposed grading for content quality dimensions

Marketing Prompt Templates

Production-ready prompt templates for the marketing use cases this skill promises: ad copy, email campaigns, social media, landing pages, and SEO metadata. Each template is written to be testable with scripts/prompt_tester.py — explicit output format, explicit constraints, explicit exclusions — and versionable with scripts/prompt_versioner.py under the semantic name given.

Design principles behind every template (see citations): role + goal up front, output schema explicit, constraints as bullets not prose, variables in {{double_braces}}, and a forbidden-content clause so forbidden_contains checks have something to enforce.

---

1) Ad Copy Variants — `ad_copy_shortform`

You are a direct-response copywriter for {{brand}} ({{one_line_positioning}}).

Write {{count}} ad copy variants for {{platform}} promoting {{offer}}.

Audience: {{audience}} — their #1 pain: {{pain_point}}.
Voice: {{voice_adjectives}}. Reading level: 7th grade.

Hard constraints:
- Headline ≤ {{headline_limit}} characters; primary text ≤ {{body_limit}} characters
- Each variant uses a DIFFERENT angle: pain-led, outcome-led, proof-led, curiosity-led
- One specific, verifiable claim per variant ({{proof_points}}); never invent statistics
- No exclamation-point stacking, no "🚀", no "game-changing/revolutionary/unleash"

Return JSON array: [{"angle":"...","headline":"...","primary_text":"...","cta":"..."}]

Test cases should assert character limits via expected_regex and ban the cliché list via forbidden_contains.
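One way to express a character limit as an expected_regex entry is a negative lookahead anchored at the start of the output, so the pattern matches only when no value exceeds the limit. This matters because the tester uses re.search, which would otherwise pass as soon as a single variant complies. The helper max_length_pattern is hypothetical, and the pattern assumes JSON string values without escaped quotes.

```python
import re


def max_length_pattern(field: str, limit: int) -> str:
    """Pattern that matches only when NO value of `field` is longer than `limit` characters.

    Assumes JSON string values that contain no escaped double quotes.
    """
    too_long = rf'"{re.escape(field)}"\s*:\s*"[^"]{{{limit + 1},}}"'
    # \A pins the check to the start of the output even under re.MULTILINE.
    return rf"\A(?![\s\S]*{too_long})"


pattern = max_length_pattern("headline", 40)
within_limit = '[{"headline":"Stop losing leads"},{"headline":"Close deals faster"}]'
over_limit = '[{"headline":"Stop losing leads"},{"headline":"' + "x" * 41 + '"}]'

print(bool(re.search(pattern, within_limit, flags=re.MULTILINE)))  # True
print(bool(re.search(pattern, over_limit, flags=re.MULTILINE)))    # False
```

The cliché list needs no regex: each banned phrase can go straight into forbidden_contains, which the tester checks as a case-insensitive substring.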

2) Email Campaign Sequence — `email_campaign_writer`

You are a lifecycle email marketer for {{brand}}.

Write email {{n}} of {{total}} in a {{sequence_type}} sequence (goal: {{conversion_goal}}).
Reader context: {{what_they_did}} — they have NOT yet {{what_they_havent_done}}.

Constraints:
- Subject line ≤ 45 chars + preview text ≤ 90 chars; no spam-trigger words (free!!!, act now, limited time)
- Body 90-150 words, one idea, one CTA ({{cta_text}} → {{cta_url}})
- Plain-text tone — write like a competent colleague, not a brand
- Reference the reader's situation in sentence 1; never open with "I hope this finds you well"

Return:
SUBJECT: ...
PREVIEW: ...
BODY:
...
CTA: ...

3) Social Media Post Set — `social_post_repurposer`

You are a social content editor. Repurpose the source content into {{count}} platform-native posts.

Platforms: {{platforms}}.
Source:
{{source_content}}

Per-platform rules:
- X: ≤ 280 chars, hook in first 8 words, max 1 hashtag, placed at the end
- LinkedIn: ≤ 1300 chars, line breaks every 1-2 sentences, no engagement-bait ("Agree?")
- Instagram: caption ≤ 150 words + 5 relevant hashtags at the end

Every post must contain one specific detail (number, name, example) from the source.
Return JSON: [{"platform":"...","post":"...","specific_detail_used":"..."}]

4) Landing Page Section Copy — `landing_section_writer`

You are a conversion copywriter. Write the {{section}} section for a landing page.

Product: {{product}} — for {{audience}} who want {{outcome}}.
Differentiator: {{differentiator}}. Proof available: {{proof_points}}.

Constraints:
- Headline: specific outcome, ≤ 12 words, no category jargon
- Body: benefit-first, "you" language, ≤ 60 words
- Use ONLY the proof points provided; if none fit, omit proof rather than invent it
- CTA button: verb + value ("Get my report"), never "Submit"/"Learn more"

Return markdown with HEADLINE / BODY / CTA blocks.

5) SEO Title + Meta Description — `seo_meta_writer`

You are an SEO editor. Write title tag + meta description for the page below.

Primary keyword: {{keyword}} (must appear in title, near the front, naturally).
Search intent: {{intent}}. Page summary: {{summary}}.

Constraints:
- Title ≤ 60 characters, no clickbait, no ALL CAPS, brand suffix " | {{brand}}" if it fits
- Meta description 150-160 characters, includes keyword once, ends with a reason to click
- Describe what the page actually contains — no promises the page doesn't keep

Return JSON: {"title":"...","title_chars":N,"meta":"...","meta_chars":N}
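Since the template asks the model to self-report character counts, a check that recomputes them can catch miscounts. The sketch below is a hypothetical validator (check_seo_meta is not part of the skill) that applies the limits stated in the template to a returned JSON object; the sample title and padded meta text are placeholders.

```python
import json
from typing import List


def check_seo_meta(raw_output: str, keyword: str) -> List[str]:
    """Validate title/meta against the template's limits; return a list of problems."""
    data = json.loads(raw_output)
    title, meta = data["title"], data["meta"]
    problems: List[str] = []
    if len(title) > 60:
        problems.append(f"title is {len(title)} chars (limit 60)")
    if not 150 <= len(meta) <= 160:
        problems.append(f"meta is {len(meta)} chars (expected 150-160)")
    if keyword.lower() not in title.lower():
        problems.append("keyword missing from title")
    if meta.lower().count(keyword.lower()) != 1:
        problems.append("keyword should appear exactly once in meta")
    if data.get("title_chars") != len(title):
        problems.append("title_chars does not match the actual title length")
    if data.get("meta_chars") != len(meta):
        problems.append("meta_chars does not match the actual meta length")
    return problems


title = "Prompt versioning guide | Acme"
meta = "Learn prompt versioning with diffs and changelogs.".ljust(155, ".")  # padded placeholder
honest = json.dumps({"title": title, "title_chars": len(title), "meta": meta, "meta_chars": len(meta)})
miscounted = json.dumps({"title": title, "title_chars": 25, "meta": meta, "meta_chars": len(meta)})

print(check_seo_meta(honest, "prompt versioning"))      # []
print(check_seo_meta(miscounted, "prompt versioning"))
```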

6) Brand-Voice Content Rewrite — `brand_voice_rewriter`

You are {{brand}}'s editor. Rewrite the draft in our voice without changing facts or claims.

Voice profile (from .claude/product-marketing-context.md): {{voice_profile}}
Words we use: {{lexicon_yes}}. Words we never use: {{lexicon_no}}.

Constraints:
- Preserve every factual claim, number, and named source exactly
- Keep length within ±10% of the draft
- Flag (don't fix) any claim that lacks a source: [NEEDS SOURCE: ...]

Draft:
{{input}}

7) Generic Building Blocks

The original toolkit templates (structured extractor, classifier, summarizer, constrained rewrite, persona rewrite, policy-compliance check, prompt critique) remain useful as building blocks for non-content marketing automation — lead triage, review mining, survey coding. Pattern:

Classify input into one of: {{labels}}. Return only the label.
Input: {{input}}

Compose them: e.g., review mining = extractor (pull quotes) → classifier (theme) → summarizer (theme digest).
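The review-mining composition can be sketched as three callables chained together. In practice each callable would wrap a model call using the matching building-block prompt; here plain Python stubs stand in so the example runs on its own. The name mine_reviews and all three stubs are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def mine_reviews(
    reviews: List[str],
    extract: Callable[[str], List[str]],
    classify: Callable[[str], str],
    summarize: Callable[[List[str]], str],
) -> Dict[str, str]:
    """Chain extractor -> classifier -> summarizer and return one digest per theme."""
    quotes_by_theme: Dict[str, List[str]] = defaultdict(list)
    for review in reviews:
        for quote in extract(review):
            quotes_by_theme[classify(quote)].append(quote)
    return {theme: summarize(quotes) for theme, quotes in quotes_by_theme.items()}


# Stubs standing in for the three prompt-backed steps.
def extract_stub(review: str) -> List[str]:
    return [part.strip() for part in review.split(".") if part.strip()]


def classify_stub(quote: str) -> str:
    return "pricing" if "pricing" in quote.lower() else "other"


def summarize_stub(quotes: List[str]) -> str:
    return f"{len(quotes)} quote(s)"


digest = mine_reviews(
    ["Setup was fast. Pricing is confusing.", "Pricing felt high."],
    extract_stub,
    classify_stub,
    summarize_stub,
)
print(digest)  # {'other': '1 quote(s)', 'pricing': '2 quote(s)'}
```

Keeping each step behind its own callable means each prompt can be versioned and tested separately.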

---

Citations (6 sources)

1. Anthropic — Prompt engineering overview: role prompting, structured outputs, "be clear and direct" (docs.anthropic.com/en/docs/build-with-claude/prompt-engineering)
2. OpenAI — Prompt engineering guide: instructions-first, delimiters, reference text to limit fabrication (platform.openai.com/docs/guides/prompt-engineering)
3. Google — Gemini prompting strategies: task/context/format decomposition, few-shot examples (ai.google.dev/gemini-api/docs/prompting-strategies)
4. Brown et al. — "Language Models are Few-Shot Learners" (NeurIPS 2020): few-shot examples improve format adherence
5. DAIR.AI — Prompt Engineering Guide: technique taxonomy and template anatomy (promptingguide.ai)
6. Ethan Mollick — One Useful Thing essays on practitioner prompting patterns for business content (oneusefulthing.org)

Technique Guide + LLM Governance for Marketing Teams

Two things in one reference: (1) which prompting technique to use for which marketing task, and (2) the governance layer — the rules a marketing team needs so AI-assisted content ships safely, legally, and on-brand at scale.

---

Part 1: Technique Selection for Marketing Tasks

Technique	Use when	Marketing examples
Zero-shot + tight constraints	Task is well-specified and format is simple	SEO meta tags, UTM naming, subject lines
Few-shot (2-5 examples)	Voice/format is hard to describe but easy to show	Brand-voice posts, email tone, ad-angle patterns — paste your 3 best-performing examples
Chain-of-thought / plan-then-write	Multi-step reasoning before output	Campaign briefs (audience → angle → channel → copy), positioning drafts
Structured output (JSON/schema)	Output feeds another tool or script	Ad variant sets, calendar entries, anything `prompt_tester.py` will grade by regex
Decomposition (prompt chains)	One mega-prompt underperforms	Research → outline → draft → brand-voice rewrite → compliance check, each step testable separately
Self-critique pass	Quality gate before human review	"List 3 weaknesses of this draft against the brief, then fix them"

Construction checklist (every marketing prompt): explicit role + goal; the audience and their pain named; output format with limits (chars/words); constraints as bullets; a forbidden list (clichés, banned claims, competitor names); instruction for missing inputs ("if no proof point fits, omit proof — never invent").

Failure patterns to check before testing: objective too broad ("write something engaging"); missing output schema; contradictory constraints (casual tone + formal compliance phrasing in one prompt); no negative instructions, so the model fills gaps with invented stats; hidden assumptions (brand voice referenced but not provided — pass the actual voice profile from .claude/product-marketing-context.md).

---

Part 2: LLM Governance for Marketing

Marketing is a high-exposure surface for AI failure: invented statistics in ads, undisclosed AI-generated endorsements, off-brand tone at scale, and privacy violations in personalization. Governance turns those from incidents into checklist items.

The Governance Stack

1. Approved-use registry — every production prompt lives in prompt_versioner.py with a named owner, author history, and change notes. No anonymous prompt edits in production workflows.
2. Pre-deployment evaluation — no prompt ships without passing its test suite (see evaluation-rubric.md). Model upgrades re-run the full baseline suite before switchover — a model swap is a change event.
3. Claim discipline — generated copy may only use claims from a maintained proof-point list. Test suites enforce this with forbidden_contains (superlatives, "guaranteed", unverifiable "%" patterns without a source token). A human verifies any new claim before it enters the proof list.
4. Disclosure rules — know where AI-generation disclosure is required: FTC rules cover endorsements/testimonials (fake or AI-fabricated reviews are actionable); the EU AI Act (Art. 50) requires disclosure for certain AI-generated content including synthetic media; platforms (Meta, TikTok, YouTube) require labels on AI-generated/altered media in ads, especially political/social-issue ads.
5. Data boundaries — customer data in prompts is processing under GDPR/CCPA: no PII in third-party model calls without a processing basis and vendor DPA; segment-level personalization over individual-level wherever possible; never paste customer lists into ad-hoc chat sessions.
6. Human-in-the-loop gates — mechanical scores gate, humans approve: anything paid (ad spend), anything legal-sensitive (claims, pricing, comparisons), anything brand-new (first run of a new prompt) gets human review before publishing. Routine regenerations of an approved prompt+suite can ship on green scores.
7. Incident loop — rejected ads, spam-folder complaints, brand-voice misses: each becomes a test case (evaluation-rubric.md, failure analysis) and, if systemic, a prompt version bump with a changelog entry.
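As one illustration of claim discipline, a check can flag percentage figures in generated copy that do not appear in the maintained proof-point list. This is a sketch under assumptions: unapproved_percentages is a hypothetical helper, the proof-point strings are invented examples, and exact figure matching is a simplification of whatever source-token convention a team adopts.

```python
import re
from typing import List

PERCENT = re.compile(r"\d+(?:\.\d+)?%")


def unapproved_percentages(copy: str, proof_points: List[str]) -> List[str]:
    """Return percentage figures in the copy that no approved proof point contains."""
    approved = set()
    for point in proof_points:
        approved.update(PERCENT.findall(point))
    return [figure for figure in PERCENT.findall(copy) if figure not in approved]


proof_list = ["32% lower churn (2025 customer survey)"]
flagged = unapproved_percentages("Cut churn by 32% and boost signups 50%.", proof_list)
print(flagged)  # ['50%']
```

Anything this returns would go to a human reviewer before the figure is either removed from the copy or added to the proof list.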

Roles

Role	Owns
Prompt owner (per workflow)	Template, test suite, version history
Marketing ops	Registry, model-change re-evaluation calendar
Legal/compliance reviewer	Claim list, disclosure map, escalation calls
Brand lead	Voice profile, lexicon-yes/no lists

Minimum Viable Governance (small team)

If the full stack is too heavy: (1) version every production prompt, (2) maintain the forbidden-claims list and wire it into forbidden_contains, (3) human-review everything paid, (4) re-run the suite on model changes. These four catch the expensive failures.

---

Citations (7 sources)

1. NIST — AI Risk Management Framework 1.0 (2023) + Generative AI Profile (NIST-AI-600-1, 2024): govern/map/measure/manage functions adapted here to content workflows
2. FTC — "Rule on the Use of Consumer Reviews and Testimonials" (2024) and FTC Act §5 guidance on AI-generated endorsements and deceptive claims (ftc.gov)
3. EU AI Act — Regulation (EU) 2024/1689, Art. 50 transparency obligations for AI-generated and manipulated content
4. ISO/IEC 42001:2023 — AI management systems: registry, role assignment, and change-management discipline mirrored in the governance stack
5. Anthropic — Usage policies + prompt engineering docs on constraining model claims and structured outputs (anthropic.com/legal/aup, docs.anthropic.com)
6. Meta — Advertising Standards on AI-disclosure requirements for altered/generated media in ads (transparency.fb.com / Meta Business Help Center)
7. GDPR (Regulation 2016/679) Arts. 6, 28 — processing basis and processor agreements governing customer data sent to model vendors

#!/usr/bin/env python3
"""A/B test prompts against structured test cases.

Supports:
- --input JSON payload or stdin JSON payload
- --prompt-a/--prompt-b or file variants
- --cases-file for test suite JSON
- optional --runner-cmd with {prompt} and {input} placeholders

If runner command is omitted, script performs static prompt quality scoring only.
"""

import argparse
import json
import re
import shlex
import subprocess
import sys
from dataclasses import dataclass, asdict
from pathlib import Path
from statistics import mean
from typing import Any, Dict, List, Optional


class CLIError(Exception):
    """Raised for expected CLI errors."""


@dataclass
class CaseScore:
    case_id: str
    prompt_variant: str
    score: float
    matched_expected: int
    missed_expected: int
    forbidden_hits: int
    regex_matches: int
    output_length: int


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="A/B test prompts against test cases.")
    parser.add_argument("--input", help="JSON input file for full payload.")
    parser.add_argument("--prompt-a", help="Prompt A text.")
    parser.add_argument("--prompt-b", help="Prompt B text.")
    parser.add_argument("--prompt-a-file", help="Path to prompt A file.")
    parser.add_argument("--prompt-b-file", help="Path to prompt B file.")
    parser.add_argument("--cases-file", help="Path to JSON test cases array.")
    parser.add_argument(
        "--runner-cmd",
        help="External command template, e.g. 'llm --prompt {prompt} --input {input}'.",
    )
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")
    return parser.parse_args()


def read_text_file(path: Optional[str]) -> Optional[str]:
    if not path:
        return None
    try:
        return Path(path).read_text(encoding="utf-8")
    except Exception as exc:
        raise CLIError(f"Failed reading file {path}: {exc}") from exc


def load_payload(args: argparse.Namespace) -> Dict[str, Any]:
    if args.input:
        try:
            return json.loads(Path(args.input).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input payload: {exc}") from exc

    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc

    payload: Dict[str, Any] = {}

    prompt_a = args.prompt_a or read_text_file(args.prompt_a_file)
    prompt_b = args.prompt_b or read_text_file(args.prompt_b_file)
    if prompt_a:
        payload["prompt_a"] = prompt_a
    if prompt_b:
        payload["prompt_b"] = prompt_b

    if args.cases_file:
        try:
            payload["cases"] = json.loads(Path(args.cases_file).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --cases-file: {exc}") from exc

    if args.runner_cmd:
        payload["runner_cmd"] = args.runner_cmd

    return payload


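def run_runner(runner_cmd: str, prompt: str, case_input: str) -> str:
    # Split the template before substituting, so a prompt or input that contains
    # spaces or quotes stays a single argument instead of being re-tokenized.
    parts = [part.format(prompt=prompt, input=case_input) for part in shlex.split(runner_cmd)]
    try:
        proc = subprocess.run(parts, text=True, capture_output=True, check=True)
    except FileNotFoundError as exc:
        raise CLIError(f"Runner command not found: {parts[0]}") from exc
    except subprocess.CalledProcessError as exc:
        raise CLIError(f"Runner command failed: {exc.stderr.strip()}") from exc
    return proc.stdout.strip()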


def static_output(prompt: str, case_input: str) -> str:
    rendered = prompt.replace("{{input}}", case_input)
    return rendered


def score_output(case: Dict[str, Any], output: str, prompt_variant: str) -> CaseScore:
    case_id = str(case.get("id", "case"))
    expected = [str(x) for x in case.get("expected_contains", []) if str(x)]
    forbidden = [str(x) for x in case.get("forbidden_contains", []) if str(x)]
    regexes = [str(x) for x in case.get("expected_regex", []) if str(x)]

    matched_expected = sum(1 for item in expected if item.lower() in output.lower())
    missed_expected = len(expected) - matched_expected
    forbidden_hits = sum(1 for item in forbidden if item.lower() in output.lower())
    regex_matches = 0
    for pattern in regexes:
        try:
            if re.search(pattern, output, flags=re.MULTILINE):
                regex_matches += 1
        except re.error:
            pass

    score = 100.0
    score -= missed_expected * 15
    score -= forbidden_hits * 25
    score += regex_matches * 8

    # Heuristic penalty for unbounded verbosity
    if len(output) > 4000:
        score -= 10
    if len(output.strip()) < 10:
        score -= 10

    score = max(0.0, min(100.0, score))

    return CaseScore(
        case_id=case_id,
        prompt_variant=prompt_variant,
        score=score,
        matched_expected=matched_expected,
        missed_expected=missed_expected,
        forbidden_hits=forbidden_hits,
        regex_matches=regex_matches,
        output_length=len(output),
    )


def aggregate(scores: List[CaseScore]) -> Dict[str, Any]:
    if not scores:
        return {"average": 0.0, "min": 0.0, "max": 0.0, "cases": 0}
    vals = [s.score for s in scores]
    return {
        "average": round(mean(vals), 2),
        "min": round(min(vals), 2),
        "max": round(max(vals), 2),
        "cases": len(vals),
    }


def main() -> int:
    args = parse_args()
    payload = load_payload(args)

    prompt_a = str(payload.get("prompt_a", "")).strip()
    prompt_b = str(payload.get("prompt_b", "")).strip()
    cases = payload.get("cases", [])
    runner_cmd = payload.get("runner_cmd")

    if not prompt_a or not prompt_b:
        raise CLIError("Both prompt_a and prompt_b are required (flags or JSON payload).")
    if not isinstance(cases, list) or not cases:
        raise CLIError("cases must be a non-empty array.")

    scores_a: List[CaseScore] = []
    scores_b: List[CaseScore] = []

    for case in cases:
        if not isinstance(case, dict):
            continue
        case_input = str(case.get("input", "")).strip()

        output_a = run_runner(runner_cmd, prompt_a, case_input) if runner_cmd else static_output(prompt_a, case_input)
        output_b = run_runner(runner_cmd, prompt_b, case_input) if runner_cmd else static_output(prompt_b, case_input)

        scores_a.append(score_output(case, output_a, "A"))
        scores_b.append(score_output(case, output_b, "B"))

    agg_a = aggregate(scores_a)
    agg_b = aggregate(scores_b)
    winner = "A" if agg_a["average"] >= agg_b["average"] else "B"

    result = {
        "summary": {
            "winner": winner,
            "prompt_a": agg_a,
            "prompt_b": agg_b,
            "mode": "runner" if runner_cmd else "static",
        },
        "case_scores": {
            "prompt_a": [asdict(item) for item in scores_a],
            "prompt_b": [asdict(item) for item in scores_b],
        },
    }

    if args.format == "json":
        print(json.dumps(result, indent=2))
    else:
        print("Prompt A/B test result")
        print(f"- mode: {result['summary']['mode']}")
        print(f"- winner: {winner}")
        print(f"- prompt A avg: {agg_a['average']}")
        print(f"- prompt B avg: {agg_b['average']}")
        print("Case details:")
        for item in scores_a + scores_b:
            print(
                f"- case={item.case_id} variant={item.prompt_variant} score={item.score} "
                f"expected+={item.matched_expected} forbidden={item.forbidden_hits} regex={item.regex_matches}"
            )

    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)
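
The winner rule in main above can be illustrated with a minimal sketch of how aggregate and the >= comparison behave when both prompts reach the same average. CaseScore here is a simplified stand-in for the script's dataclass (only case_id, prompt_variant, and score are kept), and pick_winner is a hypothetical helper that does not exist in the script.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Dict, List


@dataclass
class CaseScore:
    # Simplified stand-in: the script's dataclass carries more fields.
    case_id: str
    prompt_variant: str
    score: float


def aggregate(scores: List[CaseScore]) -> Dict[str, Any]:
    # Same shape as the script's aggregate(): empty input yields zeros.
    if not scores:
        return {"average": 0.0, "min": 0.0, "max": 0.0, "cases": 0}
    vals = [s.score for s in scores]
    return {
        "average": round(mean(vals), 2),
        "min": round(min(vals), 2),
        "max": round(max(vals), 2),
        "cases": len(vals),
    }


def pick_winner(scores_a: List[CaseScore], scores_b: List[CaseScore]) -> str:
    # Hypothetical helper mirroring the comparison in main(): ties go to "A".
    agg_a, agg_b = aggregate(scores_a), aggregate(scores_b)
    return "A" if agg_a["average"] >= agg_b["average"] else "B"


# Illustrative scores only: both variants average 70.0.
scores_a = [CaseScore("c1", "A", 80.0), CaseScore("c2", "A", 60.0)]
scores_b = [CaseScore("c1", "B", 70.0), CaseScore("c2", "B", 70.0)]
summary_a = aggregate(scores_a)
summary_b = aggregate(scores_b)
winner = pick_winner(scores_a, scores_b)
print(winner, summary_a, summary_b)
```

Because equal averages resolve to A, it may be reasonable to treat prompt A as the incumbent and require prompt B to beat it outright. Note also that B has the tighter min/max spread in this example, which the average alone does not show.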

#!/usr/bin/env python3
"""Version and diff prompts with a local JSONL history store.

Commands:
- add
- list
- diff
- changelog

Input modes:
- prompt text via --prompt, --prompt-file, --input JSON, or stdin JSON
"""

import argparse
import difflib
import json
import sys
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional


class CLIError(Exception):
    """Raised for expected CLI failures."""


@dataclass
class PromptVersion:
    name: str
    version: int
    author: str
    timestamp: str
    change_note: str
    prompt: str


def add_common_subparser_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--store", default=".prompt_versions.jsonl", help="JSONL history file path.")
    parser.add_argument("--input", help="Optional JSON input file with prompt payload.")
    parser.add_argument("--format", choices=["text", "json"], default="text", help="Output format.")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Version and diff prompts.")

    sub = parser.add_subparsers(dest="command", required=True)

    add = sub.add_parser("add", help="Add a new prompt version.")
    add_common_subparser_args(add)
    add.add_argument("--name", required=True, help="Prompt identifier.")
    add.add_argument("--prompt", help="Prompt text.")
    add.add_argument("--prompt-file", help="Prompt file path.")
    add.add_argument("--author", default="unknown", help="Author name.")
    add.add_argument("--change-note", default="", help="Reason for this revision.")

    ls = sub.add_parser("list", help="List versions for a prompt.")
    add_common_subparser_args(ls)
    ls.add_argument("--name", required=True, help="Prompt identifier.")

    diff = sub.add_parser("diff", help="Diff two prompt versions.")
    add_common_subparser_args(diff)
    diff.add_argument("--name", required=True, help="Prompt identifier.")
    diff.add_argument("--from-version", type=int, required=True)
    diff.add_argument("--to-version", type=int, required=True)

    changelog = sub.add_parser("changelog", help="Show changelog for a prompt.")
    add_common_subparser_args(changelog)
    changelog.add_argument("--name", required=True, help="Prompt identifier.")
    return parser


def read_optional_json(input_path: Optional[str]) -> Dict[str, Any]:
    if input_path:
        try:
            return json.loads(Path(input_path).read_text(encoding="utf-8"))
        except Exception as exc:
            raise CLIError(f"Failed reading --input: {exc}") from exc

    if not sys.stdin.isatty():
        raw = sys.stdin.read().strip()
        if raw:
            try:
                return json.loads(raw)
            except json.JSONDecodeError as exc:
                raise CLIError(f"Invalid JSON from stdin: {exc}") from exc

    return {}


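def read_store(path: Path) -> List[PromptVersion]:
    if not path.exists():
        return []
    versions: List[PromptVersion] = []
    for line_no, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
        if not line.strip():
            continue
        try:
            versions.append(PromptVersion(**json.loads(line)))
        except (json.JSONDecodeError, TypeError) as exc:
            # Report a malformed or incomplete record as a CLI error instead of a traceback.
            raise CLIError(f"Corrupt record in {path} at line {line_no}: {exc}") from exc
    return versions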


def write_store(path: Path, versions: List[PromptVersion]) -> None:
    payload = "\n".join(json.dumps(asdict(v), ensure_ascii=True) for v in versions)
    path.write_text(payload + ("\n" if payload else ""), encoding="utf-8")


def get_prompt_text(args: argparse.Namespace, payload: Dict[str, Any]) -> str:
    if args.prompt:
        return args.prompt
    if args.prompt_file:
        try:
            return Path(args.prompt_file).read_text(encoding="utf-8")
        except Exception as exc:
            raise CLIError(f"Failed reading prompt file: {exc}") from exc
    if payload.get("prompt"):
        return str(payload["prompt"])
    raise CLIError("Prompt content required via --prompt, --prompt-file, --input JSON, or stdin JSON.")


def next_version(versions: List[PromptVersion], name: str) -> int:
    existing = [v.version for v in versions if v.name == name]
    return (max(existing) + 1) if existing else 1


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    payload = read_optional_json(args.input)

    store_path = Path(args.store)
    versions = read_store(store_path)

    if args.command == "add":
        prompt_name = str(payload.get("name", args.name))
        prompt_text = get_prompt_text(args, payload)
        author = str(payload.get("author", args.author))
        change_note = str(payload.get("change_note", args.change_note))

        item = PromptVersion(
            name=prompt_name,
            version=next_version(versions, prompt_name),
            author=author,
            timestamp=datetime.now(timezone.utc).isoformat(),
            change_note=change_note,
            prompt=prompt_text,
        )
        versions.append(item)
        write_store(store_path, versions)
        output: Dict[str, Any] = {"added": asdict(item), "store": str(store_path.resolve())}

    elif args.command == "list":
        prompt_name = str(payload.get("name", args.name))
        matches = [asdict(v) for v in versions if v.name == prompt_name]
        output = {"name": prompt_name, "versions": matches}

    elif args.command == "changelog":
        prompt_name = str(payload.get("name", args.name))
        matches = [v for v in versions if v.name == prompt_name]
        entries = [
            {
                "version": v.version,
                "author": v.author,
                "timestamp": v.timestamp,
                "change_note": v.change_note,
            }
            for v in matches
        ]
        output = {"name": prompt_name, "changelog": entries}

    elif args.command == "diff":
        prompt_name = str(payload.get("name", args.name))
        from_v = int(payload.get("from_version", args.from_version))
        to_v = int(payload.get("to_version", args.to_version))

        by_name = [v for v in versions if v.name == prompt_name]
        old = next((v for v in by_name if v.version == from_v), None)
        new = next((v for v in by_name if v.version == to_v), None)
        if not old or not new:
            raise CLIError("Requested versions not found for prompt name.")

        diff_lines = list(
            difflib.unified_diff(
                old.prompt.splitlines(),
                new.prompt.splitlines(),
                fromfile=f"{prompt_name}@v{from_v}",
                tofile=f"{prompt_name}@v{to_v}",
                lineterm="",
            )
        )
        output = {
            "name": prompt_name,
            "from_version": from_v,
            "to_version": to_v,
            "diff": diff_lines,
        }

    else:
        raise CLIError("Unknown command.")

    if args.format == "json":
        print(json.dumps(output, indent=2))
    else:
        if args.command == "add":
            added = output["added"]
            print("Prompt version added")
            print(f"- name: {added['name']}")
            print(f"- version: {added['version']}")
            print(f"- author: {added['author']}")
            print(f"- store: {output['store']}")
        elif args.command in ("list", "changelog"):
            print(f"Prompt: {output['name']}")
            key = "versions" if args.command == "list" else "changelog"
            items = output[key]
            if not items:
                print("- no entries")
            else:
                for item in items:
                    line = f"- v{item.get('version')} by {item.get('author')} at {item.get('timestamp')}"
                    note = item.get("change_note")
                    if note:
                        line += f" | {note}"
                    print(line)
        else:
            print("\n".join(output["diff"]) if output["diff"] else "No differences.")

    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except CLIError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        raise SystemExit(2)
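
As a rough illustration of the add and diff flow in the script above, the sketch below writes two versions of a prompt to a temporary JSONL store and diffs them with difflib.unified_diff. It is a simplified approximation, not the script itself: load_records and append_record are hypothetical helpers, the records omit the author and timestamp fields, and the prompt text is made up for the example.

```python
import difflib
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_records(store: Path) -> List[Dict[str, Any]]:
    # One JSON object per non-empty line, as in the script's read_store().
    if not store.exists():
        return []
    lines = store.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def append_record(store: Path, name: str, prompt: str, change_note: str) -> int:
    # Hypothetical helper: next version is max(existing) + 1 for this name.
    records = load_records(store)
    existing = [r["version"] for r in records if r["name"] == name]
    version = max(existing) + 1 if existing else 1
    records.append(
        {"name": name, "version": version, "change_note": change_note, "prompt": prompt}
    )
    store.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    return version


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / ".prompt_versions.jsonl"
    v1 = append_record(
        store, "ad_copy_shortform",
        "Write three headlines.\nKeep each under 40 characters.", "initial",
    )
    v2 = append_record(
        store, "ad_copy_shortform",
        "Write three headlines.\nKeep each under 30 characters.", "tighter limit",
    )
    records = load_records(store)

old, new = records[0], records[1]
diff_lines = list(
    difflib.unified_diff(
        old["prompt"].splitlines(),
        new["prompt"].splitlines(),
        fromfile=f"{old['name']}@v{old['version']}",
        tofile=f"{new['name']}@v{new['version']}",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```

Splitting prompts into lines before diffing keeps the output readable for multi-line prompts; a single-line prompt would show up as one removed line and one added line.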

Related skills

Seo Audit: Run structured SEO audits on a SaaS site or content hub and receive a prioritized action plan. 173k · 42.2k

Copywriting: Generate, rewrite, or strengthen persuasive website and landing-page copy that converts visitors into users. 164k · 42.2k

Viral Short Form: Quickly generate high-retention hooks, scripts, and outlines for TikTok, Reels, YouTube Shorts, and carousels. 132k · 67

Viral Hooks: Write and critique viral hooks for short-form video opening sequences. 123k · 67

Viral Captions And Ctas: Optimize social media captions and CTAs for viral short-form video reach and saves. 123k · 67

Viral Youtube Shorts: Write and diagnose YouTube Shorts for the Shorts Feed and long-form funnel. 123k · 67

How it compares

Choose prompt-engineer-toolkit over generic writing skills when the goal is versioned marketing prompt templates and testable AI content workflows, not one-shot copy generation.

FAQ

What marketing channels does prompt-engineer-toolkit cover?

prompt-engineer-toolkit targets AI-assisted ad copy, email campaigns, and social media posts. The skill rewrites prompts and builds reusable templates so those channels stay on-brand after model changes.

When should Claude Code load prompt-engineer-toolkit?

Load prompt-engineer-toolkit when a developer mentions prompt engineering, prompt templates, AI writing quality, or AI content workflows. Triggers also include requests to improve prompts for marketing use cases.

Is Prompt Engineer Toolkit safe to install?

skills.sh reports 2 of 3 security scanners passed. Review the Security Audits panel on this page before installing in production.

Marketing & SEO · content · distribution
