SPEC-0017: Skill Evaluation and CI Testing
Overview
A testing and evaluation framework for the design plugin's 15 skills, using skill-creator's eval infrastructure with GitHub Actions CI integration. Covers eval authoring, automated test runs, assertion-based grading, benchmark tracking, and cross-skill pipeline testing. See 📝 ADR-0021.
Requirements
Requirement: Eval Authoring
Every skill in skills/*/SKILL.md MUST have at least 2 test prompts defined in evals/evals.json. Each test prompt MUST include a realistic user message (the kind of thing a user would actually type), an expected_output description, and an assertions array with objectively verifiable checks. Test prompts MUST NOT be trivial one-liners — they SHOULD include context, file paths, and specifics that a real user would provide.
Skills MUST be grouped into tiers for test allocation:
- Tier 1 (plan, work, review, audit): 3-4 test prompts each
- Tier 2 (spec, check, discover, docs): 2-3 test prompts each
- Tier 3 (adr, init, prime, list, status, organize, enrich): 2 test prompts each
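Under these rules, an entry in `evals/evals.json` might look like the following sketch. The field names beyond `expected_output` and `assertions` (such as `skill`, `tier`, and `prompts`) are assumptions for illustration, since the exact skill-creator schema is not reproduced here:

```json
{
  "skill": "plan",
  "tier": 1,
  "prompts": [
    {
      "message": "Plan a sprint from SPEC-0014 for the claude-plugin-design repo on GitHub, creating an epic and story issues.",
      "expected_output": "An epic issue plus 3-4 story issues, each with a Requirements section and branch naming conventions.",
      "assertions": [
        { "text": "Epic issue created with 'epic' label" },
        { "text": "Each story has a ## Requirements section" }
      ]
    }
  ]
}
```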
Scenario: New skill added without evals
- WHEN a new skill is added to `skills/` and no corresponding entries exist in `evals/evals.json`
- THEN the CI workflow MUST fail with the message: "Skill '{name}' has no eval entries in evals/evals.json. Add at least 2 test prompts."
Scenario: Realistic test prompt quality
- WHEN an eval prompt is authored for `/design:plan`
- THEN the prompt includes a specific spec reference, mentions relevant flags, and provides enough context for the skill to operate (e.g., "Plan a sprint from SPEC-0014 for the claude-plugin-design repo on GitHub")
Scenario: Tier allocation respected
- WHEN `/design:plan` (Tier 1) has only 2 test prompts
- THEN CI SHOULD warn: "Tier 1 skill 'plan' has only 2 evals (minimum recommended: 3)"
Requirement: Automated Test Runner
A GitHub Actions workflow MUST exist at .github/workflows/skill-evals.yml that runs skill evaluations automatically. The workflow MUST trigger on pull requests that modify skills/**, references/**, or evals/**. The workflow MUST use claude-code-action to invoke run_eval.py from the skill-creator plugin for each modified skill's test prompts.
The workflow MUST support two modes:
- Quick mode (default on PRs): Run only Tier 3 evals and evals for modified skills
- Full mode (triggered by the `full-eval` label or on release branches): Run all tiers
Each eval run MUST execute both a with-skill run (skill loaded) and a baseline run (no skill) to measure the skill's value-add.
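The trigger paths and mode selection could be sketched roughly as follows in `.github/workflows/skill-evals.yml`. The step contents are illustrative only, not the actual workflow; the `release` base-branch name and `full-eval` label check follow this spec's conventions:

```yaml
name: skill-evals
on:
  pull_request:
    paths:
      - "skills/**"
      - "references/**"
      - "evals/**"

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Full mode on release branches or when the PR carries the full-eval
      # label; otherwise quick mode (Tier 3 evals + modified skills).
      - name: Select eval mode
        run: |
          if [ "${{ github.base_ref }}" = "release" ] || \
             ${{ contains(github.event.pull_request.labels.*.name, 'full-eval') }}; then
            echo "EVAL_MODE=full" >> "$GITHUB_ENV"
          else
            echo "EVAL_MODE=quick" >> "$GITHUB_ENV"
          fi
```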
Scenario: PR modifying plan skill triggers evals
- WHEN a PR modifies `skills/plan/SKILL.md`
- THEN the CI workflow runs all eval prompts for the `plan` skill plus Tier 3 evals, in quick mode
Scenario: Full eval on release branch
- WHEN a PR targets the `release` branch or has the `full-eval` label
- THEN the CI workflow runs all eval prompts across all tiers
Scenario: No skill changes in PR
- WHEN a PR modifies only `docs/` or `README.md` (no skills or references changed)
- THEN the eval workflow does not run (skip)
Requirement: Assertion-Based Grading
Each eval prompt MUST have at least 2 grading assertions in evals/evals.json. Assertions MUST be objectively verifiable — not subjective quality judgments. Assertions MUST use the text, passed, and evidence fields expected by the eval viewer.
Common assertion types for design plugin skills:
- File existence: "ADR file was created at docs/adrs/ADR-XXXX-*.md"
- Content structure: "Spec contains a ## Requirements section with at least one ### Requirement:"
- Section presence: "Security Requirements section present in web-facing spec"
- Format compliance: "YAML frontmatter contains status: proposed"
- Reference correctness: "Governing comment references the correct ADR number"
- Tool usage: "Skill used Glob to find existing artifacts before creating new ones"
Assertions that test content quality (e.g., "the ADR rationale is compelling") SHOULD be deferred to human review via the eval viewer, not automated.
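A grader along these lines could implement the objective assertion types above. The `text`, `passed`, and `evidence` fields follow this requirement; the helper functions themselves are an illustrative sketch, not skill-creator's actual grader:

```python
import re
from pathlib import Path


def check_section(spec_text: str, heading: str) -> dict:
    """Objectively verifiable check: does a markdown heading exist?

    Returns the text/passed/evidence fields the eval viewer expects."""
    pattern = rf"^##+\s+{re.escape(heading)}\b"
    found = re.search(pattern, spec_text, flags=re.MULTILINE)
    return {
        "text": f"{heading} section present",
        "passed": found is not None,
        "evidence": (found.group(0) if found
                     else f"No ## {heading} section found in generated spec"),
    }


def check_file_exists(path: str) -> dict:
    """Objectively verifiable check: was the artifact written to disk?"""
    exists = Path(path).exists()
    return {
        "text": f"File exists at {path}",
        "passed": exists,
        "evidence": f"{path} {'found' if exists else 'missing'}",
    }
```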
Scenario: Grading a plan skill eval
- WHEN the grader evaluates a `/design:plan` test run
- THEN it checks assertions like: "Epic issue created with 'epic' label", "3-4 story issues created", "Each story has a ## Requirements section", "Branch naming conventions present in issue bodies"
Scenario: Assertion failure reported
- WHEN an assertion fails (e.g., "Security Requirements section present" but it's missing)
- THEN the grading output includes `passed: false` and `evidence: "No ## Security Requirements section found in generated spec.md"`
Requirement: Benchmark Tracking
After grading, the workflow MUST run aggregate_benchmark.py to produce benchmark.json with pass rates, timing, and token usage per skill. Benchmark results MUST be posted as a PR comment showing:
- Per-skill pass rate (with delta from main branch if available)
- Mean execution time per skill
- Mean token usage per skill
- Any skills that dropped below the 80% pass rate threshold
Benchmark data MUST be committed to evals/benchmarks/ on merge to main, enabling cross-release comparison.
Scenario: Benchmark posted on PR
- WHEN evals complete on a PR
- THEN a comment is posted with a summary table showing each skill's pass rate, timing, and token usage
Scenario: Pass rate regression detected
- WHEN a Tier 1 skill (plan, work, review, audit) drops below 80% pass rate
- THEN the CI check MUST fail with: "Tier 1 skill '{name}' pass rate is {N}% (threshold: 80%)"
Scenario: Benchmark history available
- WHEN a developer wants to see how skill performance has changed over releases
- THEN they can compare `evals/benchmarks/` across commits to see trends
Requirement: Eval Viewer Integration
After test runs complete, the workflow MUST generate a static HTML eval viewer using generate_review.py --static for human review of qualitative outputs. The viewer MUST be uploaded as a GitHub Actions artifact so reviewers can download and inspect outputs.
The viewer MUST show:
- The test prompt
- With-skill output (files created, actions taken)
- Baseline output (for comparison)
- Formal grades (assertion pass/fail with evidence)
- A feedback textbox for human reviewers
Scenario: Reviewer inspects eval outputs
- WHEN a reviewer downloads the eval viewer artifact from a PR's CI run
- THEN they see each test case with the prompt, skill output, baseline output, and assertion results side-by-side
Scenario: Human feedback captured
- WHEN a reviewer leaves feedback in the eval viewer and exports `feedback.json`
- THEN the feedback can be committed to `evals/feedback/` for the skill-creator improvement loop
Requirement: Cross-Skill Pipeline Testing
A dedicated evals/pipeline/ directory MUST contain end-to-end test scenarios that invoke multiple skills in sequence against a test repository. Pipeline tests MUST cover at least the core workflow: /design:spec → /design:plan → /design:work → /design:review.
Pipeline tests MUST run only on release branches or when manually triggered (they are expensive). Each pipeline test MUST use a disposable test repository (created via gh repo create --template or a local git init) to avoid polluting real repos.
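The local `git init` variant of a disposable repo could be sketched as a context manager like this (the helper name and prefix are hypothetical; the actual pipeline harness may differ):

```python
import shutil
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def disposable_repo(prefix: str = "pipeline-eval-"):
    """Create a throwaway git repository, yield its path, and remove it
    afterwards so pipeline tests never pollute a real repo."""
    tmp = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        if shutil.which("git"):  # initialize only when git is available
            subprocess.run(["git", "init", "-q", str(tmp)], check=True)
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

A pipeline test would run the spec → plan → work → review sequence inside the yielded path, with cleanup guaranteed even when a step fails.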
Scenario: Full pipeline test
- WHEN a pipeline eval runs the core workflow
- THEN it creates a spec, plans issues from it, implements one issue via worktree, and reviews the resulting PR — verifying that artifacts flow correctly between skills
Scenario: Pipeline test uses disposable repo
- WHEN a pipeline test starts
- THEN it creates a temporary repository, runs the skill sequence, and cleans up the repo after the test completes
Scenario: Pipeline test run on release
- WHEN a PR targets the `release` branch
- THEN pipeline tests are included in the full eval suite
Requirement: Description Optimization
After eval prompts are authored for all skills, the skill-creator's run_loop.py MUST be used to optimize each skill's description field in SKILL.md frontmatter for triggering accuracy. The optimization MUST use a train/test split of trigger eval queries to avoid overfitting. Optimized descriptions MUST be reviewed by a human before committing.
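The train/test split of trigger queries might look like the following sketch. The query shape (a `message` plus a `should_trigger` flag) is an assumption for illustration; `run_loop.py`'s real split logic is not reproduced here:

```python
import random


def split_trigger_queries(queries: list[dict], train_frac: float = 0.5,
                          seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Shuffle trigger-eval queries and split them into train/test sets.

    The description is optimized against the train half and scored on the
    held-out test half to detect overfitting."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    shuffled = queries[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```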
Scenario: Description optimization run
- WHEN all eval prompts are authored and baseline benchmarks are established
- THEN `run_loop.py` is run for each skill with 20 trigger eval queries (10 should-trigger, 10 should-not-trigger), producing an optimized description with train and test scores
Scenario: Optimized description reviewed
- WHEN `run_loop.py` produces a `best_description` for a skill
- THEN the before/after descriptions and scores are shown to a human reviewer before updating SKILL.md