SPEC-0017: Skill Evaluation and CI Testing
Overview
A testing and evaluation framework for the design plugin's 15 skills, using skill-creator's eval infrastructure with GitHub Actions CI integration. Covers eval authoring, automated test runs, assertion-based grading, benchmark tracking, and cross-skill pipeline testing. See 📝 ADR-0021.
Requirements
Requirement: Eval Authoring
Every skill in skills/*/SKILL.md MUST have at least 2 test prompts defined in evals/evals.json. Each test prompt MUST include a realistic user message (the kind of thing a user would actually type), an expected_output description, and an assertions array with objectively verifiable checks. Test prompts MUST NOT be trivial one-liners — they SHOULD include context, file paths, and specifics that a real user would provide.
Skills MUST be grouped into tiers for test allocation:
- Tier 1 (plan, work, review, audit): 3-4 test prompts each
- Tier 2 (spec, check, discover, docs): 2-3 test prompts each
- Tier 3 (adr, init, prime, list, status, organize, enrich): 2 test prompts each
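Under these rules, an entry in `evals/evals.json` might look like the following sketch. The field names beyond `expected_output` and `assertions` (such as `skill`, `tier`, and `prompts`) are assumptions for illustration, since the exact skill-creator schema is not reproduced here:

```json
{
  "skill": "plan",
  "tier": 1,
  "prompts": [
    {
      "message": "Plan a sprint from SPEC-0014 for the claude-plugin-design repo on GitHub, creating an epic and story issues.",
      "expected_output": "An epic issue plus 3-4 story issues, each with a Requirements section and branch naming conventions.",
      "assertions": [
        { "text": "Epic issue created with 'epic' label" },
        { "text": "Each story has a ## Requirements section" }
      ]
    }
  ]
}
```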
Scenario: New skill added without evals
- WHEN a new skill is added to `skills/` and no corresponding entries exist in `evals/evals.json`
- THEN the CI workflow MUST fail with the message: "Skill '{name}' has no eval entries in evals/evals.json. Add at least 2 test prompts."
Scenario: Realistic test prompt quality
- WHEN an eval prompt is authored for `/design:plan`
- THEN the prompt includes a specific spec reference, mentions relevant flags, and provides enough context for the skill to operate (e.g., "Plan a sprint from SPEC-0014 for the claude-plugin-design repo on GitHub")
Scenario: Tier allocation respected
- WHEN `/design:plan` (Tier 1) has only 2 test prompts
- THEN CI SHOULD warn: "Tier 1 skill 'plan' has only 2 evals (minimum recommended: 3)"
Requirement: Automated Test Runner
A GitHub Actions workflow MUST exist at .github/workflows/skill-evals.yml that runs skill evaluations automatically. The workflow MUST trigger on pull requests that modify skills/**, references/**, or evals/**. The workflow MUST use claude-code-action to invoke run_eval.py from the skill-creator plugin for each modified skill's test prompts.
The workflow MUST support two modes:
- Quick mode (default on PRs): Run only Tier 3 evals and evals for modified skills
- Full mode (triggered by the `full-eval` label or on release branches): Run all tiers
Each eval run MUST execute both a with-skill run (skill loaded) and a baseline run (no skill) to measure the skill's value-add.
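The trigger paths and mode selection could be sketched roughly as follows in `.github/workflows/skill-evals.yml`. The step contents are illustrative only, not the actual workflow; the `release` base-branch name and `full-eval` label check follow this spec's conventions:

```yaml
name: skill-evals
on:
  pull_request:
    paths:
      - "skills/**"
      - "references/**"
      - "evals/**"

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Full mode on release branches or when the PR carries the full-eval
      # label; otherwise quick mode (Tier 3 evals + modified skills).
      - name: Select eval mode
        run: |
          if [ "${{ github.base_ref }}" = "release" ] || \
             ${{ contains(github.event.pull_request.labels.*.name, 'full-eval') }}; then
            echo "EVAL_MODE=full" >> "$GITHUB_ENV"
          else
            echo "EVAL_MODE=quick" >> "$GITHUB_ENV"
          fi
```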
Scenario: PR modifying plan skill triggers evals
- WHEN a PR modifies `skills/plan/SKILL.md`
- THEN the CI workflow runs all eval prompts for the `plan` skill plus Tier 3 evals, in quick mode
Scenario: Full eval on release branch
- WHEN a PR targets the `release` branch or has the `full-eval` label
- THEN the CI workflow runs all eval prompts across all tiers
Scenario: No skill changes in PR
- WHEN a PR modifies only `docs/` or `README.md` (no skills or references changed)
- THEN the eval workflow does not run (skip)
Requirement: Assertion-Based Grading
Each eval prompt MUST have at least 2 grading assertions in evals/evals.json. Assertions MUST be objectively verifiable — not subjective quality judgments. Assertions MUST use the text, passed, and evidence fields expected by the eval viewer.
Common assertion types for design plugin skills:
- File existence: "ADR file was created at docs/adrs/ADR-XXXX-*.md"
- Content structure: "Spec contains a ## Requirements section with at least one ### Requirement:"
- Section presence: "Security Requirements section present in web-facing spec"
- Format compliance: "YAML frontmatter contains status: proposed"
- Reference correctness: "Governing comment references the correct ADR number"
- Tool usage: "Skill used Glob to find existing artifacts before creating new ones"
Assertions that test content quality (e.g., "the ADR rationale is compelling") SHOULD be deferred to human review via the eval viewer, not automated.
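A grader along these lines could implement the objective assertion types above. The `text`, `passed`, and `evidence` fields follow this requirement; the helper functions themselves are an illustrative sketch, not skill-creator's actual grader:

```python
import re
from pathlib import Path


def check_section(spec_text: str, heading: str) -> dict:
    """Objectively verifiable check: does a markdown heading exist?

    Returns the text/passed/evidence fields the eval viewer expects."""
    pattern = rf"^##+\s+{re.escape(heading)}\b"
    found = re.search(pattern, spec_text, flags=re.MULTILINE)
    return {
        "text": f"{heading} section present",
        "passed": found is not None,
        "evidence": (found.group(0) if found
                     else f"No ## {heading} section found in generated spec"),
    }


def check_file_exists(path: str) -> dict:
    """Objectively verifiable check: was the artifact written to disk?"""
    exists = Path(path).exists()
    return {
        "text": f"File exists at {path}",
        "passed": exists,
        "evidence": f"{path} {'found' if exists else 'missing'}",
    }
```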
Scenario: Grading a plan skill eval
- WHEN the grader evaluates a `/design:plan` test run
- THEN it checks assertions like: "Epic issue created with 'epic' label", "3-4 story issues created", "Each story has a ## Requirements section", "Branch naming conventions present in issue bodies"
Scenario: Assertion failure reported
- WHEN an assertion fails (e.g., "Security Requirements section present" but it's missing)
- THEN the grading output includes `passed: false` and `evidence: "No ## Security Requirements section found in generated spec.md"`
Requirement: Benchmark Tracking
After grading, the workflow MUST run aggregate_benchmark.py to produce benchmark.json with pass rates, timing, and token usage per skill. Benchmark results MUST be posted as a PR comment showing:
- Per-skill pass rate (with delta from main branch if available)
- Mean execution time per skill
- Mean token usage per skill
- Any skills that dropped below the 80% pass rate threshold
Benchmark data MUST be committed to evals/benchmarks/ on merge to main, enabling cross-release comparison.
Scenario: Benchmark posted on PR
- WHEN evals complete on a PR
- THEN a comment is posted with a summary table showing each skill's pass rate, timing, and token usage
Scenario: Pass rate regression detected
- WHEN a Tier 1 skill (plan, work, review, audit) drops below 80% pass rate
- THEN the CI check MUST fail with: "Tier 1 skill '{name}' pass rate is {N}% (threshold: 80%)"
Scenario: Benchmark history available
- WHEN a developer wants to see how skill performance has changed over releases
- THEN they can compare `evals/benchmarks/` across commits to see trends
Requirement: Eval Viewer Integration
After test runs complete, the workflow MUST generate a static HTML eval viewer using generate_review.py --static for human review of qualitative outputs. The viewer MUST be uploaded as a GitHub Actions artifact so reviewers can download and inspect outputs.
The viewer MUST show:
- The test prompt
- With-skill output (files created, actions taken)
- Baseline output (for comparison)
- Formal grades (assertion pass/fail with evidence)
- A feedback textbox for human reviewers
Scenario: Reviewer inspects eval outputs
- WHEN a reviewer downloads the eval viewer artifact from a PR's CI run
- THEN they see each test case with the prompt, skill output, baseline output, and assertion results side-by-side
Scenario: Human feedback captured
- WHEN a reviewer leaves feedback in the eval viewer and exports `feedback.json`
- THEN the feedback can be committed to `evals/feedback/` for the skill-creator improvement loop
Requirement: Cross-Skill Pipeline Testing
A dedicated evals/pipeline/ directory MUST contain end-to-end test scenarios that invoke multiple skills in sequence against a test repository. Pipeline tests MUST cover at least the core workflow: /design:spec → /design:plan → /design:work → /design:review.
Pipeline tests MUST run only on release branches or when manually triggered (they are expensive). Each pipeline test MUST use a disposable test repository (created via gh repo create --template or a local git init) to avoid polluting real repos.
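The local `git init` variant of a disposable repo could be sketched as a context manager like this (the helper name and prefix are hypothetical; the actual pipeline harness may differ):

```python
import shutil
import subprocess
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def disposable_repo(prefix: str = "pipeline-eval-"):
    """Create a throwaway git repository, yield its path, and remove it
    afterwards so pipeline tests never pollute a real repo."""
    tmp = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        if shutil.which("git"):  # initialize only when git is available
            subprocess.run(["git", "init", "-q", str(tmp)], check=True)
        yield tmp
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
```

A pipeline test would run the spec → plan → work → review sequence inside the yielded path, with cleanup guaranteed even when a step fails.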
Scenario: Full pipeline test
- WHEN a pipeline eval runs the core workflow
- THEN it creates a spec, plans issues from it, implements one issue via worktree, and reviews the resulting PR — verifying that artifacts flow correctly between skills
Scenario: Pipeline test uses disposable repo
- WHEN a pipeline test starts
- THEN it creates a temporary repository, runs the skill sequence, and cleans up the repo after the test completes
Scenario: Pipeline test run on release
- WHEN a PR targets the `release` branch
- THEN pipeline tests are included in the full eval suite
Requirement: Description Optimization
After eval prompts are authored for all skills, the skill-creator's run_loop.py MUST be used to optimize each skill's description field in SKILL.md frontmatter for triggering accuracy. The optimization MUST use a train/test split of trigger eval queries to avoid overfitting. Optimized descriptions MUST be reviewed by a human before committing.
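The train/test split of trigger queries might look like the following sketch. The query shape (a `message` plus a `should_trigger` flag) is an assumption for illustration; `run_loop.py`'s real split logic is not reproduced here:

```python
import random


def split_trigger_queries(queries: list[dict], train_frac: float = 0.5,
                          seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Shuffle trigger-eval queries and split them into train/test sets.

    The description is optimized against the train half and scored on the
    held-out test half to detect overfitting."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    shuffled = queries[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```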
Scenario: Description optimization run
- WHEN all eval prompts are authored and baseline benchmarks are established
- THEN `run_loop.py` is run for each skill with 20 trigger eval queries (10 should-trigger, 10 should-not-trigger), producing an optimized description with train and test scores
Scenario: Optimized description reviewed
- WHEN `run_loop.py` produces a `best_description` for a skill
- THEN the before/after descriptions and scores are shown to a human reviewer before updating SKILL.md