docs: add quality evaluation mode to manual testing plan spec by jlevy · Pull Request #39 · jlevy/tryscript

jlevy · 2026-01-31T19:22:14Z

Summary

- Adds Quality Evaluation use case for search engines, recommendations, ML outputs
- Adds Phase VI: Comparison and Evaluation Modes with side-by-side display, script/LLM evaluators
- Updates validation enum: binary | manual | evaluation
- Adds new comparison modes: diff | side-by-side | baseline
- Documents evaluation strategies and outstanding questions

This extends the manual testing spec to address workflows where outputs may legitimately differ but quality should remain comparable.
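As a rough illustration of the three-valued validation setting listed in the summary, the sketch below models it as a TypeScript union type with a small parser. The names `ValidationMode`, `isValidationMode`, and `parseValidationMode` are hypothetical, and the fallback to `binary` when the field is absent is an assumption; neither is taken from the tryscript codebase or the spec.

```typescript
// Hypothetical sketch: these names are illustrative, not tryscript's actual API.
type ValidationMode = "binary" | "manual" | "evaluation";

// Type guard that narrows an arbitrary frontmatter value to a known mode.
function isValidationMode(value: unknown): value is ValidationMode {
  return value === "binary" || value === "manual" || value === "evaluation";
}

// Parse the `validation` field read from a test file's frontmatter.
// Assumption: a missing field falls back to strict pass/fail ("binary").
function parseValidationMode(value: unknown): ValidationMode {
  if (value === undefined) {
    return "binary";
  }
  if (!isValidationMode(value)) {
    throw new Error(`Unknown validation mode: ${String(value)}`);
  }
  return value;
}
```

Rejecting unknown values up front, rather than silently treating them as `binary`, keeps a typo in a test file from quietly changing how that test is judged.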

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Design document for supporting "manual" test scripts that facilitate human/agent review rather than strict pass/fail automation. Addresses use cases for LLM responses, web scraping, visual UX, and other variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- --review mode for update + diff display
- validation: binary|manual frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Local tbd state from running tbd prime for issue tracking context. https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Extends the manual testing workflows spec with:
- Quality Evaluation use case (search engines, recommendations, ML outputs)
- Phase VI: Comparison and Evaluation Modes
  - Side-by-side comparison display (not just diff)
  - Script-based evaluation with thresholds
  - LLM-based evaluation (future)
  - Human judgment with structured criteria
- Updated validation enum: binary | manual | evaluation
- New comparison modes: diff | side-by-side | baseline
- Outstanding questions for evaluation semantics

This addresses workflows where outputs may legitimately differ but quality should remain comparable, e.g., search results where ordering changes but relevance should be maintained.

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
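To make "script-based evaluation with thresholds" concrete for the search-results case, here is a minimal sketch of one possible evaluator. It scores a run by the fraction of the baseline's top-k result IDs that still appear in the current top-k, ignoring order, and passes when that score meets a threshold. The metric, the function names `topKOverlap` and `evaluateOverlap`, and the result shape are all assumptions chosen for illustration; the spec does not prescribe them.

```typescript
// Hypothetical evaluator sketch; the metric and names are not from the spec.

// Fraction of the baseline's top-k IDs that also appear in the current top-k.
// Order within the top-k is deliberately ignored.
function topKOverlap(
  baseline: readonly string[],
  current: readonly string[],
  k: number
): number {
  const base = baseline.slice(0, k);
  if (base.length === 0) {
    return 1; // Nothing to preserve, so treat as fully preserved.
  }
  const cur = current.slice(0, k);
  let shared = 0;
  for (const id of base) {
    if (cur.indexOf(id) !== -1) {
      shared += 1;
    }
  }
  return shared / base.length;
}

interface EvaluationResult {
  score: number;
  pass: boolean;
}

// Pass when the overlap score is at or above the caller-supplied threshold.
function evaluateOverlap(
  baseline: readonly string[],
  current: readonly string[],
  k: number,
  threshold: number
): EvaluationResult {
  const score = topKOverlap(baseline, current, k);
  return { score, pass: score >= threshold };
}
```

With a baseline of `["a", "b", "c"]` and a current run of `["c", "a", "d"]`, two of the three baseline IDs survive, so a threshold of 0.6 would pass while 0.9 would fail, even though the ordering changed.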

github-actions · 2026-01-31T19:22:55Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	93.29%	2557 / 2741
🔵	Statements	93.29%	2557 / 2741
🔵	Functions	35.76%	54 / 151
🔵	Branches	36.87%	243 / 659

File Coverage

No changed files found.

Generated in workflow #134 for commit a12891e by the Vitest Coverage Report Action

The .tbd/.gitignore from tbd v0.1.3 was missing entries for docs/ and state.yml that are present in current versions. A previous session ran `tbd prime`, which created these files, then committed them.

- Updated .tbd/.gitignore to match tbd v0.1.12 template
- Removed .tbd/docs/ from tracking (regenerated on setup)
- Removed .tbd/state.yml from tracking (local state)
- Updated tbd_version in config.yml

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Workflows section now shows pnpm/package.json test scripts
- CI just calls the same scripts developers use locally
- Phase V simplified to be CI-agnostic
- Removed verbose GitHub Actions examples
- Cleaner, more minimal examples throughout

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Consolidate redundant use case sections (web scraping, visual, interactive all merged into "manual review testing")
- Fix nested fence syntax (use 4+ backticks for outer fences)
- Fix elision patterns: use `...` not `[.. text ..]`
- Remove verbose REVIEW/EVALUATE comment syntax
- Simplify Phase IV to just recommend markdown for review guidance
- Reduce overall spec size by ~270 lines

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Simplified the manual testing workflows spec significantly:
- Phase I: Manual testing workflow (--review, validation frontmatter, playbook)
- Phase II: Quality evaluation (comparison modes, evaluators)

Reduced from ~950 lines to ~220 lines by:
- Removing redundant examples
- Consolidating related features into single phases
- Keeping only essential implementation details
- Streamlining outstanding questions

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Elision should only hide irrelevant noise (timestamps, paths), not the content being evaluated. For manual review, you need to see the actual output to assess quality. https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5
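The distinction between noise and evaluated content can be sketched with a small normalizer that masks only ISO-style timestamps and absolute paths, leaving the rest of the output untouched. This is an illustration of the guidance, not how tryscript implements elision: the function name `maskNoise` and both regular expressions are assumptions, and the `...` placeholder is borrowed from the elision syntax mentioned in this PR.

```typescript
// Hypothetical sketch of "hide noise, keep content"; not tryscript's implementation.

// ISO 8601-style timestamps such as 2026-01-31T19:22:14Z.
const TIMESTAMP = /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?/g;

// Absolute Unix-style paths with at least two segments, such as /tmp/run-1/out.txt.
const ABSOLUTE_PATH = /(?:\/[\w.-]+){2,}/g;

// Replace run-specific noise with "..." while leaving the content under review visible.
function maskNoise(output: string): string {
  return output.replace(TIMESTAMP, "...").replace(ABSOLUTE_PATH, "...");
}
```

Only the values that vary from run to run are masked. The text a reviewer needs in order to judge quality, such as result counts or the results themselves, passes through unchanged.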

- `--review` updates files, lists what changed; use `git diff` to review
- Remove custom comparison modes (diff, side-by-side, baseline)
- Git already provides all this functionality with better tooling
- Phase II simplified to just evaluator scripts for automated quality gates
- Users can configure Git with delta/diff-so-fancy or use IDE/GitHub

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

claude added 3 commits January 31, 2026 02:54

chore: add tbd docs cache and state

c560ec1


jlevy changed the base branch from claude/tryscript-manual-testing-ZPMvS to main January 31, 2026 19:26

claude added 6 commits January 31, 2026 19:30

docs: clarify elision pattern guidance

23fd3a0


jlevy wants to merge 9 commits into main from claude/review-manual-testing-specs-YCe6J
