
docs: add quality evaluation mode to manual testing plan spec#39

Open
jlevy wants to merge 9 commits into main from claude/review-manual-testing-specs-YCe6J



jlevy commented Jan 31, 2026

Owner

Summary

  • Adds Quality Evaluation use case for search engines, recommendations, ML outputs
  • Adds Phase VI: Comparison and Evaluation Modes with side-by-side display, script/LLM evaluators
  • Updates validation enum: binary | manual | evaluation
  • Adds new comparison modes: diff | side-by-side | baseline
  • Documents evaluation strategies and outstanding questions

This extends the manual testing spec to address workflows where outputs may legitimately differ but quality should remain comparable.
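As a rough illustration of the updated validation enum, the frontmatter value could be checked along these lines. This is only a sketch: `ValidationMode`, `parseValidationMode`, and the fallback to `binary` when the field is absent are assumptions for illustration, not names or behavior taken from the spec.

```typescript
// Hypothetical sketch; these names are not taken from the spec or from tryscript.
const VALIDATION_MODES = ["binary", "manual", "evaluation"] as const;
type ValidationMode = (typeof VALIDATION_MODES)[number];

function parseValidationMode(value: unknown): ValidationMode {
  // Assumption: a missing field falls back to strict pass/fail.
  if (value === undefined) return "binary";
  if (
    typeof value === "string" &&
    (VALIDATION_MODES as readonly string[]).includes(value)
  ) {
    return value as ValidationMode;
  }
  throw new Error(
    `Invalid validation mode: ${String(value)} (expected ${VALIDATION_MODES.join(" | ")})`
  );
}
```

Rejecting unknown values early keeps a typo in the frontmatter from silently downgrading a test to a weaker mode.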

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Design document for supporting "manual" test scripts that facilitate
human/agent review rather than strict pass/fail automation. Addresses
use cases for LLM responses, web scraping, visual UX, and other
variable outputs.

Key features proposed:
- Documentation playbook with patterns and anti-patterns
- --review mode for update + diff display
- validation: binary|manual frontmatter option
- Review annotations in test files
- CI integration patterns

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Local tbd state from running tbd prime for issue tracking context.

https://claude.ai/code/session_013zTMZFAZESM7uy9oAAxCKN

Extends the manual testing workflows spec with:
- Quality Evaluation use case (search engines, recommendations, ML outputs)
- Phase VI: Comparison and Evaluation Modes
  - Side-by-side comparison display (not just diff)
  - Script-based evaluation with thresholds
  - LLM-based evaluation (future)
  - Human judgment with structured criteria
- Updated validation enum: binary | manual | evaluation
- New comparison modes: diff | side-by-side | baseline
- Outstanding questions for evaluation semantics

This addresses workflows where outputs may legitimately differ but
quality should remain comparable - e.g., search results where ordering
changes but relevance should be maintained.
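A script-based evaluator with a threshold might look roughly like the following for the search-results case. This is a sketch under stated assumptions: the overlap metric, the `EvaluationResult` shape, and the function names are hypothetical and are not defined by the spec.

```typescript
// Hypothetical evaluator sketch; the metric and names are illustrative assumptions.
interface EvaluationResult {
  score: number;
  passed: boolean;
}

// Fraction of the baseline's top-k result IDs that still appear in the
// current top-k, regardless of order.
function overlapAtK(baseline: string[], current: string[], k: number): number {
  const baselineTop = Array.from(new Set(baseline.slice(0, k)));
  if (baselineTop.length === 0) return 1; // nothing to preserve
  const currentTop = new Set(current.slice(0, k));
  const shared = baselineTop.filter((id) => currentTop.has(id)).length;
  return shared / baselineTop.length;
}

function evaluateSearchResults(
  baseline: string[],
  current: string[],
  k: number,
  threshold: number
): EvaluationResult {
  const score = overlapAtK(baseline, current, k);
  return { score, passed: score >= threshold };
}
```

With this metric, a pure reordering of the same results scores 1, while replacing results lowers the score toward the threshold.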

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

github-actions Bot commented Jan 31, 2026


Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 93.29% | 2557 / 2741 |
| 🔵 | Statements | 93.29% | 2557 / 2741 |
| 🔵 | Functions | 35.76% | 54 / 151 |
| 🔵 | Branches | 36.87% | 243 / 659 |

File Coverage

No changed files found.

Generated in workflow #134 for commit a12891e by the Vitest Coverage Report Action
jlevy changed the base branch from claude/tryscript-manual-testing-ZPMvS to main on January 31, 2026, 19:26

The .tbd/.gitignore from tbd v0.1.3 was missing entries for docs/
and state.yml that are present in current versions. A previous session
ran `tbd prime` which created these files, then committed them.

- Updated .tbd/.gitignore to match tbd v0.1.12 template
- Removed .tbd/docs/ from tracking (regenerated on setup)
- Removed .tbd/state.yml from tracking (local state)
- Updated tbd_version in config.yml

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Workflows section now shows pnpm/package.json test scripts
- CI just calls the same scripts developers use locally
- Phase V simplified to be CI-agnostic
- Removed verbose GitHub Actions examples
- Cleaner, more minimal examples throughout

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- Consolidate redundant use case sections (web scraping, visual, interactive
  all merged into "manual review testing")
- Fix nested fence syntax (use 4+ backticks for outer fences)
- Fix elision patterns: use `...` not `[.. text ..]`
- Remove verbose REVIEW/EVALUATE comment syntax
- Simplify Phase IV to just recommend markdown for review guidance
- Reduce overall spec size by ~270 lines

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Simplified the manual testing workflows spec significantly:

- Phase I: Manual testing workflow (--review, validation frontmatter, playbook)
- Phase II: Quality evaluation (comparison modes, evaluators)

Reduced from ~950 lines to ~220 lines by:
- Removing redundant examples
- Consolidating related features into single phases
- Keeping only essential implementation details
- Streamlining outstanding questions

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

Elision should only hide irrelevant noise (timestamps, paths), not the
content being evaluated. For manual review, you need to see the actual
output to assess quality.
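To make the distinction concrete, here is a minimal sketch of a matcher that treats `...` in the expected text as a wildcard within one line. It is an illustration only: `matchesWithElision` is a hypothetical helper, and the single-line wildcard semantics are an assumption, not a description of how tryscript implements elision.

```typescript
// Hypothetical sketch; not tryscript's actual elision implementation.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Treats each literal `...` in `expected` as "any characters on this line".
function matchesWithElision(expected: string, actual: string): boolean {
  const pattern = expected.split("...").map(escapeRegExp).join("[^\\n]*");
  return new RegExp(`^${pattern}$`).test(actual);
}

// Elides only the noisy timing value; the result count is still checked.
const expectedLine = "Fetched 3 results in ... ms";
const timingOnlyDiffers = matchesWithElision(expectedLine, "Fetched 3 results in 42 ms");
const contentDiffers = matchesWithElision(expectedLine, "Fetched 5 results in 42 ms");
```

In this sketch `timingOnlyDiffers` is true and `contentDiffers` is false, because the elision covers the timing but leaves the result count visible to the reviewer.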

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5

- `--review` updates files, lists what changed; use `git diff` to review
- Remove custom comparison modes (diff, side-by-side, baseline)
- Git already provides all this functionality with better tooling
- Phase II simplified to just evaluator scripts for automated quality gates
- Users can configure Git with delta/diff-so-fancy or use IDE/GitHub

https://claude.ai/code/session_01UaMxx1PpoaJFrA6x94cwF5


2 participants