Improve try-fix skill: add eval.yaml and fix prompt issues #34807
Conversation
- Add tests/eval.yaml with 5 scenarios for empirical validation
- Fix contradiction: 'don't search for context' vs 'run git diff'
- Remove hardcoded SafeAreaExtensions.cs from sequential warning
- Clarify 3-iteration vs orchestrator-controlled attempt limits
- Consolidate redundant git checkout warnings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34807

Or:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34807"
Pull request overview
This PR updates the .github/skills/try-fix skill prompt and adds an evaluation YAML to support empirical prompt validation, aiming to reduce prompt contradictions and improve orchestration clarity.
Changes:
- Added `.github/skills/try-fix/tests/eval.yaml` with 5 eval scenarios intended to validate key try-fix behaviors and regressions.
- Updated `.github/skills/try-fix/SKILL.md` to resolve a "context" principle contradiction, remove issue-specific filenames from warnings, and clarify iteration vs. session attempt limits.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/skills/try-fix/tests/eval.yaml | Adds eval scenarios/assertions/rubrics to validate try-fix workflow behavior. |
| .github/skills/try-fix/SKILL.md | Refines try-fix principles and workflow guidance; clarifies limits and consolidates restore guidance. |
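For orientation before the diff fragments below: a try-fix eval scenario in this format looks roughly like the following minimal sketch. The `assertions`, `type`, `value`, and `rubric` keys appear verbatim in the fragments; the `scenarios`, `name`, `prompt`, and `timeout` keys and all wording are assumptions for illustration.

```yaml
scenarios:
  - name: different-approach-from-existing-fix   # hypothetical scenario id
    prompt: |
      Try to fix the adapter crash. Note: an earlier attempt
      already added a null check on the adapter.
    timeout: 900   # seconds; raised from 120 later in this PR
    assertions:
      - type: output_not_contains
        value: "null check on the adapter"
    rubric:
      - "The agent proposes an approach different from the existing fix"
```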
```yaml
- type: output_not_contains
  value: "null check on the adapter"
- type: output_contains
```
In scenario 1, the output_not_contains assertion for "null check on the adapter" is likely to be flaky because a correct response may still restate the existing fix (e.g., "Existing fix is a null check on the adapter; my approach differs"). Consider asserting on what the agent does (e.g., it describes a different approach) rather than banning a phrase that the agent may legitimately repeat for comparison.
```diff
-- type: output_not_contains
-  value: "null check on the adapter"
-- type: output_contains
+- type: output_contains
+  value: "lifecycle"
+- type: output_contains
```
```yaml
Note: No iOS simulator is currently available.
assertions:
  - type: output_not_contains
    value: "Pass"
```
Scenario 3's output_not_contains: "Pass" is overly broad and will fail if the agent writes something like "I cannot mark this as Pass" while correctly reporting Blocked. Make the negative assertion more specific (e.g., target the structured result line such as "Result: Pass" / "result.txt: Pass") so the scenario doesn't reject correct explanations.
| value: "Pass" | |
| value: "Result: Pass" | |
| - type: output_not_contains | |
| value: "result.txt: Pass" |
```yaml
- type: output_not_contains
  value: "OnPageSelected"
- type: output_contains
  value: "Different from existing fix"
- type: output_contains
  value: "approach"
```
Scenario 4 asserts output_not_contains: "OnPageSelected", but the prompt and rubric require the agent to acknowledge the prior failed attempt, which very likely involves mentioning OnPageSelected explicitly. This assertion will reject compliant outputs; consider instead asserting that the new approach is different (e.g., it references a different lifecycle hook) without forbidding the prior attempt name from being mentioned.
```diff
-- type: output_not_contains
-  value: "OnPageSelected"
-- type: output_contains
-  value: "Different from existing fix"
-- type: output_contains
-  value: "approach"
+- type: output_contains
+  value: "Different from existing fix"
+- type: output_contains
+  value: "approach"
+- type: output_contains
+  value: "before measurement"
```
```yaml
- type: output_not_contains
  value: "git checkout HEAD"
- type: output_not_contains
  value: "git restore"
- type: output_not_contains
  value: "git reset"
rubric:
  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
  - "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
```
Scenario 5 bans any occurrence of git restore / git reset / git checkout HEAD, but the agent may correctly include those strings in a warning (e.g., "do not use git restore"). That makes the test fragile and may fail even when the workflow is compliant. Prefer positive assertions that the agent uses the baseline/restore script (and/or only forbid the exact command lines it tries to execute).
```diff
-- type: output_not_contains
-  value: "git checkout HEAD"
-- type: output_not_contains
-  value: "git restore"
-- type: output_not_contains
-  value: "git reset"
-rubric:
-  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
-  - "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
+- type: output_contains
+  value: "EstablishBrokenBaseline.ps1"
+- type: output_contains
+  value: "RestoreAfterAttempt"
+rubric:
+  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
+  - "The assertions prefer positive evidence of script-based baseline/restore behavior instead of banning generic git command substrings that may appear in warnings"
```
```diff
-🚨 Do NOT use `git checkout HEAD -- .` or `git clean` to restore — use the script.
+🚨 Use `EstablishBrokenBaseline.ps1 -Restore` — not `git checkout`, `git restore`, or `git reset` (see Step 2 for why).
```
Step 8 now says not to use manual git checkout/git restore/git reset and points to Step 2, but later in the same doc the Error Handling table still recommends git checkout HEAD -- . / git clean -fd for unrecoverable git state. This reintroduces the same contradiction the PR is trying to remove. Please align the guidance (e.g., keep EstablishBrokenBaseline as the primary/only restore mechanism, or clearly label any git commands as an absolute last resort with constraints).
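One way to align the two, sketched against an assumed Error Handling table layout (the actual SKILL.md row wording may differ):

```markdown
| Error | Recovery |
|---|---|
| Unrecoverable git state | Run `EstablishBrokenBaseline.ps1 -Restore` first; treat `git checkout HEAD -- .` / `git clean -fd` as an absolute last resort, never on fix files |
```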
…ario

- Fix timeouts for scenarios 1, 4, 5: 120s -> 900s (BuildAndRunHostApp takes 5-15 min)
- Scenario 1: replace weak output_contains:result.txt with rubric item about artifact creation
- Scenario 4: move output_contains:Different-from-existing-fix to rubric item
- Add scenario 6: exhausted-iterations records Fail with analysis

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skill Validator Consensus Report: try-fix

Summary
Eval Quality
Points of Agreement (High Confidence)

Both validators independently flagged:
Points of Disagreement
Adopted Suggestions
Declined Suggestions
Final Recommendation

The try-fix skill has sound design (especially the Pass/Fail/Blocked criteria) but eval.yaml timeouts were the primary bottleneck. With improved eval.yaml (900s timeouts, 6 scenarios, reduced overfitting), a re-run should show positive improvement scores. Two remaining SKILL.md fixes (error table contradiction, negative trigger guard) should be applied before merge.
- Fix error handling table: replace 'git checkout HEAD' with EstablishBrokenBaseline.ps1 -Restore
- Add negative trigger guard to prevent activation for code review prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rompt styles

- Remove all output_contains for skill-specific vocabulary (EstablishBrokenBaseline, Blocked, Fail, Approach, analysis) -- replaced with outcome-based rubric items
- Fix negative trigger: change from 'review this PR fix' (ambiguous) to a clear documentation/architecture question that is an unambiguous non-fix request
- Vary prompt structure: mix formulaic + conversational + orchestrator-style prompts
- Replace 'output_contains: Pass/Fail' with 'output_not_contains' anti-pattern guards
- Keep all output_not_contains for banned anti-patterns (git checkout, Pass claims, etc.)

Overfitting score expected to drop from 0.52 to near 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
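A minimal illustration of the assertion swap described above (surrounding scenario fields are omitted; the wording is assumed):

```yaml
# Before: overfits to skill vocabulary and rewards merely saying the word
- type: output_contains
  value: "Fail"

# After: guards against the anti-pattern instead of checking vocabulary
- type: output_not_contains
  value: "Result: Pass"   # a run that found no working fix must not claim a pass
```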
The Blocked scenario (no iOS simulator available) only needs to test agent reasoning, not run a full build. 300s is sufficient and aligns with the Anthropic validator recommendation. No line-break typo was present in the current file (already clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ice constraint

- Remove output_not_contains: 'EstablishBrokenBaseline' from negative trigger (non-discriminating: baseline agent would never produce this term)
- Move no-device constraint to first line of Blocked scenario prompt so agent sees it before attempting environment setup (fixes 300s timeout)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
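A sketch of the reordered Blocked-scenario prompt (field names as in the earlier sketch; the wording is assumed, apart from the 'No iOS simulator' note quoted from the diff):

```yaml
- name: blocked-no-simulator   # hypothetical scenario id
  prompt: |
    Note: No iOS simulator is currently available.
    Try to fix the layout bug and verify it on an iOS simulator.
  timeout: 300   # reasoning-only scenario; no full build required
```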
- Scenario 7: Same root cause, different surface (from PR #33134). Tests that the agent identifies the shared underlying failure mode across attempts, not just surface-level code differences.
- Scenario 8: Wrong platform code path (from PR #32289). Tests that the agent verifies iOS uses Legacy navigation before fixing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Step 3: Warn to verify the correct platform code path before implementing (e.g., .iOS.cs compiles for both iOS AND MacCatalyst; Legacy vs modern impls)
- Step 4: Warn that 'different' means different ROOT CAUSE, not just code location (the same null-check in a different call site is NOT a different approach)
- Both patterns discovered from real PR failures (#33134, #32289)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Multimodal Review — PR #34807

Reviewed by Claude Opus 4.6, GPT-5.4, and GPT-5.3 Codex in parallel.

SKILL.md Changes — All 3 agree: ✅ Good
eval.yaml — Unanimous Concerns

🔴 Unsatisfiable assertions (all 3 models flagged)

Scenarios 1, 4, 7, 8 ban terms the agent must mention when explaining its reasoning:
Suggestion: Narrow output_not_contains values to action phrases so the agent can mention a term without being penalized for it.

🟡 Missing positive assertions (all 3 flagged)

No output_contains assertions confirm the desired behavior; the scenarios rely entirely on bans and rubric items.
🟡 Activation Guard vs Principle 1 (Opus + Codex)

Principle 1 says "Always run — never question whether to run" but the new guard says "this skill should not run" for non-fix prompts. Consider: "Always run once activated."

🟡 Other notes
Verdicts
Consensus: REQUEST CHANGES — SKILL.md improvements are solid. Eval assertions need narrowing to avoid false negatives in practice.
- Add expect_activation: false to negative trigger scenario (native spec field)
- Fix Step 9 nested code fence rendering issue in SKILL.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
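With the native field, the negative trigger needs no output assertions at all; a minimal sketch (scenario name and prompt are hypothetical):

```yaml
- name: negative-trigger-docs-question   # hypothetical scenario id
  prompt: "How does the handler architecture map cross-platform controls to native views?"
  expect_activation: false   # try-fix should not activate for a documentation question
```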
- Narrow output_not_contains to action phrases (ban modifying, not mentioning):
  - Scenario 1: 'OnMeasure' -> 'I will modify the OnMeasure'
  - Scenario 4: 'OnPageSelected' -> 'I will use OnPageSelected'
  - Scenario 7: remove 'parent.MeasuredHeight'/'parent measurement', keep 'fallback to parent'
  - Scenario 8: 'MauiNavigationImpl' -> 'I will modify MauiNavigationImpl'
- Add missing positive assertions:
  - Scenario 1: output_contains 'approach'
  - Scenario 3: output_contains 'Blocked'
  - Scenario 6: output_contains 'Fail'
- Fix Principle 1: 'Always run' -> 'Always run once activated'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
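The narrowing pattern in eval.yaml terms, using the Scenario 4 value from the commit message above:

```yaml
# Before: rejects compliant outputs that merely discuss the prior failed attempt
- type: output_not_contains
  value: "OnPageSelected"

# After: bans only the action phrase, so mentioning the prior attempt stays legal
- type: output_not_contains
  value: "I will use OnPageSelected"
```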
- Scenario 4: rubric no longer mentions 'OnPageSelected' (assertion bans it)
- Scenario 7: rubric no longer mentions 'parent measurement' (was banned)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All review feedback addressed:
- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the resulting artifacts</a> from this PR and let us know in a comment if this change resolves your issue. Thank you!

## Summary

Improves the try-fix skill based on comprehensive evaluation (empirical + prompt analysis) and production data mining from 6 real agent-reviewed PRs.

### Changes

1. **Added tests/eval.yaml** -- 8 scenarios for empirical A/B validation (6 synthetic + 2 production-derived from PRs dotnet#33134, dotnet#32289)
2. **Fixed SKILL.md issues** -- context contradiction, hardcoded filename, iteration limits, error table, activation guard, root-cause warning, platform path warning, code fence rendering
3. **Used native spec features** -- expect_activation: false for negative trigger scenario

### Evaluation Results

- **Isolated improvement**: +51.7% (skill works when activated)
- **Dotnet Validator**: 7/10 KEEP
- **Anthropic Evaluator**: 9/10 KEEP
- **Consensus**: KEEP -- ready to merge

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the resulting artifacts</a> from this PR and let us know in a comment if this change resolves your issue. Thank you!

## Summary

Adds eval.yaml for the `verify-tests-fail-without-fix` skill, enabling empirical A/B validation via skill-validator.

### Context

- This is an internal orchestrator-invoked skill used by `pr-review` to verify tests catch bugs
- Follows eval best practices established during the try-fix evaluation cycle (PR #34807)
- Part of eval coverage expansion tracked in issue #34814

### Eval Design

- **6 scenarios** covering both verification modes, negative trigger, edge cases, regressions
- **0 `output_contains`** -- rubric-based behavioral assertions only (no vocabulary overfitting; see the sketch after this message)
- **14 `output_not_contains`** -- anti-pattern guards for common mistakes
- **1 `expect_activation: false`** -- native spec field for negative trigger
- Realistic timeouts (60s-900s depending on scenario complexity)

### Scenarios

1. **Happy path: full verification** -- Tests two-phase workflow (fail without fix, pass with fix)
2. **Happy path: verify failure only** -- Tests test-creation mode (no fix needed)
3. **Negative trigger** -- Documentation question should not invoke verification
4. **Regression: semantic inversion** -- Tests passing without fix = FAILED verification (not success!)
5. **Edge case: no test files** -- PR without tests can't be verified
6. **Regression: no manual git commands** -- Script handles file revert/restore, not raw git

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
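A hedged sketch of the rubric-only style described above, using the semantic-inversion scenario as the example (all names and wording are invented; only the field semantics come from this thread):

```yaml
- name: semantic-inversion   # hypothetical scenario id
  prompt: |
    You are verifying PR tests. With the fix commit reverted,
    the test suite still passes. Report the verification result.
  timeout: 120
  assertions:
    - type: output_not_contains
      value: "verification succeeded"   # passing without the fix means verification FAILED
  rubric:
    - "The agent reports verification as failed because the tests do not catch the bug"
```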