Improve try-fix skill: add eval.yaml and fix prompt issues #34807
Conversation
- Add tests/eval.yaml with 5 scenarios for empirical validation
- Fix contradiction: 'don't search for context' vs 'run git diff'
- Remove hardcoded SafeAreaExtensions.cs from sequential warning
- Clarify 3-iteration vs orchestrator-controlled attempt limits
- Consolidate redundant git checkout warnings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
🚀 Dogfood this PR with:

curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34807

Or:

iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34807"
Pull request overview
This PR updates the .github/skills/try-fix skill prompt and adds an evaluation YAML to support empirical prompt validation, aiming to reduce prompt contradictions and improve orchestration clarity.
Changes:
- Added `.github/skills/try-fix/tests/eval.yaml` with 5 eval scenarios intended to validate key try-fix behaviors and regressions.
- Updated `.github/skills/try-fix/SKILL.md` to resolve a "context" principle contradiction, remove issue-specific filenames from warnings, and clarify iteration vs. session attempt limits.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| .github/skills/try-fix/tests/eval.yaml | Adds eval scenarios/assertions/rubrics to validate try-fix workflow behavior. |
| .github/skills/try-fix/SKILL.md | Refines try-fix principles and workflow guidance; clarifies limits and consolidates restore guidance. |
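For orientation before the diff fragments below: a try-fix eval scenario in this format looks roughly like the following minimal sketch. The `assertions`, `type`, `value`, and `rubric` keys appear verbatim in the fragments; the `scenarios`, `name`, `prompt`, and `timeout` keys and all wording are assumptions for illustration.

```yaml
scenarios:
  - name: different-approach-from-existing-fix   # hypothetical scenario id
    prompt: |
      Try to fix the adapter crash. Note: an earlier attempt
      already added a null check on the adapter.
    timeout: 900   # seconds; raised from 120 later in this PR
    assertions:
      - type: output_not_contains
        value: "null check on the adapter"
    rubric:
      - "The agent proposes an approach different from the existing fix"
```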
```yaml
- type: output_not_contains
  value: "null check on the adapter"
- type: output_contains
```
In scenario 1, the output_not_contains assertion for "null check on the adapter" is likely to be flaky because a correct response may still restate the existing fix (e.g., "Existing fix is a null check on the adapter; my approach differs"). Consider asserting on what the agent does (e.g., it describes a different approach) rather than banning a phrase that the agent may legitimately repeat for comparison.
```diff
-- type: output_not_contains
-  value: "null check on the adapter"
-- type: output_contains
+- type: output_contains
+  value: "lifecycle"
+- type: output_contains
```
```yaml
Note: No iOS simulator is currently available.
assertions:
  - type: output_not_contains
    value: "Pass"
```
Scenario 3's output_not_contains: "Pass" is overly broad and will fail if the agent writes something like "I cannot mark this as Pass" while correctly reporting Blocked. Make the negative assertion more specific (e.g., target the structured result line such as "Result: Pass" / "result.txt: Pass") so the scenario doesn't reject correct explanations.
| value: "Pass" | |
| value: "Result: Pass" | |
| - type: output_not_contains | |
| value: "result.txt: Pass" |
```yaml
- type: output_not_contains
  value: "OnPageSelected"
- type: output_contains
  value: "Different from existing fix"
- type: output_contains
  value: "approach"
```
Scenario 4 asserts output_not_contains: "OnPageSelected", but the prompt and rubric require the agent to acknowledge the prior failed attempt, which very likely involves mentioning OnPageSelected explicitly. This assertion will reject compliant outputs; consider instead asserting that the new approach is different (e.g., it references a different lifecycle hook) without forbidding the prior attempt name from being mentioned.
```diff
-- type: output_not_contains
-  value: "OnPageSelected"
-- type: output_contains
-  value: "Different from existing fix"
-- type: output_contains
-  value: "approach"
+- type: output_contains
+  value: "Different from existing fix"
+- type: output_contains
+  value: "approach"
+- type: output_contains
+  value: "before measurement"
```
```yaml
- type: output_not_contains
  value: "git checkout HEAD"
- type: output_not_contains
  value: "git restore"
- type: output_not_contains
  value: "git reset"
rubric:
  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
  - "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
```
Scenario 5 bans any occurrence of git restore / git reset / git checkout HEAD, but the agent may correctly include those strings in a warning (e.g., "do not use git restore"). That makes the test fragile and may fail even when the workflow is compliant. Prefer positive assertions that the agent uses the baseline/restore script (and/or only forbid the exact command lines it tries to execute).
```diff
-- type: output_not_contains
-  value: "git checkout HEAD"
-- type: output_not_contains
-  value: "git restore"
-- type: output_not_contains
-  value: "git reset"
-rubric:
-  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
-  - "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
+- type: output_contains
+  value: "EstablishBrokenBaseline.ps1"
+- type: output_contains
+  value: "RestoreAfterAttempt"
+rubric:
+  - "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
+  - "The assertions prefer positive evidence of script-based baseline/restore behavior instead of banning generic git command substrings that may appear in warnings"
```
```diff
-🚨 Do NOT use `git checkout HEAD -- .` or `git clean` to restore — use the script.
+🚨 Use `EstablishBrokenBaseline.ps1 -Restore` — not `git checkout`, `git restore`, or `git reset` (see Step 2 for why).
```
Step 8 now says not to use manual git checkout/git restore/git reset and points to Step 2, but later in the same doc the Error Handling table still recommends git checkout HEAD -- . / git clean -fd for unrecoverable git state. This reintroduces the same contradiction the PR is trying to remove. Please align the guidance (e.g., keep EstablishBrokenBaseline as the primary/only restore mechanism, or clearly label any git commands as an absolute last resort with constraints).
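One way to align the two, sketched against an assumed Error Handling table layout (the actual SKILL.md row wording may differ):

```markdown
| Error | Recovery |
|---|---|
| Unrecoverable git state | Run `EstablishBrokenBaseline.ps1 -Restore` first; treat `git checkout HEAD -- .` / `git clean -fd` as an absolute last resort, never on fix files |
```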
…ario

- Fix timeouts for scenarios 1, 4, 5: 120s -> 900s (BuildAndRunHostApp takes 5-15 min)
- Scenario 1: replace weak output_contains:result.txt with rubric item about artifact creation
- Scenario 4: move output_contains:Different-from-existing-fix to rubric item
- Add scenario 6: exhausted-iterations records Fail with analysis

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Skill Validator Consensus Report: try-fix

Summary
Eval Quality
Points of Agreement (High Confidence)

Both validators independently flagged:
Points of Disagreement
Adopted Suggestions
Declined Suggestions
Final Recommendation

The try-fix skill has sound design (especially the Pass/Fail/Blocked criteria) but eval.yaml timeouts were the primary bottleneck. With improved eval.yaml (900s timeouts, 6 scenarios, reduced overfitting), a re-run should show positive improvement scores. Two remaining SKILL.md fixes (error table contradiction, negative trigger guard) should be applied before merge.
- Fix error handling table: replace 'git checkout HEAD' with EstablishBrokenBaseline.ps1 -Restore
- Add negative trigger guard to prevent activation for code review prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rompt styles

- Remove all output_contains for skill-specific vocabulary (EstablishBrokenBaseline, Blocked, Fail, Approach, analysis) -- replaced with outcome-based rubric items
- Fix negative trigger: change from 'review this PR fix' (ambiguous) to a clear documentation/architecture question that is an unambiguous non-fix request
- Vary prompt structure: mix formulaic + conversational + orchestrator-style prompts
- Replace 'output_contains: Pass/Fail' with 'output_not_contains' anti-pattern guards
- Keep all output_not_contains for banned anti-patterns (git checkout, Pass claims, etc.)

Overfitting score expected to drop from 0.52 to near 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
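A minimal illustration of the assertion swap described above (surrounding scenario fields are omitted; the wording is assumed):

```yaml
# Before: overfits to skill vocabulary and rewards merely saying the word
- type: output_contains
  value: "Fail"

# After: guards against the anti-pattern instead of checking vocabulary
- type: output_not_contains
  value: "Result: Pass"   # a run that found no working fix must not claim a pass
```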
The Blocked scenario (no iOS simulator available) only needs to test agent reasoning, not run a full build. 300s is sufficient and aligns with the Anthropic validator recommendation. No line-break typo was present in the current file (already clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ice constraint

- Remove output_not_contains: 'EstablishBrokenBaseline' from negative trigger (non-discriminating: baseline agent would never produce this term)
- Move no-device constraint to first line of Blocked scenario prompt so agent sees it before attempting environment setup (fixes 300s timeout)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
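A sketch of the reordered Blocked-scenario prompt (field names as in the earlier sketch; the wording is assumed, apart from the 'No iOS simulator' note quoted from the diff):

```yaml
- name: blocked-no-simulator   # hypothetical scenario id
  prompt: |
    Note: No iOS simulator is currently available.
    Try to fix the layout bug and verify it on an iOS simulator.
  timeout: 300   # reasoning-only scenario; no full build required
```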
- Scenario 7: Same root cause, different surface (from PR #33134). Tests that the agent identifies the shared underlying failure mode across attempts, not just surface-level code differences.
- Scenario 8: Wrong platform code path (from PR #32289). Tests that the agent verifies iOS uses Legacy navigation before fixing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Step 3: Warn to verify the correct platform code path before implementing (e.g., .iOS.cs compiles for both iOS AND MacCatalyst; Legacy vs modern impls)
- Step 4: Warn that 'different' means different ROOT CAUSE, not just code location (the same null-check in a different call site is NOT a different approach)
- Both patterns discovered from real PR failures (#33134, #32289)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Multimodal Review — PR #34807

Reviewed by Claude Opus 4.6, GPT-5.4, and GPT-5.3 Codex in parallel.

SKILL.md Changes — All 3 agree: ✅ Good
eval.yaml — Unanimous Concerns

🔴 Unsatisfiable assertions (all 3 models flagged)

Scenarios 1, 4, 7, 8 ban terms the agent must mention when explaining its reasoning:
Suggestion: Narrow output_not_contains values to action phrases so the agent can mention a term without being penalized for it.

🟡 Missing positive assertions (all 3 flagged)

No output_contains assertions confirm the desired behavior; the scenarios rely entirely on bans and rubric items.
🟡 Activation Guard vs Principle 1 (Opus + Codex)

Principle 1 says "Always run — never question whether to run" but the new guard says "this skill should not run" for non-fix prompts. Consider: "Always run once activated."

🟡 Other notes
Verdicts
Consensus: REQUEST CHANGES — SKILL.md improvements are solid. Eval assertions need narrowing to avoid false negatives in practice.
- Add expect_activation: false to negative trigger scenario (native spec field)
- Fix Step 9 nested code fence rendering issue in SKILL.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
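With the native field, the negative trigger needs no output assertions at all; a minimal sketch (scenario name and prompt are hypothetical):

```yaml
- name: negative-trigger-docs-question   # hypothetical scenario id
  prompt: "How does the handler architecture map cross-platform controls to native views?"
  expect_activation: false   # try-fix should not activate for a documentation question
```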
- Narrow output_not_contains to action phrases (ban modifying, not mentioning):
  - Scenario 1: 'OnMeasure' -> 'I will modify the OnMeasure'
  - Scenario 4: 'OnPageSelected' -> 'I will use OnPageSelected'
  - Scenario 7: remove 'parent.MeasuredHeight'/'parent measurement', keep 'fallback to parent'
  - Scenario 8: 'MauiNavigationImpl' -> 'I will modify MauiNavigationImpl'
- Add missing positive assertions:
  - Scenario 1: output_contains 'approach'
  - Scenario 3: output_contains 'Blocked'
  - Scenario 6: output_contains 'Fail'
- Fix Principle 1: 'Always run' -> 'Always run once activated'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
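The narrowing pattern in eval.yaml terms, using the Scenario 4 value from the commit message above:

```yaml
# Before: rejects compliant outputs that merely discuss the prior failed attempt
- type: output_not_contains
  value: "OnPageSelected"

# After: bans only the action phrase, so mentioning the prior attempt stays legal
- type: output_not_contains
  value: "I will use OnPageSelected"
```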
- Scenario 4: rubric no longer mentions 'OnPageSelected' (assertion bans it)
- Scenario 7: rubric no longer mentions 'parent measurement' (was banned)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All review feedback addressed:
- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the resulting artifacts</a> from this PR and let us know in a comment if this change resolves your issue. Thank you!

## Summary

Improves the try-fix skill based on comprehensive evaluation (empirical + prompt analysis) and production data mining from 6 real agent-reviewed PRs.

### Changes

1. **Added tests/eval.yaml** -- 8 scenarios for empirical A/B validation (6 synthetic + 2 production-derived from PRs dotnet#33134, dotnet#32289)
2. **Fixed SKILL.md issues** -- context contradiction, hardcoded filename, iteration limits, error table, activation guard, root-cause warning, platform path warning, code fence rendering
3. **Used native spec features** -- expect_activation: false for negative trigger scenario

### Evaluation Results

- **Isolated improvement**: +51.7% (skill works when activated)
- **Dotnet Validator**: 7/10 KEEP
- **Anthropic Evaluator**: 9/10 KEEP
- **Consensus**: KEEP -- ready to merge

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the resulting artifacts</a> from this PR and let us know in a comment if this change resolves your issue. Thank you!

## Summary

Adds eval.yaml for the `verify-tests-fail-without-fix` skill, enabling empirical A/B validation via skill-validator.

### Context

- This is an internal orchestrator-invoked skill used by `pr-review` to verify tests catch bugs
- Follows eval best practices established during the try-fix evaluation cycle (PR #34807)
- Part of eval coverage expansion tracked in issue #34814

### Eval Design

- **6 scenarios** covering both verification modes, negative trigger, edge cases, regressions
- **0 `output_contains`** -- rubric-based behavioral assertions only (no vocabulary overfitting; see the sketch after this message)
- **14 `output_not_contains`** -- anti-pattern guards for common mistakes
- **1 `expect_activation: false`** -- native spec field for negative trigger
- Realistic timeouts (60s-900s depending on scenario complexity)

### Scenarios

1. **Happy path: full verification** -- Tests two-phase workflow (fail without fix, pass with fix)
2. **Happy path: verify failure only** -- Tests test-creation mode (no fix needed)
3. **Negative trigger** -- Documentation question should not invoke verification
4. **Regression: semantic inversion** -- Tests passing without fix = FAILED verification (not success!)
5. **Edge case: no test files** -- PR without tests can't be verified
6. **Regression: no manual git commands** -- Script handles file revert/restore, not raw git

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
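A hedged sketch of the rubric-only style described above, using the semantic-inversion scenario as the example (all names and wording are invented; only the field semantics come from this thread):

```yaml
- name: semantic-inversion   # hypothetical scenario id
  prompt: |
    You are verifying PR tests. With the fix commit reverted,
    the test suite still passes. Report the verification result.
  timeout: 120
  assertions:
    - type: output_not_contains
      value: "verification succeeded"   # passing without the fix means verification FAILED
  rubric:
    - "The agent reports verification as failed because the tests do not catch the bug"
```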