
Improve try-fix skill: add eval.yaml and fix prompt issues #34807

Merged
PureWeen merged 11 commits into main from try-fix-eval-worker-3-2f16 on Apr 6, 2026
Conversation

@PureWeen PureWeen (Member) commented Apr 3, 2026

Note

Are you waiting for the changes in this PR to be merged?
It would be very helpful if you could test the resulting artifacts from this PR and let us know in a comment if this change resolves your issue. Thank you!

Summary

Improves the try-fix skill based on comprehensive evaluation (empirical + prompt analysis) and production data mining from 6 real agent-reviewed PRs.

Changes

  1. Added tests/eval.yaml -- 8 scenarios for empirical A/B validation (6 synthetic + 2 production-derived from PRs #33134 and #32289)
  2. Fixed SKILL.md issues -- context contradiction, hardcoded filename, iteration limits, error table, activation guard, root-cause warning, platform path warning, code fence rendering
  3. Used native spec features -- expect_activation: false for negative trigger scenario
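To illustrate the native-spec approach, here is a minimal sketch of what the negative-trigger scenario might look like. The field names (`expect_activation`, `prompt`) and the documentation-question prompt style are taken from this thread; the exact scenario schema is illustrative, not the real eval.yaml.

```yaml
# Illustrative sketch -- the real eval.yaml schema may differ.
- name: negative-trigger-docs-question
  # A documentation/architecture question must NOT activate try-fix.
  prompt: >
    How does the MAUI handler architecture map cross-platform
    controls to their native views?
  expect_activation: false  # native spec field, replaces fragile negative assertions
  timeout: 60
```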

Evaluation Results

  • Isolated improvement: +51.7% (skill works when activated)
  • Dotnet Validator: 7/10 KEEP
  • Anthropic Evaluator: 9/10 KEEP
  • Consensus: KEEP -- ready to merge
- Add tests/eval.yaml with 5 scenarios for empirical validation
- Fix contradiction: 'don't search for context' vs 'run git diff'
- Remove hardcoded SafeAreaExtensions.cs from sequential warning
- Clarify 3-iteration vs orchestrator-controlled attempt limits
- Consolidate redundant git checkout warnings

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 3, 2026 21:31
@github-actions github-actions Bot (Contributor) commented Apr 3, 2026

🚀 Dogfood this PR with:

⚠️ WARNING: Do not do this without first carefully reviewing the code of this PR to satisfy yourself it is safe.

```
curl -fsSL https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.sh | bash -s -- 34807
```

Or run remotely in PowerShell:

```
iex "& { $(irm https://raw.githubusercontent.com/dotnet/maui/main/eng/scripts/get-maui-pr.ps1) } 34807"
```
Copilot AI left a comment (Contributor)

Pull request overview

This PR updates the .github/skills/try-fix skill prompt and adds an evaluation YAML to support empirical prompt validation, aiming to reduce prompt contradictions and improve orchestration clarity.

Changes:

  • Added .github/skills/try-fix/tests/eval.yaml with 5 eval scenarios intended to validate key try-fix behaviors and regressions.
  • Updated .github/skills/try-fix/SKILL.md to resolve a “context” principle contradiction, remove issue-specific filenames from warnings, and clarify iteration vs. session attempt limits.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
.github/skills/try-fix/tests/eval.yaml Adds eval scenarios/assertions/rubrics to validate try-fix workflow behavior.
.github/skills/try-fix/SKILL.md Refines try-fix principles and workflow guidance; clarifies limits and consolidates restore guidance.
Comment on lines +23 to +25
- type: output_not_contains
value: "null check on the adapter"
- type: output_contains
Copilot AI commented Apr 3, 2026
In scenario 1, the output_not_contains assertion for "null check on the adapter" is likely to be flaky because a correct response may still restate the existing fix (e.g., "Existing fix is a null check on the adapter; my approach differs"). Consider asserting on what the agent does (e.g., it describes a different approach) rather than banning a phrase that the agent may legitimately repeat for comparison.

Suggested change
- type: output_not_contains
value: "null check on the adapter"
- type: output_contains
- type: output_contains
value: "lifecycle"
- type: output_contains
Comment thread .github/skills/try-fix/tests/eval.yaml Outdated
Note: No iOS simulator is currently available.
assertions:
- type: output_not_contains
value: "Pass"
Copilot AI commented Apr 3, 2026
Scenario 3's output_not_contains: "Pass" is overly broad and will fail if the agent writes something like "I cannot mark this as Pass" while correctly reporting Blocked. Make the negative assertion more specific (e.g., target the structured result line such as "Result: Pass" / "result.txt: Pass") so the scenario doesn't reject correct explanations.

Suggested change
value: "Pass"
value: "Result: Pass"
- type: output_not_contains
value: "result.txt: Pass"
Comment thread .github/skills/try-fix/tests/eval.yaml Outdated
Comment on lines +103 to +108
- type: output_not_contains
value: "OnPageSelected"
- type: output_contains
value: "Different from existing fix"
- type: output_contains
value: "approach"
Copilot AI commented Apr 3, 2026
Scenario 4 asserts output_not_contains: "OnPageSelected", but the prompt and rubric require the agent to acknowledge the prior failed attempt, which very likely involves mentioning OnPageSelected explicitly. This assertion will reject compliant outputs; consider instead asserting that the new approach is different (e.g., it references a different lifecycle hook) without forbidding the prior attempt name from being mentioned.

Suggested change
- type: output_not_contains
value: "OnPageSelected"
- type: output_contains
value: "Different from existing fix"
- type: output_contains
value: "approach"
- type: output_contains
value: "Different from existing fix"
- type: output_contains
value: "approach"
- type: output_contains
value: "before measurement"
Comment thread .github/skills/try-fix/tests/eval.yaml Outdated
Comment on lines +132 to +140
- type: output_not_contains
value: "git checkout HEAD"
- type: output_not_contains
value: "git restore"
- type: output_not_contains
value: "git reset"
rubric:
- "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
- "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
Copilot AI commented Apr 3, 2026
Scenario 5 bans any occurrence of git restore / git reset / git checkout HEAD, but the agent may correctly include those strings in a warning (e.g., "do not use git restore"). That makes the test fragile and may fail even when the workflow is compliant. Prefer positive assertions that the agent uses the baseline/restore script (and/or only forbid the exact command lines it tries to execute).

Suggested change
- type: output_not_contains
value: "git checkout HEAD"
- type: output_not_contains
value: "git restore"
- type: output_not_contains
value: "git reset"
rubric:
- "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
- "The agent does not use 'git checkout HEAD', 'git restore', or 'git reset' to revert fix files"
- type: output_contains
value: "EstablishBrokenBaseline.ps1"
- type: output_contains
value: "RestoreAfterAttempt"
rubric:
- "The agent uses EstablishBrokenBaseline.ps1 to establish the baseline and restore working directory, not raw git commands"
- "The assertions prefer positive evidence of script-based baseline/restore behavior instead of banning generic git command substrings that may appear in warnings"
Comment on lines 320 to 323

🚨 Do NOT use `git checkout HEAD -- .` or `git clean` to restore — use the script.
🚨 Use `EstablishBrokenBaseline.ps1 -Restore` — not `git checkout`, `git restore`, or `git reset` (see Step 2 for why).

Copilot AI commented Apr 3, 2026
Step 8 now says not to use manual git checkout/git restore/git reset and points to Step 2, but later in the same doc the Error Handling table still recommends git checkout HEAD -- . / git clean -fd for unrecoverable git state. This reintroduces the same contradiction the PR is trying to remove. Please align the guidance (e.g., keep EstablishBrokenBaseline as the primary/only restore mechanism, or clearly label any git commands as an absolute last resort with constraints).

…ario

- Fix timeouts for scenarios 1, 4, 5: 120s -> 900s (BuildAndRunHostApp takes 5-15 min)
- Scenario 1: replace weak output_contains:result.txt with rubric item about artifact creation
- Scenario 4: move output_contains:Different-from-existing-fix to rubric item
- Add scenario 6: exhausted-iterations records Fail with analysis

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen (Member, Author) commented Apr 3, 2026

Skill Validator Consensus Report: try-fix

Summary

| Metric | Result |
| --- | --- |
| Eval Generation | 🆕 Generated + improved (6 scenarios) |
| Dotnet Verdict | IMPROVE (4/10) -- empirical skill-validator v1.0.0 run |
| Anthropic Verdict | IMPROVE (7.5/10) -- prompt analysis + live Claude CLI test |
| Consensus | IMPROVE (6/10) |

Eval Quality

  • 6 scenarios cover: happy path, negative trigger, false-Pass regression, prior-attempt avoidance, baseline script usage, and Fail-with-analysis
  • Timeouts fixed from 60-120s → 900s (this was the #1 confound in the dotnet run -- all 5 original scenarios timed out)
  • Fragile output_contains assertions moved to rubric items to reduce overfitting (0.52 → expected to drop significantly)

Points of Agreement (High Confidence)

Both validators independently flagged:

  • Timeouts too short -- all scenarios timed out at 60-120s; now fixed to 900s
  • Overfitting in assertions -- 0.52 overfitting score; vocabulary-matching replaced with outcome-matching rubrics
  • Error handling table contradicts git prohibition -- line ~358 uses git checkout HEAD which Steps 2/8 explicitly prohibit
  • "Must not claim Pass" is strongest constraint -- +12.7% isolated improvement in empirical testing
  • Token count near diminishing returns -- 3,380 BPE with some redundancy

Points of Disagreement

| Topic | Dotnet (4/10) | Anthropic (7.5/10) | Adopted |
| --- | --- | --- | --- |
| Overall score | Weighted by -4.4% empirical score | Weighted by prompt quality | 6/10 -- empirical score confounded by timeouts |
| Skill activation | NOT ACTIVATED in plugin mode (5/5) | Trigger precision 8/10, recall 5/10 | Acceptable -- try-fix is orchestrator-invoked by design |
| Trigger rewrite | Add user-facing triggers | Low recall acceptable for internal skill | Declined -- would cause false positives |

Adopted Suggestions

  1. Fix eval.yaml timeouts (120s → 900s) -- committed
  2. Strengthen assertions to reduce overfitting -- committed
  3. Add Fail scenario (scenario 6) -- committed
  4. 🔲 Fix error handling table contradiction (~line 358) -- git checkout HEAD should be EstablishBrokenBaseline.ps1 -Restore
  5. 🔲 Add negative trigger guard -- skill activated for code review prompts in isolation (-14.5%)

Declined Suggestions

  • User-facing trigger phrases -- skill is explicitly orchestrator-invoked; adding triggers would cause false positives
  • EstablishBrokenBaseline.ps1 existence guard -- script guaranteed to exist in MAUI repo

Final Recommendation

The try-fix skill has sound design (especially the Pass/Fail/Blocked criteria) but eval.yaml timeouts were the primary bottleneck. With improved eval.yaml (900s timeouts, 6 scenarios, reduced overfitting), a re-run should show positive improvement scores. Two remaining SKILL.md fixes (error table contradiction, negative trigger guard) should be applied before merge.

PureWeen and others added 2 commits April 3, 2026 18:53
- Fix error handling table: replace 'git checkout HEAD' with EstablishBrokenBaseline.ps1 -Restore
- Add negative trigger guard to prevent activation for code review prompts

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rompt styles

- Remove all output_contains for skill-specific vocabulary (EstablishBrokenBaseline,
  Blocked, Fail, Approach, analysis) -- replaced with outcome-based rubric items
- Fix negative trigger: change from 'review this PR fix' (ambiguous) to a clear
  documentation/architecture question that is unambiguous non-fix request
- Vary prompt structure: mix formulaic + conversational + orchestrator-style prompts
- Replace 'output_contains: Pass/Fail' with 'output_not_contains' anti-pattern guards
- Keep all output_not_contains for banned anti-patterns (git checkout, Pass claims, etc.)

Overfitting score expected to drop from 0.52 to near 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen and others added 2 commits April 4, 2026 07:30
The Blocked scenario (no iOS simulator available) only needs to test
agent reasoning, not run a full build. 300s is sufficient and aligns
with the Anthropic validator recommendation.

No line-break typo was present in the current file (already clean).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ice constraint

- Remove output_not_contains: 'EstablishBrokenBaseline' from negative trigger
  (non-discriminating: baseline agent would never produce this term)
- Move no-device constraint to first line of Blocked scenario prompt
  so agent sees it before attempting environment setup (fixes 300s timeout)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen and others added 2 commits April 4, 2026 16:46
- Scenario 7: Same root cause, different surface (from PR #33134)
  Tests that agent identifies shared underlying failure mode across
  attempts, not just surface-level code differences
- Scenario 8: Wrong platform code path (from PR #32289)
  Tests that agent verifies iOS uses Legacy navigation before fixing

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Step 3: Warn to verify correct platform code path before implementing
  (e.g., .iOS.cs compiles for both iOS AND MacCatalyst; Legacy vs modern impls)
- Step 4: Warn that 'different' means different ROOT CAUSE, not just code location
  (same null-check in different call site is NOT a different approach)
- Both patterns discovered from real PR failures (#33134, #32289)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@kubaflo kubaflo (Contributor) commented Apr 4, 2026

Multimodal Review — PR #34807

Reviewed by Claude Opus 4.6, GPT-5.4, and GPT-5.3 Codex in parallel.

SKILL.md Changes — All 3 agree: ✅ Good

  • Principle 5 / Step 1 consistency fix — clear win
  • Hardcoded SafeAreaExtensions.cs → generic — good
  • EstablishBrokenBaseline.ps1 consistency — removes contradiction
  • Session limits callout (3 iter / 4-5 invocations) — valuable clarification

eval.yaml — Unanimous Concerns

🔴 Unsatisfiable assertions (all 3 models flagged)

Scenarios 1, 4, 7, 8 ban terms the agent must mention when explaining its reasoning:

| Scenario | Forbidden term | But rubric requires… |
| --- | --- | --- |
| Happy path | OnMeasure | Explain how approach differs from OnMeasure fix |
| Second attempt | OnPageSelected | Explicitly state it's avoiding OnPageSelected |
| Root cause dedup | parent measurement | Identify parent measurement as the common root cause |
| Platform path | MauiNavigationImpl | Verify it's not the active path |

Suggestion: Narrow output_not_contains to action phrases (e.g., "modified OnMeasure", "fix in OnMeasure") or move these to rubric-only evaluation.
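In eval.yaml terms, the narrowing might look like the following sketch (the action-phrase value is quoted from the fix commit in this thread; the rubric wording is illustrative):

```yaml
# Before: bans any mention of the prior fix site -- rejects compliant answers
# - type: output_not_contains
#   value: "OnMeasure"

# After: bans only the action phrase, so the agent may still reference
# OnMeasure when explaining how its own approach differs.
- type: output_not_contains
  value: "I will modify the OnMeasure"
rubric:
  - "The agent proposes an approach with a different root cause than the OnMeasure fix"
```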

🟡 Missing positive assertions (all 3 flagged)

No output_contains anywhere — only negative checks. Recommendations:

  • Scenario 3 → assert Blocked appears
  • Scenario 6 → assert Fail appears
  • Happy path → assert artifact creation (approach.md, result.txt)

🟡 Activation Guard vs Principle 1 (Opus + Codex)

Principle 1 says "Always run — never question whether to run" but the new guard says "this skill should not run" for non-fix prompts. Consider: "Always run once activated."

🟡 Other notes

  • PR description says "5 scenarios" but eval.yaml has 8
  • Missing negative-trigger scenarios for code review / PR summary / test-only requests
  • 900s timeout may be tight for scenarios allowing 3 build/test iterations

Verdicts

| Model | Verdict | Key Concern |
| --- | --- | --- |
| Claude Opus 4.6 | COMMENT | Fragile negative assertions → false failures |
| GPT-5.4 | REQUEST CHANGES | Unsatisfiable evals + iteration model mismatch |
| GPT-5.3 Codex | REQUEST CHANGES | Eval contradictions + missing positive assertions |

Consensus: REQUEST CHANGES — SKILL.md improvements are solid. Eval assertions need narrowing to avoid false negatives in practice.

PureWeen and others added 3 commits April 4, 2026 16:57
- Add expect_activation: false to negative trigger scenario (native spec field)
- Fix Step 9 nested code fence rendering issue in SKILL.md

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Narrow output_not_contains to action phrases (ban modifying, not mentioning)
  - Scenario 1: 'OnMeasure' -> 'I will modify the OnMeasure'
  - Scenario 4: 'OnPageSelected' -> 'I will use OnPageSelected'
  - Scenario 7: Remove 'parent.MeasuredHeight'/'parent measurement', keep 'fallback to parent'
  - Scenario 8: 'MauiNavigationImpl' -> 'I will modify MauiNavigationImpl'
- Add missing positive assertions:
  - Scenario 1: output_contains 'approach'
  - Scenario 3: output_contains 'Blocked'
  - Scenario 6: output_contains 'Fail'
- Fix Principle 1: 'Always run' -> 'Always run once activated'

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Scenario 4: rubric no longer mentions 'OnPageSelected' (assertion bans it)
- Scenario 7: rubric no longer mentions 'parent measurement' (was banned)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen (Member, Author) commented Apr 4, 2026

All review feedback addressed:

  • Unsatisfiable assertions fixed: Narrowed output_not_contains to action phrases (e.g., I will modify the OnMeasure instead of OnMeasure) so agents can reference approaches while being banned from repeating them
  • Missing positive assertions added: output_contains: "Blocked" (scenario 3), output_contains: "Fail" (scenario 6), output_contains: "approach" (scenario 1)
  • Principle 1 fixed: "Always run" → "Always run once activated" (no conflict with expect_activation: false)
  • PR description updated: Reflects 8 scenarios and final evaluation results
  • Rubric/assertion consistency verified: All 8 scenarios checked -- assertions ban actions, rubrics test outcomes, no conflicts remain
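Putting those fixes together, a scenario's assertion block now follows a consistent pattern. A sketch using values quoted earlier in this thread (the pairing of these particular values in one scenario is illustrative):

```yaml
# Sketch of the final assertion style: narrowed negatives + positive outcomes.
assertions:
  - type: output_not_contains
    value: "Result: Pass"      # narrowed from the overly broad "Pass"
  - type: output_contains
    value: "Blocked"           # positive assertion added per review
rubric:
  - "Assertions ban actions and anti-patterns; rubric items test outcomes"
```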
PureWeen added a commit that referenced this pull request Apr 4, 2026
- 6 scenarios covering both verification modes, negative trigger, edge cases
- Rubric-based behavioral assertions (0 output_contains, no vocabulary overfitting)
- Tests the critical 'pass without fix = FAILED verification' semantic inversion
- Production-aware prompt design with varied structure
- Follows eval best practices from try-fix evaluation cycle (PR #34807)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit da3b0a0 into main Apr 6, 2026
4 of 5 checks passed
@PureWeen PureWeen deleted the try-fix-eval-worker-3-2f16 branch April 6, 2026 14:46
Dhivya-SF4094 pushed a commit to Dhivya-SF4094/maui that referenced this pull request Apr 7, 2026
PureWeen added a commit that referenced this pull request Apr 7, 2026
> [!NOTE]
> Are you waiting for the changes in this PR to be merged?
> It would be very helpful if you could <a
href="https://github.com/dotnet/maui/wiki/Testing-PR-Builds">test the
resulting artifacts</a> from this PR and let us know in a comment if
this change resolves your issue. Thank you!

## Summary

Adds eval.yaml for the `verify-tests-fail-without-fix` skill, enabling
empirical A/B validation via skill-validator.

### Context
- This is an internal orchestrator-invoked skill used by `pr-review` to
verify tests catch bugs
- Follows eval best practices established during the try-fix evaluation
cycle (PR #34807)
- Part of eval coverage expansion tracked in issue #34814

### Eval Design
- **6 scenarios** covering both verification modes, negative trigger,
edge cases, regressions
- **0 `output_contains`** -- rubric-based behavioral assertions only (no
vocabulary overfitting)
- **14 `output_not_contains`** -- anti-pattern guards for common
mistakes
- **1 `expect_activation: false`** -- native spec field for negative
trigger
- Realistic timeouts (60s-900s depending on scenario complexity)

### Scenarios
1. **Happy path: full verification** -- Tests two-phase workflow (fail
without fix, pass with fix)
2. **Happy path: verify failure only** -- Tests test-creation mode (no
fix needed)
3. **Negative trigger** -- Documentation question should not invoke
verification
4. **Regression: semantic inversion** -- Tests passing without fix =
FAILED verification (not success!)
5. **Edge case: no test files** -- PR without tests can't be verified
6. **Regression: no manual git commands** -- Script handles file
revert/restore, not raw git

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen added this to the .NET 10 SR6 milestone Apr 29, 2026
