Fix issues with chat perf pipeline by pwang347 · Pull Request #323917 · microsoft/vscode

pwang347 · 2026-07-01T18:31:10Z

No description provided.

…or leak check The chat-perf workflow flaked daily. Two harness issues: - timeToFirstToken measured the first request in a freshly-launched window, whose instruction-file discovery (willCollectInstructions -> didCollectInstructions) is bimodal (~250ms warm vs ~900ms cold). This made TTFT bimodal and flagged random scenarios each run. Fix: send a throwaway warm-up request and reset to a clean session before measuring, and capture the before-state after warm-up so its cost is excluded from measured deltas. - The leak check verdict used cumulative (final - baseline) residual, which includes one-time first-iteration warm-up (~10MB, ~= threshold). Fix: base the verdict on steady-state residual measured from the end of iteration 1, excluding warm-up.

The setup build job repeatedly hung at 'npm ci' (zero output for ~28min) until the 30min job timeout cancelled it, blocking the whole workflow. Other workflows survive this: pr-node-modules.yml wraps npm ci in a 5-attempt retry loop. Mirror that here, and additionally wrap each attempt in 'timeout 900' so a hung install is killed and retried rather than consuming the entire job timeout. Bump the setup job timeout-minutes 30 -> 50 to leave headroom for retries.

… avoid postinstall hang

…ert warm-up Measured root cause (via trace analysis of slow vs fast runs): every slow run's timeToFirstToken is a single ~550ms MajorGC that lands in the measured window, of which ~97% is V8.GC_OBJECT_DUMP_STATISTICS. That phase only runs when the 'disabled-by-default-v8.gc_stats' trace category is enabled — which the harness enabled — inflating a normal ~15ms GC to ~550ms. Whether such a GC lands in the request window is probabilistic, producing the bimodal ~250ms/~900ms TTFT. It was a measurement artifact, not a product regression. Fix: drop the gc_stats trace category (keep v8.gc + disabled-by-default-v8.gc, which still emit the MajorGC/MarkCompactor events we count). Also revert the earlier warm-up change, which was based on a wrong (cold-instruction-index) hypothesis and introduced layout-count asymmetry between baseline and test.

Copilot

Pull request overview

This PR improves the reliability of the chat performance CI pipeline by reducing cold-start noise in perf measurements, making leak detection less sensitive to one-time warm-up allocations, and hardening the GitHub Actions workflow against intermittent npm ci/binary-download issues.

Changes:

Add a warm-up chat request (then reset to a fresh session) before capturing “before” metrics in the perf regression harness to reduce first-request bimodality.
Update the memory leak harness to compute/report a steady-state residual (excluding the first iteration) and use it for pass/fail decisions and CI summaries.
Make the chat-perf workflow more robust by skipping binary downloads during npm ci, adding retries with timeouts, and increasing the setup job timeout; update config comments to reflect the new leak threshold semantics.

Show a summary per file

File	Description
scripts/chat-simulation/test-chat-perf-regression.js	Adds a warm-up request and moves baseline metric capture/profiling start to after warm-up to reduce perf flakiness.
scripts/chat-simulation/test-chat-mem-leaks.js	Switches leak verdicting to steady-state residual growth (excluding iteration 1) and updates logs/CI summary accordingly.
scripts/chat-simulation/config.jsonc	Updates configuration comments to clarify threshold semantics (steady-state, excluding warm-up).
.github/workflows/chat-perf.yml	Improves CI resilience via skip-download env vars, `npm ci` retry loops with timeouts, and longer setup timeout.

Review details

Files reviewed: 4/4 changed files
Comments generated: 2
Review effort level: Low

…tional Preserve the V8 gc_stats deep-dive capability behind an explicit --gc-object-stats flag (off by default) so it's available for deliberate GC investigations but never corrupts timing benchmarks. Also clarify in the skill that --no-baseline runs are first-class: a baseline is only for comparison, not required to just measure a build.

Measured: giant-codeblock on main shows +28% layoutCount / +22% recalcStyleCount (stable, p=0) but -9% timeToComplete, -13% TTFT, and -19% long-animation-frame time — i.e. more but cheaper layout ops, and faster+less-janky overall. These two metrics are inflated by CSS animations (compositor-driven, cheap) and are already documented as informational in the skill, yet the exit-code gate and the CI summary still failed on them. Align both with the documented design: move layoutCount / recalcStyleCount to informational so real cost is judged by timeToComplete / layoutDurationMs / LoAF, not raw op counts.

…unts Follow-up to making layoutCount/recalcStyleCount informational: those counts were the only layout-related gate, so dropping them left a coverage gap. Gate on layoutDurationMs instead (already captured, previously ungated) — it's the actual layout cost. Surface it in both summaries; keep the counts informational. Verified safe across all 16 scenarios on the green run: layoutDurationMs is LOWER on main than 1.122.0 everywhere (-10% to -26%), none exceeds the 20% gate. This also confirms the counts were misleading (main does more but cheaper layouts: giant-codeblock +28% layoutCount yet -14% layoutDurationMs).

The startup banner still printed 'threshold NMB total' while the verdict now enforces steady-state residual (excl. first warm-up iteration). Align the wording to avoid confusion in CI logs.

github-actions · 2026-07-01T21:20:17Z

Base: d70651f2 Current: 108b7d4b

~~No screenshot changes.~~

Add a 'CI runs & pinpointing regressions' section: where the daily chat-perf runs live, how to list/download them (gh), the per-run artifacts and their retention (perf-summary results.json is 1-day; chat-perf-summary/perf-results are 30-day), and a bisection workflow — compare the fixed-baseline test median across dates to find when main regressed or went bimodal/flaky, then trace the slow window (busy-vs-idle between chat marks) to find the cause (as with the gc_stats GC artifact).

pwang347 added 3 commits June 30, 2026 10:47

chat-perf: skip electron/playwright binary downloads during npm ci to…

9aa34d8

… avoid postinstall hang

Copilot AI review requested due to automatic review settings July 1, 2026 18:31

Copilot started reviewing on behalf of pwang347 July 1, 2026 18:32 View session

Copilot AI reviewed Jul 1, 2026

View reviewed changes

Comment thread scripts/chat-simulation/test-chat-perf-regression.js Outdated

Comment thread scripts/chat-simulation/test-chat-mem-leaks.js

pwang347 added 2 commits July 1, 2026 12:06

vs-code-engineering Bot assigned pwang347 Jul 1, 2026

pwang347 added 2 commits July 1, 2026 13:44

chat-perf: leak banner says steady-state, not total (PR review)

be69a9c

The startup banner still printed 'threshold NMB total' while the verdict now enforces steady-state residual (excl. first warm-up iteration). Align the wording to avoid confusion in CI logs.

pwang347 marked this pull request as ready for review July 1, 2026 21:21

pwang347 enabled auto-merge (squash) July 1, 2026 21:22

rzhao271 approved these changes Jul 1, 2026

View reviewed changes

pwang347 merged commit 90b81ef into main Jul 1, 2026
46 of 47 checks passed

pwang347 deleted the chat-perf-flake-fix branch July 1, 2026 22:10

vs-code-engineering Bot added this to the 1.128.0 milestone Jul 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix issues with chat perf pipeline#323917

Fix issues with chat perf pipeline#323917
pwang347 merged 9 commits into
mainfrom
chat-perf-flake-fix

pwang347 commented Jul 1, 2026

Copilot AI left a comment

Uh oh!

Uh oh!

github-actions Bot commented Jul 1, 2026 •

edited

Loading

Uh oh!

Labels

3 participants

Uh oh!

Conversation

pwang347 commented Jul 1, 2026

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Review details

Uh oh!

Uh oh!

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Labels

3 participants

github-actions Bot commented Jul 1, 2026 •

edited

Loading