Skip to content

Fix issues with chat perf pipeline#323917

Merged
pwang347 merged 9 commits into
mainfrom
chat-perf-flake-fix
Jul 1, 2026
Merged

Fix issues with chat perf pipeline#323917
pwang347 merged 9 commits into
mainfrom
chat-perf-flake-fix

Conversation

@pwang347

@pwang347 pwang347 commented Jul 1, 2026

Copy link
Copy Markdown
Member

No description provided.

pwang347 added 3 commits June 30, 2026 10:47
…or leak check

The chat-perf workflow flaked daily. Two harness issues:

- timeToFirstToken measured the first request in a freshly-launched window,
  whose instruction-file discovery (willCollectInstructions ->
  didCollectInstructions) is bimodal (~250ms warm vs ~900ms cold). This made
  TTFT bimodal and flagged random scenarios each run. Fix: send a throwaway
  warm-up request and reset to a clean session before measuring, and capture
  the before-state after warm-up so its cost is excluded from measured deltas.

- The leak check verdict used cumulative (final - baseline) residual, which
  includes one-time first-iteration warm-up (~10MB, ~= threshold). Fix: base
  the verdict on steady-state residual measured from the end of iteration 1,
  excluding warm-up.
The setup build job repeatedly hung at 'npm ci' (zero output for ~28min) until
the 30min job timeout cancelled it, blocking the whole workflow. Other workflows
survive this: pr-node-modules.yml wraps npm ci in a 5-attempt retry loop. Mirror
that here, and additionally wrap each attempt in 'timeout 900' so a hung install
is killed and retried rather than consuming the entire job timeout. Bump the
setup job timeout-minutes 30 -> 50 to leave headroom for retries.
Copilot AI review requested due to automatic review settings July 1, 2026 18:31
…ert warm-up

Measured root cause (via trace analysis of slow vs fast runs): every slow run's
timeToFirstToken is a single ~550ms MajorGC that lands in the measured window,
of which ~97% is V8.GC_OBJECT_DUMP_STATISTICS. That phase only runs when the
'disabled-by-default-v8.gc_stats' trace category is enabled — which the harness
enabled — inflating a normal ~15ms GC to ~550ms. Whether such a GC lands in the
request window is probabilistic, producing the bimodal ~250ms/~900ms TTFT. It was
a measurement artifact, not a product regression.

Fix: drop the gc_stats trace category (keep v8.gc + disabled-by-default-v8.gc,
which still emit the MajorGC/MarkCompactor events we count). Also revert the
earlier warm-up change, which was based on a wrong (cold-instruction-index)
hypothesis and introduced layout-count asymmetry between baseline and test.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the reliability of the chat performance CI pipeline by reducing cold-start noise in perf measurements, making leak detection less sensitive to one-time warm-up allocations, and hardening the GitHub Actions workflow against intermittent npm ci/binary-download issues.

Changes:

  • Add a warm-up chat request (then reset to a fresh session) before capturing “before” metrics in the perf regression harness to reduce first-request bimodality.
  • Update the memory leak harness to compute/report a steady-state residual (excluding the first iteration) and use it for pass/fail decisions and CI summaries.
  • Make the chat-perf workflow more robust by skipping binary downloads during npm ci, adding retries with timeouts, and increasing the setup job timeout; update config comments to reflect the new leak threshold semantics.
Show a summary per file
File Description
scripts/chat-simulation/test-chat-perf-regression.js Adds a warm-up request and moves baseline metric capture/profiling start to after warm-up to reduce perf flakiness.
scripts/chat-simulation/test-chat-mem-leaks.js Switches leak verdicting to steady-state residual growth (excluding iteration 1) and updates logs/CI summary accordingly.
scripts/chat-simulation/config.jsonc Updates configuration comments to clarify threshold semantics (steady-state, excluding warm-up).
.github/workflows/chat-perf.yml Improves CI resilience via skip-download env vars, npm ci retry loops with timeouts, and longer setup timeout.

Review details

  • Files reviewed: 4/4 changed files
  • Comments generated: 2
  • Review effort level: Low
Comment thread scripts/chat-simulation/test-chat-perf-regression.js Outdated
Comment thread scripts/chat-simulation/test-chat-mem-leaks.js
pwang347 added 2 commits July 1, 2026 12:06
…tional

Preserve the V8 gc_stats deep-dive capability behind an explicit --gc-object-stats
flag (off by default) so it's available for deliberate GC investigations but never
corrupts timing benchmarks. Also clarify in the skill that --no-baseline runs are
first-class: a baseline is only for comparison, not required to just measure a build.
Measured: giant-codeblock on main shows +28% layoutCount / +22% recalcStyleCount
(stable, p=0) but -9% timeToComplete, -13% TTFT, and -19% long-animation-frame
time — i.e. more but cheaper layout ops, and faster+less-janky overall. These two
metrics are inflated by CSS animations (compositor-driven, cheap) and are already
documented as informational in the skill, yet the exit-code gate and the CI summary
still failed on them. Align both with the documented design: move layoutCount /
recalcStyleCount to informational so real cost is judged by timeToComplete /
layoutDurationMs / LoAF, not raw op counts.
pwang347 added 2 commits July 1, 2026 13:44
…unts

Follow-up to making layoutCount/recalcStyleCount informational: those counts were
the only layout-related gate, so dropping them left a coverage gap. Gate on
layoutDurationMs instead (already captured, previously ungated) — it's the actual
layout cost. Surface it in both summaries; keep the counts informational.

Verified safe across all 16 scenarios on the green run: layoutDurationMs is LOWER
on main than 1.122.0 everywhere (-10% to -26%), none exceeds the 20% gate. This
also confirms the counts were misleading (main does more but cheaper layouts:
giant-codeblock +28% layoutCount yet -14% layoutDurationMs).
The startup banner still printed 'threshold NMB total' while the verdict now
enforces steady-state residual (excl. first warm-up iteration). Align the wording
to avoid confusion in CI logs.
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Base: d70651f2 Current: 108b7d4b

No screenshot changes.

@pwang347 pwang347 marked this pull request as ready for review July 1, 2026 21:21
@pwang347 pwang347 enabled auto-merge (squash) July 1, 2026 21:22
Add a 'CI runs & pinpointing regressions' section: where the daily chat-perf runs
live, how to list/download them (gh), the per-run artifacts and their retention
(perf-summary results.json is 1-day; chat-perf-summary/perf-results are 30-day),
and a bisection workflow — compare the fixed-baseline test median across dates to
find when main regressed or went bimodal/flaky, then trace the slow window
(busy-vs-idle between chat marks) to find the cause (as with the gc_stats GC
artifact).
@pwang347 pwang347 merged commit 90b81ef into main Jul 1, 2026
46 of 47 checks passed
@pwang347 pwang347 deleted the chat-perf-flake-fix branch July 1, 2026 22:10
@vs-code-engineering vs-code-engineering Bot added this to the 1.128.0 milestone Jul 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants