Fix issues with chat perf pipeline#323917
Merged
Merged
Conversation
…or leak check The chat-perf workflow flaked daily. Two harness issues: - timeToFirstToken measured the first request in a freshly-launched window, whose instruction-file discovery (willCollectInstructions -> didCollectInstructions) is bimodal (~250ms warm vs ~900ms cold). This made TTFT bimodal and flagged random scenarios each run. Fix: send a throwaway warm-up request and reset to a clean session before measuring, and capture the before-state after warm-up so its cost is excluded from measured deltas. - The leak check verdict used cumulative (final - baseline) residual, which includes one-time first-iteration warm-up (~10MB, ~= threshold). Fix: base the verdict on steady-state residual measured from the end of iteration 1, excluding warm-up.
The setup build job repeatedly hung at 'npm ci' (zero output for ~28min) until the 30min job timeout cancelled it, blocking the whole workflow. Other workflows survive this: pr-node-modules.yml wraps npm ci in a 5-attempt retry loop. Mirror that here, and additionally wrap each attempt in 'timeout 900' so a hung install is killed and retried rather than consuming the entire job timeout. Bump the setup job timeout-minutes 30 -> 50 to leave headroom for retries.
… avoid postinstall hang
…ert warm-up Measured root cause (via trace analysis of slow vs fast runs): every slow run's timeToFirstToken is a single ~550ms MajorGC that lands in the measured window, of which ~97% is V8.GC_OBJECT_DUMP_STATISTICS. That phase only runs when the 'disabled-by-default-v8.gc_stats' trace category is enabled — which the harness enabled — inflating a normal ~15ms GC to ~550ms. Whether such a GC lands in the request window is probabilistic, producing the bimodal ~250ms/~900ms TTFT. It was a measurement artifact, not a product regression. Fix: drop the gc_stats trace category (keep v8.gc + disabled-by-default-v8.gc, which still emit the MajorGC/MarkCompactor events we count). Also revert the earlier warm-up change, which was based on a wrong (cold-instruction-index) hypothesis and introduced layout-count asymmetry between baseline and test.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the reliability of the chat performance CI pipeline by reducing cold-start noise in perf measurements, making leak detection less sensitive to one-time warm-up allocations, and hardening the GitHub Actions workflow against intermittent npm ci/binary-download issues.
Changes:
- Add a warm-up chat request (then reset to a fresh session) before capturing “before” metrics in the perf regression harness to reduce first-request bimodality.
- Update the memory leak harness to compute/report a steady-state residual (excluding the first iteration) and use it for pass/fail decisions and CI summaries.
- Make the chat-perf workflow more robust by skipping binary downloads during
npm ci, adding retries with timeouts, and increasing the setup job timeout; update config comments to reflect the new leak threshold semantics.
Show a summary per file
| File | Description |
|---|---|
| scripts/chat-simulation/test-chat-perf-regression.js | Adds a warm-up request and moves baseline metric capture/profiling start to after warm-up to reduce perf flakiness. |
| scripts/chat-simulation/test-chat-mem-leaks.js | Switches leak verdicting to steady-state residual growth (excluding iteration 1) and updates logs/CI summary accordingly. |
| scripts/chat-simulation/config.jsonc | Updates configuration comments to clarify threshold semantics (steady-state, excluding warm-up). |
| .github/workflows/chat-perf.yml | Improves CI resilience via skip-download env vars, npm ci retry loops with timeouts, and longer setup timeout. |
Review details
- Files reviewed: 4/4 changed files
- Comments generated: 2
- Review effort level: Low
…tional Preserve the V8 gc_stats deep-dive capability behind an explicit --gc-object-stats flag (off by default) so it's available for deliberate GC investigations but never corrupts timing benchmarks. Also clarify in the skill that --no-baseline runs are first-class: a baseline is only for comparison, not required to just measure a build.
Measured: giant-codeblock on main shows +28% layoutCount / +22% recalcStyleCount (stable, p=0) but -9% timeToComplete, -13% TTFT, and -19% long-animation-frame time — i.e. more but cheaper layout ops, and faster+less-janky overall. These two metrics are inflated by CSS animations (compositor-driven, cheap) and are already documented as informational in the skill, yet the exit-code gate and the CI summary still failed on them. Align both with the documented design: move layoutCount / recalcStyleCount to informational so real cost is judged by timeToComplete / layoutDurationMs / LoAF, not raw op counts.
…unts Follow-up to making layoutCount/recalcStyleCount informational: those counts were the only layout-related gate, so dropping them left a coverage gap. Gate on layoutDurationMs instead (already captured, previously ungated) — it's the actual layout cost. Surface it in both summaries; keep the counts informational. Verified safe across all 16 scenarios on the green run: layoutDurationMs is LOWER on main than 1.122.0 everywhere (-10% to -26%), none exceeds the 20% gate. This also confirms the counts were misleading (main does more but cheaper layouts: giant-codeblock +28% layoutCount yet -14% layoutDurationMs).
The startup banner still printed 'threshold NMB total' while the verdict now enforces steady-state residual (excl. first warm-up iteration). Align the wording to avoid confusion in CI logs.
Contributor
|
Base:
|
Add a 'CI runs & pinpointing regressions' section: where the daily chat-perf runs live, how to list/download them (gh), the per-run artifacts and their retention (perf-summary results.json is 1-day; chat-perf-summary/perf-results are 30-day), and a bisection workflow — compare the fixed-baseline test median across dates to find when main regressed or went bimodal/flaky, then trace the slow window (busy-vs-idle between chat marks) to find the cause (as with the gc_stats GC artifact).
rzhao271
approved these changes
Jul 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.