render: fix progressive A/V (lip-sync) drift across multi-segment concats #62

Open: wdeynes wants to merge 1 commit into browser-use:main from wdeynes:fix/concat-av-drift

Conversation

@wdeynes wdeynes commented Jun 10, 2026


Summary

Long EDLs render with progressively worsening lip-sync: the audio runs ahead of the video, and the offset grows toward the end of the timeline. On a real 37-segment / 103 s edit, the audio was about 570 ms early (a measured lag of −570 ms) by the last segment, which is blatantly visible.

Root cause

extract_segment() writes each segment with -t <duration> at CFR -r 24 + AAC audio:

  • the video stream rounds up to a whole frame (41.7 ms steps at 24 fps)
  • the audio stream keeps the raw -t length (quantized only by AAC's 21.3 ms frame size)

so every segment's audio ends up ~17–40 ms shorter than its video. concat_segments() then concatenates with -c copy, and the concat demuxer packs each stream back-to-back independently — so the mismatch accumulates: segment N's audio plays roughly N × 17 ms before its video.
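The accumulation can be sketched with a small model. This is an illustration rather than the code in `helpers/render.py`: it assumes the video of each segment rounds up to a whole frame at 24 fps and, as a simplification, that the audio keeps exactly the requested duration (AAC frame quantization is ignored). The function names and the example durations are hypothetical.

```python
from fractions import Fraction
import math

OUTPUT_FPS = 24  # matches the CFR rate named above


def video_duration(requested: str) -> Fraction:
    """Video length after rounding the requested duration up to a whole frame."""
    frames = math.ceil(Fraction(requested) * OUTPUT_FPS)
    return Fraction(frames, OUTPUT_FPS)


def cumulative_audio_lead(durations: list[str]) -> list[Fraction]:
    """Audio lead (in seconds) at the start of each segment after a back-to-back concat.

    Simplifying assumption: each audio stream is exactly as long as requested.
    """
    lead = Fraction(0)
    leads = []
    for d in durations:
        leads.append(lead)
        # Each segment's video outlasts its audio, so later audio starts earlier.
        lead += video_duration(d) - Fraction(d)
    return leads


if __name__ == "__main__":
    for i, lead in enumerate(cumulative_audio_lead(["2.80", "3.10", "1.95"])):
        print(f"segment {i}: audio leads by {float(lead) * 1000:.1f} ms")
```

Exact fractions avoid floating-point noise in the ceiling step; the per-segment shortfall in this model ranges from 0 to just under one frame (41.7 ms).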

Measured on a 37-segment EDL (per-segment ffprobe stream durations):

|        | video sum | audio sum | cumulative |
|--------|-----------|-----------|------------|
| before | 103.792 s | 103.170 s | −0.622 s   |

Cross-correlating the output audio against the source audio at each segment's video-timeline position (numpy, 16 kHz mono, ±0.8 s search window) confirms the drift is progressive and audible:

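| segment | output pos | lag before | lag after |
|---------|------------|------------|-----------|
| 0       | 0.0 s      | −21 ms     | 0.0 ms    |
| 8       | 27 s       | −116 ms    | −0.1 ms   |
| 19      | 57 s       | −318 ms    | −0.1 ms   |
| 29      | 84 s       | −423 ms    | 0.0 ms    |
| 35      | 101 s      | −569 ms    | 0.0 ms    |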

(correlation confidence 0.90–1.00 at every checkpoint)
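A lag measurement of this kind could look roughly like the sketch below. It is not the script used for the numbers above: the function name `estimate_lag`, the brute-force search, and the normalized-correlation score are assumptions chosen for clarity. A negative result means the output audio is early relative to the reference.

```python
import numpy as np


def estimate_lag(reference, output, sample_rate, max_lag_s=0.8):
    """Return (lag_seconds, score) where output[n] best matches reference[n - lag].

    Both inputs are mono sample arrays at the same rate (e.g. 16 kHz).
    The score is the normalized correlation at the best lag, in [-1, 1].
    """
    ref = np.asarray(reference, dtype=np.float64)
    out = np.asarray(output, dtype=np.float64)
    n = min(len(ref), len(out))
    ref = ref[:n] - ref[:n].mean()
    out = out[:n] - out[:n].mean()
    max_lag = min(int(round(max_lag_s * sample_rate)), n - 1)

    best_lag, best_score = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = out[lag:], ref[:n - lag]   # output delayed by `lag` samples
        else:
            a, b = out[:n + lag], ref[-lag:]  # output early by `-lag` samples
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0.0:
            continue
        score = float(np.dot(a, b) / denom)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate, best_score
```

The brute-force loop is adequate for short windows; an FFT-based correlation would be the usual choice for long ones.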

Fix

  1. Quantize each segment to whole output frames: n_frames = round(duration × OUTPUT_FPS), vdur = n_frames / OUTPUT_FPS; cap video with -frames:v (the -t now overshoots by 0.5 s purely to give the audio filters enough input).
  2. Force audio to exactly vdur with atrim=end=vdur,apad=whole_dur=vdur (the 30 ms fades are unchanged, now timed against vdur).
  3. Use sample-exact PCM (pcm_s16le) in .mov intermediates instead of AAC mp4 segments — PCM stream durations are sample-accurate, with no encoder priming or frame rounding to survive the concat demuxer.
  4. Encode AAC once at the final composite: build_final_composite()'s early-return and filter paths now use -c:a aac -b:a 192k instead of -c copy/-c:a copy. Final deliverables are unchanged (.mp4, h264 + AAC, +faststart).
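Steps 1 to 3 can be sketched as an ffmpeg argument builder. This is a minimal illustration, not the PR's `extract_segment()`: the helper names, the order of the fade and trim filters, and the argument layout are assumptions, and video encoder settings are omitted. Only standard ffmpeg options and filters (`-frames:v`, `afade`, `atrim`, `apad`, `pcm_s16le`) are used.

```python
OUTPUT_FPS = 24
AUDIO_PAD_S = 0.5  # -t overshoot so the audio filters have enough input
FADE_S = 0.03      # 30 ms fades


def segment_plan(duration_s: float) -> tuple[int, float]:
    """Quantize a requested duration to whole output frames."""
    n_frames = round(duration_s * OUTPUT_FPS)
    vdur = n_frames / OUTPUT_FPS
    return n_frames, vdur


def segment_args(src: str, start_s: float, duration_s: float, out_path: str) -> list[str]:
    """Build a hypothetical ffmpeg command line for one .mov intermediate."""
    n_frames, vdur = segment_plan(duration_s)
    fade_out_start = max(vdur - FADE_S, 0.0)
    audio_filter = ",".join([
        f"afade=t=in:st=0:d={FADE_S}",
        f"afade=t=out:st={fade_out_start:.6f}:d={FADE_S}",
        f"atrim=end={vdur:.6f}",        # cut audio that runs past vdur
        f"apad=whole_dur={vdur:.6f}",   # pad audio that falls short of vdur
    ])
    return [
        "ffmpeg", "-y", "-ss", f"{start_s:.6f}", "-i", src,
        "-t", f"{duration_s + AUDIO_PAD_S:.6f}",
        "-r", str(OUTPUT_FPS), "-frames:v", str(n_frames),  # cap video by frame count
        "-af", audio_filter,
        "-c:a", "pcm_s16le",  # sample-exact PCM in the .mov intermediate
        out_path,
    ]
```

For a requested duration of 2.8 s this yields 67 frames and a `vdur` of about 2.7917 s, so both streams are trimmed to the same length rather than the video rounding one way and the audio another.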

After

  • all 37 segments: |audio − video| = 0.0 ms, cumulative diff 0.0000 s
  • cross-correlation lag 0.0 ms (±0.1 ms) at every checkpoint across the 103 s timeline
  • container duration now matches the EDL sum exactly (was +0.6 s)

Notes

  • Intermediates are renamed clips_*/seg_NN_<src>.mov and base*.mov (PCM-in-mp4 is poorly supported; final outputs are still mp4). PCM audio costs ~11.5 MB/min of intermediate disk — negligible next to the video data.
  • Behavior when a range overruns the source EOF is unchanged: apad fills audio to vdur; video may still come up short, as before.

🤖 Generated with Claude Code


Summary by cubic

Fixes progressive lip-sync drift across multi-segment renders. Audio now stays aligned with video across the entire timeline (measured −570 ms -> 0 ms on a 103 s/37-segment edit).

  • Bug Fixes

    • Quantize each segment to whole frames at OUTPUT_FPS=24 and cap video with -frames:v.
    • Force audio to match vdur exactly using atrim + apad (30 ms fades now timed to vdur).
    • Switch intermediates to sample-exact PCM .mov for safe -c copy concat; encode AAC only once in the final composite.
    • Outcome: 0.0 ms drift across the full edit; container duration matches the EDL sum.
  • Migration

    • Intermediates are now .mov: clips_*/seg_*.mov and base*.mov. Update any scripts that referenced .mp4.
    • Final deliverables remain .mp4 (H.264 + AAC, +faststart).

Written for commit f7206d8. Summary will update on new commits.


Per-segment video rounds up to whole 24fps frames while AAC audio keeps
the raw -t duration (~17-40ms shorter per segment). The -c copy concat
packs each stream back-to-back independently, so the mismatch
accumulates into progressive audio-early drift — measured -570ms over a
37-segment, 103s timeline via cross-correlation of output vs source
audio.

Quantize each segment to whole output frames (-frames:v, vdur=n/fps),
force the audio to exactly vdur (atrim + apad), and write sample-exact
PCM .mov intermediates, encoding AAC once at the final composite. After
the fix every segment has |a-v| = 0ms and output-vs-source
cross-correlation shows 0.0ms lag at every checkpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 1 file

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/render.py">

<violation number="1" location="helpers/render.py:531">
P2: Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</violation>
</file>


Comment thread helpers/render.py
-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",

@cubic-dev-ai cubic-dev-ai Bot Jun 10, 2026


P2: Double AAC encoding in loudnorm path: build_final_composite() encodes PCM → AAC for the prenorm intermediate, then apply_loudnorm_two_pass() re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.
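One possible way to avoid the second lossy pass, sketched under assumptions: if the caller knows a loudnorm pass will follow, the prenorm intermediate could stay PCM so that the loudnorm pass performs the only AAC encode. The helper names below are hypothetical and do not exist in `helpers/render.py`; how `apply_loudnorm_two_pass()` is actually wired in is not shown in this thread.

```python
def prenorm_audio_args(will_loudnorm: bool) -> list[str]:
    """Audio codec arguments for the composite step (hypothetical helper)."""
    if will_loudnorm:
        # Stay lossless; the later loudnorm pass does the single AAC encode.
        return ["-c:a", "pcm_s16le"]
    # No later pass: this is the final deliverable, so encode AAC here.
    return ["-c:a", "aac", "-b:a", "192k", "-ar", "48000"]


def prenorm_suffix(will_loudnorm: bool) -> str:
    """Container for the composite output; PCM needs .mov rather than .mp4."""
    return ".mov" if will_loudnorm else ".mp4"
```

The container switch follows from the PR's own note that PCM-in-mp4 is poorly supported.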

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/render.py, line 531:

<comment>Double AAC encoding in loudnorm path: `build_final_composite()` encodes PCM → AAC for the prenorm intermediate, then `apply_loudnorm_two_pass()` re-encodes AAC → AAC. This contradicts the PR goal of a single final AAC encode and wastes CPU while degrading audio quality.</comment>

<file context>
@@ -508,8 +526,10 @@ def build_final_composite(
-        run(["ffmpeg", "-y", "-i", str(base_path), "-c", "copy", str(out_path)], quiet=True)
+        # No filters — copy video, encode the PCM intermediate audio to AAC for mp4
+        run(["ffmpeg", "-y", "-i", str(base_path), "-c:v", "copy",
+             "-c:a", "aac", "-b:a", "192k", "-ar", "48000",
+             "-movflags", "+faststart", str(out_path)], quiet=True)
         return
</file context>
