feat: SRT-driven edit pipeline + edit-plan recommender #41

Open
xiaogang-sudo wants to merge 3 commits into browser-use:main from xiaogang-sudo:feat/srt-driven-edit

@xiaogang-sudo xiaogang-sudo commented May 19, 2026

Summary

Adds an independent SRT-driven editing pipeline plus a lexical recommender that bridges Scribe transcripts to it. All existing helpers (render.py, grade.py, transcribe.py, etc.) are untouched.

  • helpers/srt_driven_edit.py — full extract → gap → concat → final-compose pipeline:

    • safe-ASCII temp work dir so CJK / quoted user paths never reach libavfilter
    • SRT encoding fallback (utf-8-sig / utf-8 / gb18030 / cp936 / cp1252) and cue-settings tolerance (position:90% etc.)
    • ffmpeg + ffprobe preflight; per-source ffprobe with auto-degrade when source has no audio
    • sync tails (fps=24,setpts=PTS-STARTPTS / aresample=async=1:first_pts=0,asetpts) on every clip
    • per-segment cache keyed by ffmpeg version + encoding params + effective bg_volume
    • global --voice spans the whole output timeline (mixed in the final compose, not per-segment)
    • batch manifest (jobs.json / .csv) with auto-isolated outputs, --continue-on-error, --no-overwrite
    • QC report (per-segment drift, audio mode, disk usage, subtitle style)
    • subtitle burn LAST (Hard Rule 1), 30ms audio fades (Hard Rule 3)
  • helpers/recommend_edit_plan.py — bridges Scribe transcript JSON → edit_plan.json:

    • candidate segmentation by sentence-end punctuation / silence gap / speaker change; phrase / hard window splits for long candidates
    • local lexical scoring: 0.6 SequenceMatcher + 0.4 Jaccard (token-level for Latin, char-2-gram for CJK) blended with duration similarity
    • greedy assignment (no reuse by default)
    • emits Form A (default, drop-in for srt_driven_edit --plan) or Form B; sidecar *_review.md for human QA
    • no LLM, no API; intentionally local. The matcher cannot understand storyline — low-score matches are flagged in the review markdown
    • --packed / --context-window flags reserved as placeholders (documented as such)
  • tests/ — 28 pytest tests using lavfi-synthesized media:

    • 9 e2e tests (basic, GBK SRT, CJK output path, per-segment voice, video-only source auto-degrade, range out-of-bounds, cache hit on rerun, --no-overwrite, gap insertion)
    • 3 global-voice tests including a regression that proves segments cache independently of the global voice
    • 5 batch tests (auto-isolation, continue-on-error, hard abort, CSV manifest, per-job bg_volume cache distinctness)
    • 11 recommender tests including a full chain: recommend → sde.run_job → final.mp4
  • pyproject.toml — adds dev = ["pytest>=7"] as an optional dependency.

  • CLAUDE.md / AGENTS.md — project guidance for AI assistants working in this repo. Happy to remove or reword if these conflict with upstream framing.
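
The per-segment cache listed above can be pictured as a hash over everything that changes the encoded clip. This is a minimal sketch, not the PR's implementation: `segment_cache_key`, the payload field names, and the inclusion of source path and time range are assumptions made for illustration. Only the ffmpeg version, encoding params, and effective bg_volume come from the description.

```python
import hashlib
import json


def segment_cache_key(ffmpeg_version, source, start, end, encode_params, bg_volume):
    """Hypothetical cache key: any change to an input yields a new digest."""
    payload = {
        "ffmpeg": ffmpeg_version,           # encoder upgrades invalidate old clips
        "source": source,                   # assumption: source path is part of the key
        "range": [round(start, 3), round(end, 3)],
        "encode": encode_params,            # e.g. codec, crf, preset
        "bg_volume": round(bg_volume, 4),   # effective value after any auto-degrade
    }
    # sort_keys makes the serialization stable regardless of dict ordering
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
```

Because the global voice track is not part of the payload, changing it would leave segment keys untouched, which is the property the global-voice regression test is described as checking.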

Pipeline position

script.srt + transcript.json
  --(recommend_edit_plan.py)-->
edit_plan.json + edit_plan_review.md
  --(srt_driven_edit.py)-->
final.mp4

This complements the existing transcript-first EDL flow rather than replacing it — use the new pipeline when you already have a finished narration script and want to align it to a source recording.

Reviewer notes

  • Pure additive; no existing helper modified.
  • srt_driven_edit.py reuses 4 symbols from render.py via try: from render import ... with fallbacks, so it still runs if render.py is unavailable.
  • Tests need ffmpeg + ffprobe on PATH; conftest.py skips the whole tests/ directory otherwise.
  • The smoke test at examples/srt_driven/_smoke_test.py is a no-pytest fallback that covers the parser / encoding / cache-key layers.
  • All ffmpeg pipeline behavior was verified end-to-end on Windows with ffmpeg 8.1.1 + Python 3.12. Other platforms are untested, but the code should be portable since it uses only the stdlib + ffmpeg subprocesses.

Test plan

  • pip install -e ".[dev]"
  • python -m pytest tests/ -v (~40s on a typical machine)
  • Optional offline check: python examples/srt_driven/_smoke_test.py

🤖 Generated with Claude Code


Summary by cubic

Add a standalone SRT‑driven edit pipeline and a local edit‑plan recommender to align finished scripts to source footage without changing existing helpers. Enables an offline flow: script.srt + transcript.json → edit_plan.json → final.mp4.

  • New Features

    • helpers/srt_driven_edit.py: End‑to‑end SRT‑driven pipeline (parse + validate + align → cached extract → gap insert → concat → final compose with global --voice), with safe‑ASCII temp paths, SRT encoding fallback, ffmpeg/ffprobe preflight, batch manifests, QC report, and “burn subtitles last” with short audio fades.
    • helpers/recommend_edit_plan.py: Builds edit_plan.json from script.srt and Scribe word‑level transcript via local lexical scoring and greedy assignment; outputs Form A/B and a *_review.md for QA.
  • Dependencies

    • Add dev extra: pytest>=7.

Written for commit 87439d1. Summary will update on new commits.

xiaogang-sudo and others added 3 commits May 19, 2026 21:00
Independent helper that assembles a final cut by aligning source ranges
to an SRT timeline, bypassing the existing transcript-based EDL flow.
Use when you have a finished script (script.srt = final captions
timeline) and a list of source ranges keyed by SRT id.

Pipeline: parse SRT + plan -> strict validate -> align -> extract
segments (per-source ffprobe, HDR tone-map, sync tails, cache) -> gap
clips for non-contiguous SRT cues -> lossless concat -> final pass with
optional global voice mix + subtitle burn LAST (Hard Rule 1).

Key correctness properties:
- All intermediates land in a safe-ASCII temp work_dir; CJK / quoted
  user paths never reach libavfilter or the concat demuxer.
- SRT input decoded with utf-8-sig / utf-8 / gb18030 / cp936 / cp1252
  fallback; cue settings (position:90% etc.) tolerated.
- Per-segment cache keyed by ffmpeg version + encoding params +
  effective bg_volume so encoder tweaks invalidate stale clips.
- Source streams probed once; no-audio source auto-degrades bg_volume
  to 0 for its segments; out-of-bounds ranges fail fast.
- Global --voice spans the whole timeline (apad/atrim to total_duration
  in the final compose), not per-segment — a 5s VO does not restart at
  every cut.
- 30ms audio fades + fps=24,setpts and aresample sync tails on every
  segment prevent A/V drift through many short concats.
- burn_subtitles is self-defending: unsafe subs paths are copied to a
  temp ASCII SRT before being fed to libavfilter.
- Batch (jobs.json / .csv) auto-isolates outputs by manifest index;
  --continue-on-error skips failing rows; --no-overwrite refuses to
  clobber existing outputs.
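
The SRT decoding fallback listed above can be sketched as trying each encoding in order and keeping the first that decodes cleanly. The encoding list and its order come from the commit message; `decode_srt_bytes` is a hypothetical name and the error handling is an assumption.

```python
SRT_ENCODINGS = ("utf-8-sig", "utf-8", "gb18030", "cp936", "cp1252")


def decode_srt_bytes(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding_used); hypothetical helper, not the PR's code."""
    for enc in SRT_ENCODINGS:
        try:
            # strict decoding: a bad byte raises instead of silently substituting
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode SRT with any of: " + ", ".join(SRT_ENCODINGS))
```

Order matters here: the first encoding that accepts the bytes wins, so a permissive codec placed early would mask the ones after it.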

Includes examples (Form A array, Form B object with multi-source +
voices, batch manifest, CJK SRT) and pytest coverage (14 e2e + batch
tests using lavfi-synthesized media; passes against ffmpeg 8.x on
Windows).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cript

Bridges the gap between Scribe word-level transcripts and the
srt_driven_edit pipeline. Given a final-cut script.srt and a source
recording's Scribe JSON, produces an edit_plan.json (Form A or B) plus
a sidecar review markdown for human-in-the-loop QA.

Matching strategy is intentionally local (no LLM, no API):
  1. Filter the transcript to timestamped 'word' tokens (audio_event /
     spacing skipped; --keep-audio-events keeps markers as context).
  2. Group consecutive words into non-overlapping candidates, breaking
     on sentence-end punctuation, silences >= gap_threshold, or speaker
     change. Long candidates split at phrase punctuation, then by hard
     word-level windows. All edges land on word boundaries.
  3. Score each (cue, candidate) pair as
       0.7 * (0.6 * SequenceMatcher + 0.4 * Jaccard)
       + 0.3 * 1/(1+|dur_delta|/cue_dur)
     where Jaccard auto-switches between Latin word-token and CJK
     character-bigram representations.
  4. Greedy assignment; --allow-reuse drops the no-reuse constraint.
  5. Emit Form A (default, drop-in for srt_driven_edit --plan) or Form
     B; review markdown lists matched text, score, duration delta, and
     warnings (low score / duration mismatch / candidate-shorter-than-
     cue).
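
The scoring formula in step 3 can be sketched with the stdlib alone. This is an illustrative reading, not the PR's code: `score`, `jaccard`, and `_units` are hypothetical names, the CJK detection regex and the Latin tokenizer are assumptions, and `dur_delta` is taken to mean candidate duration minus cue duration.

```python
import re
from difflib import SequenceMatcher

# Assumption: a rough CJK range check decides which representation to use.
_CJK = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")


def _units(text):
    """Character bigrams for CJK text, lowercase word tokens otherwise."""
    text = text.lower()
    if _CJK.search(text):
        chars = [c for c in text if not c.isspace()]
        bigrams = {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}
        return bigrams or set(chars)
    return set(re.findall(r"[a-z0-9']+", text))


def jaccard(a, b):
    ua, ub = _units(a), _units(b)
    if not ua and not ub:
        return 0.0
    return len(ua & ub) / len(ua | ub)


def score(cue_text, cue_dur, cand_text, cand_dur):
    seq = SequenceMatcher(None, cue_text.lower(), cand_text.lower()).ratio()
    lexical = 0.6 * seq + 0.4 * jaccard(cue_text, cand_text)
    # Guard against zero-length cues (an assumption; the source does not say).
    dur_sim = 1.0 / (1.0 + abs(cand_dur - cue_dur) / max(cue_dur, 1e-6))
    return 0.7 * lexical + 0.3 * dur_sim
```

With these weights, a candidate whose text matches exactly but runs twice as long as the cue loses 0.15 of a possible 1.0, so duration acts as a tie-breaker rather than a veto.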

Hard failure modes (exit 1): any cue with no assignable candidate;
malformed transcript JSON; transcript with no word tokens.
Soft failures (warnings only): low score, candidate too short for cue.
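
The greedy assignment and its hard-failure rule can be sketched as follows. The commit message does not say in which order pairs are considered, so this sketch assumes global best-score-first; `greedy_assign` and the nested-dict score layout are hypothetical.

```python
def greedy_assign(scores, allow_reuse=False):
    """scores[cue_id][cand_id] -> float. Returns {cue_id: cand_id}.

    Assumption: pairs are visited from highest score to lowest.
    """
    pairs = sorted(
        ((s, cue, cand) for cue, row in scores.items() for cand, s in row.items()),
        key=lambda t: t[0],
        reverse=True,
    )
    assigned, used = {}, set()
    for _score, cue, cand in pairs:
        if cue in assigned:
            continue  # cue already has its best available candidate
        if not allow_reuse and cand in used:
            continue  # no-reuse constraint (the default)
        assigned[cue] = cand
        used.add(cand)
    missing = [cue for cue in scores if cue not in assigned]
    if missing:
        # A string argument makes the interpreter exit with status 1.
        raise SystemExit(f"no assignable candidate for cue(s): {missing}")
    return assigned
```

Greedy matching is not globally optimal: an early high-scoring pair can leave a later cue with a poor candidate, which is one reason low-score warnings in the review markdown matter.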

The matcher cannot understand storyline — if SRT narration words do
not appear in the source transcript, scores will be low. The sidecar
review.md is the manual QA surface; it is intentionally not pulled
into the plan (parse_plan in srt_driven_edit stays strict).

--packed (takes_packed.md) and --context-window flags are reserved
placeholders only; both are accepted without error but do not yet alter
behavior.

Includes 11 pytest tests, among them a full end-to-end chain:
recommend -> sde.run_job -> final.mp4 against lavfi-synthesized media.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CLAUDE.md is auto-loaded by Claude Code when working in this directory,
giving sessions a consistent picture of the project's scope, tech
constraints, and out-of-bounds behaviors before the user has to say it.

AGENTS.md does the same for Codex review sessions, classifying review
output into must-fix / should-improve / later so suggestions are
actionable rather than open-ended rewrites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@cubic-dev-ai cubic-dev-ai Bot left a comment


3 issues found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="helpers/recommend_edit_plan.py">

<violation number="1" location="helpers/recommend_edit_plan.py:134">
P2: `--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</violation>
</file>

<file name="helpers/srt_driven_edit.py">

<violation number="1" location="helpers/srt_driven_edit.py:553">
P2: Per-segment voice files lack preflight audio stream validation</violation>

<violation number="2" location="helpers/srt_driven_edit.py:771">
P1: Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</violation>
</file>


    vf_parts: list[str] = []
    if is_hdr_source(seg.source_path):
        vf_parts.append(TONEMAP_CHAIN)
    vf_parts.append(scale_filter_for(seg.source_path))

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P1: Per-segment orientation scaling conflicts with concat demuxer -c copy stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.
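
One possible remedy, sketched under the assumption that every segment can be fitted onto a single job-wide canvas: scale each clip to fit inside the target frame, pad the remainder, and normalize the sample aspect ratio so the concat demuxer sees identical parameters. `normalize_filter` and its defaults are hypothetical and not part of the PR; the filter options themselves (`force_original_aspect_ratio`, `pad`, `setsar`) are standard ffmpeg.

```python
def normalize_filter(width=1920, height=1080, fps=24):
    """Build a -vf chain that letterboxes or pillarboxes onto one canvas size."""
    return (
        # shrink to fit inside the canvas while preserving aspect ratio
        f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
        # center the scaled frame on the canvas
        f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2,"
        # square pixels, then the same sync tail the PR describes
        f"setsar=1,fps={fps},setpts=PTS-STARTPTS"
    )
```

The trade-off is that portrait footage in a landscape job ends up pillarboxed. Choosing the canvas from the first or majority source, or rejecting mixed orientations at validation time, would be alternative policies.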

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 771:

<comment>Per-segment orientation scaling conflicts with concat demuxer `-c copy` stream-copy requirement, which mandates identical stream parameters (including resolution) across all inputs. In Form B multi-source plans, mixing portrait and landscape sources produces clips with mismatched resolutions (e.g. 1080×1920 vs 1920×1080), causing ffmpeg concat failure or glitched output.</comment>

<file context>
@@ -0,0 +1,1522 @@
+    vf_parts: list[str] = []
+    if is_hdr_source(seg.source_path):
+        vf_parts.append(TONEMAP_CHAIN)
+    vf_parts.append(scale_filter_for(seg.source_path))
+
+    if seg.pad_short and seg.plan_src_dur + 1e-6 < target:
</file context>
    return out


def build_candidates(

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P2: --keep-audio-events / keep_audio_events is dead code: audio events kept in load_transcript_words are silently discarded in build_candidates' unconditional type != "word" filter. The flag produces identical output in both states, misleading users who expect (laughter)/(applause) context to be included in candidate text.
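
A minimal sketch of what honoring the flag in the candidate filter could look like. The `type` values `word`, `audio_event`, and `spacing` come from the commit message; `filter_tokens` and the `text` field are hypothetical, and the real `build_candidates` signature is not shown in full here.

```python
def filter_tokens(words, keep_audio_events=False):
    """Keep 'word' tokens, plus 'audio_event' tokens when the flag is set."""
    allowed = {"word"}
    if keep_audio_events:
        allowed.add("audio_event")
    # 'spacing' tokens are always dropped
    return [w for w in words if w.get("type") in allowed]
```

Threading the flag through to this filter would make the two states produce different candidate text. Removing the flag until it is wired up would be the other consistent option.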

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/recommend_edit_plan.py, line 134:

<comment>`--keep-audio-events` / `keep_audio_events` is dead code: audio events kept in `load_transcript_words` are silently discarded in `build_candidates`' unconditional `type != "word"` filter. The flag produces identical output in both states, misleading users who expect `(laughter)`/`(applause)` context to be included in candidate text.</comment>

<file context>
@@ -0,0 +1,561 @@
+    return out
+
+
+def build_candidates(
+    words: list[dict],
+    *,
</file context>
            raise SystemExit(f"source '{name}' missing on disk: {sp}")
    for name, vp in voices_map.items():
        if not vp.exists():
            raise SystemExit(f"voice '{name}' missing on disk: {vp}")

@cubic-dev-ai cubic-dev-ai Bot May 19, 2026


P2: Per-segment voice files lack preflight audio stream validation
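
A sketch of the missing check, assuming ffprobe is on PATH (the PR already requires it). `has_audio_stream` and `preflight_voices` are hypothetical names; the `run` and `probe` parameters exist only so the logic can be exercised without real media.

```python
import subprocess


def has_audio_stream(path, run=subprocess.run):
    """True if ffprobe reports at least one audio stream in the file."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a",              # audio streams only
        "-show_entries", "stream=codec_type",
        "-of", "csv=p=0",                    # one bare value per stream
        str(path),
    ]
    proc = run(cmd, capture_output=True, text=True)
    return proc.returncode == 0 and "audio" in proc.stdout


def preflight_voices(voices_map, probe=has_audio_stream):
    """Fail fast, in the same style as the existing missing-file checks."""
    for name, vp in voices_map.items():
        if not probe(vp):
            raise SystemExit(f"voice '{name}' has no audio stream: {vp}")
```

Running this right after the existence checks would surface a silent or video-only voice file before any segment is extracted, rather than midway through the mix.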

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At helpers/srt_driven_edit.py, line 553:

<comment>Per-segment voice files lack preflight audio stream validation</comment>

<file context>
@@ -0,0 +1,1522 @@
+            raise SystemExit(f"source '{name}' missing on disk: {sp}")
+    for name, vp in voices_map.items():
+        if not vp.exists():
+            raise SystemExit(f"voice '{name}' missing on disk: {vp}")
+    if legacy_default_source is not None and not legacy_default_source.exists():
+        raise SystemExit(f"--source missing on disk: {legacy_default_source}")
</file context>
