
perf: optimize hot paths in apply_writes, RunnableCallable, hash functions, and add_messages #6969

Open
John Kennedy (jkennedyvz) wants to merge 8 commits into main from jk/optimizations

Conversation

Contributor

@jkennedyvz John Kennedy (jkennedyvz) commented Feb 28, 2026

Summary

Performance optimizations targeting hot paths identified via cProfile profiling of the benchmark suite. Focuses on eliminating redundant work in the graph execution loop.

  • Track available channels incrementally in apply_writes — replace O(n) scan of all channels calling is_available() every step with an incrementally maintained _available_channels: set[str] on PregelLoop. For sequential_1000 (~1003 channels, ~1000 steps), this eliminates 1M+ is_available() calls. The set is updated as a side-effect of consume(), update(), and finish() via a local _track() helper.
  • Cache the UntrackedValue isinstance scan — any(isinstance(ch, UntrackedValue) ...) scanned all channels on every put_writes call. The result is now cached as a _has_untracked_channels bool once in __enter__/__aenter__.
  • Cache inspect.signature in RunnableCallable.__init__ — the same functions (e.g., ChannelWrite._write) were inspected thousands of times. Added a module-level _SIGNATURE_CACHE dict keyed by function object with graceful fallback for unhashable callables.
  • Remove isinstance from hash functions — _xxhash_str and _uuid5_str checked isinstance(p, str) for every part despite all call sites always passing strings. Narrowed the type signature from str | bytes to str and removed the check.
  • Flatten task_path_str — was fully recursive, now iterates elements directly and only recurses for nested tuples (rare case). Saves function call overhead for the common case.
  • Remove unnecessary typing.cast in add_messages — eliminated ~92K no-op cast() function calls per react_agent_100x run.
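
The task_path_str flattening described above can be sketched as follows. This is a hypothetical reconstruction (the separator and formatting are assumptions; the real function lives in langgraph), showing the shape of the change: iterate top-level elements directly and recurse only when an element is itself a tuple.

```python
def task_path_str(tup: tuple) -> str:
    """Join path elements into a string, recursing only for the rare
    nested-tuple case instead of making a recursive call per element."""
    parts = []
    for el in tup:
        if isinstance(el, tuple):
            parts.append(task_path_str(el))  # rare: nested tuple
        else:
            parts.append(str(el))  # common: str or int, no extra frame
    return "|".join(parts)  # separator is an assumption for illustration
```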

Test plan

  • Full test suite passes (1036 passed, 4 skipped)
  • make lint clean (ruff check, ruff format, mypy)

🤖 Generated with Claude Code

Add --profile flag to bench/__main__.py that bypasses pyperf and runs
each benchmark under cProfile, printing per-benchmark hotspot summaries
and writing .prof files for later analysis. Add benchmark-profile and
benchmark-profile-spy Makefile targets.

Fix O(n^2) performance regression in _get_model_input_state where
f-strings eagerly evaluated repr(state) on every call, triggering
pydantic __repr__ across all accumulated messages. Move error message
construction into the error path so repr is only called when needed.
This yields a 3-5x speedup on react_agent_100x benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `put_writes` method was scanning all channels with
`any(isinstance(ch, UntrackedValue) ...)` on every call. For the
sequential_1000 benchmark this produced 1M+ isinstance calls through
the ABC machinery, consuming 40% of total runtime.

Cache the result as `_has_untracked_channels` once in __enter__ and
__aenter__, replacing both scan sites (put_writes and checkpoint
sanitization). This yields a 1.8-2.3x speedup on sequential_1000.
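
The caching pattern looks roughly like this (UntrackedValue and the put_writes filtering logic are stand-ins for illustration; the real classes live in langgraph):

```python
class UntrackedValue:
    """Stand-in for langgraph's UntrackedValue channel type."""


class Loop:
    _has_untracked_channels: bool

    def __init__(self, channels: dict) -> None:
        self.channels = channels
        self._has_untracked_channels = False

    def __enter__(self) -> "Loop":
        # One O(n) scan at loop entry replaces an O(n) scan on every
        # put_writes call.
        self._has_untracked_channels = any(
            isinstance(ch, UntrackedValue) for ch in self.channels.values()
        )
        return self

    def __exit__(self, *exc) -> bool:
        return False

    def put_writes(self, writes: dict) -> dict:
        # Hypothetical use of the cached flag: skip filtering entirely
        # in the common case of no untracked channels.
        if not self._has_untracked_channels:
            return writes
        return {
            k: v
            for k, v in writes.items()
            if not isinstance(self.channels.get(k), UntrackedValue)
        }
```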

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add noqa for intentional E402 imports after --profile guard in
bench/__main__.py. Add _has_untracked_channels type annotation to
PregelLoop class for mypy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ites

Maintain a set of currently-available channel names, updated incrementally
as channels change state, so the step-bump loop in apply_writes only
iterates available channels instead of scanning all channels with
is_available(). For sequential_1000 this reduces function calls by ~54%
and improves overall runtime by ~26%.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tten task_path_str

Three micro-optimizations targeting disproportionate call counts:

1. Cache inspect.signature results in RunnableCallable.__init__ — avoids
   repeated signature introspection for the same function (e.g. 1000
   ChannelWrite instances all inspecting the same _write/_awrite methods).

2. Remove per-element isinstance check in _xxhash_str/_uuid5_str — all
   call sites pass string parts, so encode() directly without checking.

3. Flatten task_path_str to avoid recursive calls for the common case
   of tuple elements being str or int (not nested tuples).

Reduces isinstance calls from ~43K to ~38K and total function calls by
~8K for sequential_1000.
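
The signature cache (item 1) can be sketched like this — a hypothetical reconstruction of _SIGNATURE_CACHE, keyed by the function object with the graceful fallback for unhashable callables mentioned in the summary:

```python
import inspect
from collections.abc import Callable

# Module-level cache: the same function object (e.g. ChannelWrite._write)
# is inspected once instead of once per RunnableCallable instance.
_SIGNATURE_CACHE: dict[Callable, inspect.Signature] = {}


def cached_signature(func: Callable) -> inspect.Signature:
    try:
        sig = _SIGNATURE_CACHE.get(func)
    except TypeError:
        # Unhashable callables cannot be dict keys; fall back to
        # uncached introspection.
        return inspect.signature(func)
    if sig is None:
        sig = inspect.signature(func)
        _SIGNATURE_CACHE[func] = sig
    return sig
```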

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The cast(BaseMessageChunk, m) calls in add_messages were no-ops at
runtime but accounted for ~92K function calls per react_agent_100x
benchmark run (~3ms overhead). Remove them since message_chunk_to_message
already handles the type internally.
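
For illustration: typing.cast returns its argument unchanged, existing purely for static type checkers, but each call is still a real Python function call — which adds up inside a per-message loop:

```python
from typing import cast

messages = ["a", "b", "c"]

# Before: one cast() call per element, a runtime no-op.
with_cast = [cast(str, m) for m in messages]
# After: no call at all.
without_cast = [m for m in messages]

assert with_cast == without_cast
assert cast(str, messages[0]) is messages[0]  # identity at runtime
```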

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move --profile flag and Makefile targets (benchmark-profile,
benchmark-profile-spy) to a separate patch for a future PR.
This PR now contains only runtime performance optimizations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes

Replace 5 repeated inline blocks that sync available_channels with
a local _track() helper that checks is_available() and updates the
set, returning the availability bool for callers that also need to
update updated_channels.
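
A hypothetical reconstruction of the _track() pattern (Channel here is a minimal stub; the real channels and apply_writes live in langgraph.pregel):

```python
class Channel:
    """Stub channel exposing only the availability interface."""

    def __init__(self, available: bool = False) -> None:
        self._available = available

    def is_available(self) -> bool:
        return self._available

    def update(self, value) -> None:
        self._available = True


def apply_writes(
    channels: dict[str, Channel],
    available_channels: set[str],
    writes: dict[str, object],
) -> set[str]:
    def _track(name: str) -> bool:
        # One local helper replaces the repeated inline sync blocks:
        # refresh the availability set and report the result so callers
        # can also update updated_channels.
        avail = channels[name].is_available()
        if avail:
            available_channels.add(name)
        else:
            available_channels.discard(name)
        return avail

    updated_channels: set[str] = set()
    for name, value in writes.items():
        channels[name].update(value)
        if _track(name):
            updated_channels.add(name)
    return updated_channels
```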

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jkennedyvz John Kennedy (jkennedyvz) changed the title from perf: fix quadratic bottlenecks and add benchmark profiling mode Feb 28, 2026
Contributor Author

John Kennedy (jkennedyvz) commented Feb 28, 2026

From CI

Benchmark main changes
react_agent_100x_sync 719 ms 175 ms: 4.11x faster
react_agent_100x_checkpoint_sync 722 ms 183 ms: 3.96x faster
react_agent_100x 756 ms 214 ms: 3.53x faster
react_agent_100x_checkpoint 764 ms 219 ms: 3.48x faster
sequential_1000_sync 373 ms 155 ms: 2.41x faster
sequential_1000 439 ms 218 ms: 2.01x faster
react_agent_10x_checkpoint_sync 22.4 ms 14.5 ms: 1.54x faster
react_agent_10x_sync 20.7 ms 13.6 ms: 1.52x faster
react_agent_10x_checkpoint 26.6 ms 18.4 ms: 1.44x faster
react_agent_10x 25.2 ms 17.8 ms: 1.42x faster
wide_state_25x300_sync 10.4 ms 9.39 ms: 1.11x faster
pydantic_state_9x1200_checkpoint_sync 61.7 ms 55.7 ms: 1.11x faster
pydantic_state_25x300_sync 18.6 ms 16.8 ms: 1.10x faster
wide_dict_25x300_sync 10.2 ms 9.26 ms: 1.10x faster
pydantic_state_9x1200_checkpoint 67.6 ms 61.4 ms: 1.10x faster
fanout_to_subgraph_10x_checkpoint 34.5 ms 31.4 ms: 1.10x faster
fanout_to_subgraph_10x 32.8 ms 29.9 ms: 1.10x faster
pydantic_state_15x600_checkpoint_sync 71.7 ms 65.7 ms: 1.09x faster
wide_dict_15x600_sync 13.2 ms 12.2 ms: 1.08x faster
pydantic_state_15x600_checkpoint 77.0 ms 71.4 ms: 1.08x faster
pydantic_state_25x300_checkpoint 44.0 ms 40.8 ms: 1.08x faster
pydantic_state_25x300_checkpoint_sync 39.6 ms 37.0 ms: 1.07x faster
pydantic_state_25x300 22.8 ms 21.4 ms: 1.07x faster
fanout_to_subgraph_100x_sync 297 ms 280 ms: 1.06x faster
fanout_to_subgraph_100x_checkpoint_sync 317 ms 301 ms: 1.05x faster