perf: optimize hot paths in apply_writes, RunnableCallable, hash functions, and add_messages #6969

Open

John Kennedy (jkennedyvz) wants to merge 8 commits into main from
Conversation
Add --profile flag to bench/__main__.py that bypasses pyperf and runs each benchmark under cProfile, printing per-benchmark hotspot summaries and writing .prof files for later analysis. Add benchmark-profile and benchmark-profile-spy Makefile targets. Fix O(n^2) performance regression in _get_model_input_state where f-strings eagerly evaluated repr(state) on every call, triggering pydantic __repr__ across all accumulated messages. Move error message construction into the error path so repr is only called when needed. This yields a 3-5x speedup on react_agent_100x benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
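The f-string fix above can be sketched as follows. `State` and both function names are hypothetical stand-ins for the real `_get_model_input_state`; the only point is where the f-string (and hence `repr(state)`) is evaluated.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    # Stand-in for accumulated graph state; repr() walks every message.
    messages: list = field(default_factory=list)

def get_model_input_slow(state: State) -> list:
    # Anti-pattern: the f-string evaluates repr(state) on EVERY call,
    # even on the happy path where no error is raised.
    error_msg = f"Expected a list of messages in state, got: {state!r}"
    if isinstance(state.messages, list):
        return state.messages
    raise ValueError(error_msg)

def get_model_input_fast(state: State) -> list:
    if isinstance(state.messages, list):
        return state.messages
    # repr(state) is now only computed on the error path.
    raise ValueError(f"Expected a list of messages in state, got: {state!r}")
```

Both functions return the same result for valid input; only the slow variant pays the `repr` cost unconditionally, which is what made it O(n^2) as messages accumulated.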
The `put_writes` method was scanning all channels with `any(isinstance(ch, UntrackedValue) ...)` on every call. For the sequential_1000 benchmark this produced 1M+ isinstance calls through the ABC machinery, consuming 40% of total runtime. Cache the result as `_has_untracked_channels` once in __enter__ and __aenter__, replacing both scan sites (put_writes and checkpoint sanitization). This yields a 1.8-2.3x speedup on sequential_1000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
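A minimal sketch of the caching pattern, using stand-in channel classes rather than the real `PregelLoop` API (which caches the same way in both `__enter__` and `__aenter__`):

```python
class UntrackedValue:
    # Stand-in for the real untracked channel type.
    pass

class TrackedValue:
    pass

class PregelLoopSketch:
    def __init__(self, channels: dict) -> None:
        self.channels = channels

    def __enter__(self):
        # Scan the channel types ONCE at entry, instead of running
        # any(isinstance(...)) over all channels on every put_writes call.
        self._has_untracked_channels = any(
            isinstance(ch, UntrackedValue) for ch in self.channels.values()
        )
        return self

    def __exit__(self, *exc) -> bool:
        return False

    def put_writes(self, writes: list) -> list:
        # Only filter writes when the (cached) flag says it is necessary.
        if self._has_untracked_channels:
            return [
                w for w in writes
                if not isinstance(self.channels.get(w[0]), UntrackedValue)
            ]
        return writes
```

The flag is safe to cache because the set of channels is fixed for the lifetime of the loop context.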
Add noqa for intentional E402 imports after --profile guard in bench/__main__.py. Add _has_untracked_channels type annotation to PregelLoop class for mypy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ites Maintain a set of currently-available channel names, updated incrementally as channels change state, so the step-bump loop in apply_writes only iterates available channels instead of scanning all channels with is_available(). For sequential_1000 this reduces function calls by ~54% and improves overall runtime by ~26%. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
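The incremental-set idea can be sketched with a toy channel class (the names here are illustrative, not the real `PregelLoop` internals):

```python
class Channel:
    # Toy stand-in for a Pregel channel: available once written, until consumed.
    def __init__(self) -> None:
        self._set = False
        self.value = None

    def update(self, v) -> None:
        self.value, self._set = v, True

    def consume(self) -> None:
        self._set = False

    def is_available(self) -> bool:
        return self._set

def available_by_scan(channels: dict) -> set:
    # Old approach: O(n) scan of every channel, repeated every step.
    return {name for name, ch in channels.items() if ch.is_available()}

class AvailabilityTracker:
    # New approach: keep the set in sync as channels change state, so the
    # step-bump loop only ever iterates channels that are actually available.
    def __init__(self, channels: dict) -> None:
        self.channels = channels
        self.available = available_by_scan(channels)

    def update(self, name, value) -> None:
        self.channels[name].update(value)
        self.available.add(name)

    def consume(self, name) -> None:
        self.channels[name].consume()
        self.available.discard(name)
```

The invariant is that `tracker.available` always equals the result of a full scan, but maintaining it costs O(1) per state change instead of O(n) per step.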
…tten task_path_str Three micro-optimizations targeting disproportionate call counts: 1. Cache inspect.signature results in RunnableCallable.__init__ — avoids repeated signature introspection for the same function (e.g. 1000 ChannelWrite instances all inspecting the same _write/_awrite methods). 2. Remove per-element isinstance check in _xxhash_str/_uuid5_str — all call sites pass string parts, so encode() directly without checking. 3. Flatten task_path_str to avoid recursive calls for the common case of tuple elements being str or int (not nested tuples). Reduces isinstance calls from ~43K to ~38K and total function calls by ~8K for sequential_1000. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
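The first and third of these micro-optimizations can be sketched as below. The cache-lookup shape and the `~` separator in the path string are illustrative assumptions, not the exact upstream code.

```python
import inspect

# 1. Module-level signature cache keyed by the function object, so the same
#    function is introspected once no matter how many wrapper instances
#    (e.g. 1000 ChannelWrite objects) are constructed around it.
_SIGNATURE_CACHE: dict = {}

def cached_signature(func) -> inspect.Signature:
    try:
        sig = _SIGNATURE_CACHE.get(func)
    except TypeError:
        # Graceful fallback for unhashable callables: inspect without caching.
        return inspect.signature(func)
    if sig is None:
        sig = _SIGNATURE_CACHE[func] = inspect.signature(func)
    return sig

# 3. Flattened path stringification: handle the common str/int elements in a
#    plain loop and only recurse for the rare nested-tuple case.
def task_path_str_flat(path) -> str:
    if not isinstance(path, tuple):
        return str(path)
    parts = []
    for p in path:
        parts.append(task_path_str_flat(p) if isinstance(p, tuple) else str(p))
    return "~".join(parts)
```

Because `dict.get` hashes its key, unhashable callables raise `TypeError` and fall through to uncached introspection, so the cache never changes behavior, only cost.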
The cast(BaseMessageChunk, m) calls in add_messages were no-ops at runtime but accounted for ~92K function calls per react_agent_100x benchmark run (~3ms overhead). Remove them since message_chunk_to_message already handles the type internally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
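`typing.cast` is a runtime no-op that just returns its second argument, but it is still a real Python function call per element. A small sketch (hypothetical function names) of why dropping it is behavior-preserving:

```python
from typing import cast

def merge_with_cast(chunks: list) -> list:
    # cast() changes nothing at runtime; it only informs the type checker,
    # yet each call still pays normal function-call overhead.
    return [cast(str, c) for c in chunks]

def merge_without_cast(chunks: list) -> list:
    # Identical runtime behavior, one fewer call per element.
    return list(chunks)
```

At ~92K calls per benchmark run, that per-element overhead is what showed up in the profile.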
Move --profile flag and Makefile targets (benchmark-profile, benchmark-profile-spy) to a separate patch for a future PR. This PR now contains only runtime performance optimizations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…writes Replace 5 repeated inline blocks that sync available_channels with a local _track() helper that checks is_available() and updates the set, returning the availability bool for callers that also need to update updated_channels. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
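A sketch of the helper pattern with a toy channel type (the real `_track()` is a closure inside `apply_writes` and also syncs on consume/finish):

```python
class Chan:
    # Toy channel: available once written.
    def __init__(self) -> None:
        self._set = False
        self.value = None

    def update(self, v) -> None:
        self.value, self._set = v, True

    def is_available(self) -> bool:
        return self._set

def apply_writes(channels: dict, writes: list,
                 available_channels: set, updated_channels: set) -> None:
    def _track(name: str) -> bool:
        # Single point that syncs the availability set; returns the bool
        # so callers can also record the channel as updated.
        avail = channels[name].is_available()
        if avail:
            available_channels.add(name)
        else:
            available_channels.discard(name)
        return avail

    for name, value in writes:
        channels[name].update(value)
        if _track(name):
            updated_channels.add(name)
```

Centralizing the sync in one closure removes the risk of the five inline copies drifting out of agreement.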
Summary
Performance optimizations targeting hot paths identified via cProfile profiling of the benchmark suite. Focuses on eliminating redundant work in the graph execution loop.
- `apply_writes` — replace the O(n) scan of all channels calling `is_available()` every step with an incrementally maintained `_available_channels: set[str]` on `PregelLoop`. For `sequential_1000` (~1003 channels, ~1000 steps), this eliminates 1M+ `is_available()` calls. The set is updated as a side effect of `consume()`, `update()`, and `finish()` via a local `_track()` helper.
- `UntrackedValue` isinstance scan — `any(isinstance(ch, UntrackedValue) ...)` scanned all channels on every `put_writes` call. Cached as a `_has_untracked_channels` bool once in `__enter__`/`__aenter__`.
- `inspect.signature` in `RunnableCallable.__init__` — the same functions (e.g., `ChannelWrite._write`) were inspected thousands of times. Added a module-level `_SIGNATURE_CACHE` dict keyed by function object, with a graceful fallback for unhashable callables.
- `isinstance` removed from hash functions — `_xxhash_str` and `_uuid5_str` checked `isinstance(p, str)` for every part despite all call sites always passing strings. Narrowed the type signature from `str | bytes` to `str` and removed the check.
- `task_path_str` — was fully recursive; now iterates elements directly and only recurses for nested tuples (the rare case). Saves function-call overhead in the common case.
- `typing.cast` in `add_messages` — eliminated ~92K no-op `cast()` function calls per `react_agent_100x` run.

Test plan

- `make lint` clean (ruff check, ruff format, mypy)

🤖 Generated with Claude Code