Simplify LLM configuration layer (closes #504)#524
Conversation
LiteLLM's per-provider branches (deepseek, anthropic, groq, etc.) don't consult ``litellm.api_key`` (the module global Strix sets). They only check the per-call ``api_key`` kwarg and the ``<PROVIDER>_API_KEY`` env var. The SDK's LitellmModel passes ``api_key=None`` by default, so requests went out with an empty bearer and DeepSeek (and friends) returned 401. Mirror the user's LLM_API_KEY into the provider-specific env var (``DEEPSEEK_API_KEY`` for ``deepseek/...``, ``ANTHROPIC_API_KEY`` for ``anthropic/...``, etc.) using LiteLLM's documented convention. ``os.environ.setdefault`` is used so an explicit user env is never clobbered. The OpenAI branch was already working via ``set_default_openai_key`` + the existing ``litellm.api_key`` global fallback. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Greptile SummaryThis PR replaces the hand-rolled LLM configuration layer with a thin
Confidence Score: 5/5Safe to merge; all changes are well-scoped simplifications with no functional regressions on the main paths. The core routing change is straightforward and well-tested by the warm-up path. Key mirroring now delegates to litellm's own environment validation rather than a fragile hand-rolled map, fixing the previously flagged provider naming bugs. The two concerns raised are maintenance-oriented: the _supports_responses_custom_tools function uses supports_reasoning as a proxy (a reasonable assumption for current OpenAI models), and the shutdown-race guard matches on a string literal. Neither affects correctness today. strix/config/models.py — the _supports_responses_custom_tools coupling between supports_reasoning and Responses API capability detection is worth revisiting if new OpenAI models diverge from the current pattern. Important Files Changed
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
strix/config/models.py:125-130
**`supports_reasoning` used as a proxy for Responses API custom tool support**
`_supports_responses_custom_tools` returns `supports_reasoning` from litellm's model cost table, but the two capabilities are distinct. Today the correlation holds (o3/o4-mini have `supports_reasoning=True` and also accept native Responses API capabilities like `Filesystem`/`Shell`), but a new OpenAI model could support one without the other. If that happens, `uses_chat_completions_tool_schema` will silently mis-format tool definitions for the new model, sending Responses-API capability schemas when function-tool schemas are required (or vice versa), causing tool invocation failures at runtime with no obvious error at startup.
### Issue 2 of 2
strix/core/execution.py:359-361
**Fragile substring match on a runtime error message**
Catching `RuntimeError` by checking for `"after shutdown"` in the message string is fragile in two ways: if LiteLLM ever rewords the message (e.g. "cannot schedule new futures after interpreter shutdown"), the race exception will propagate as an unhandled error instead of being silently swallowed; and if some other `RuntimeError` happens to contain "after shutdown", it will be incorrectly suppressed. Consider matching against a specific exception type or a more structured sentinel if LiteLLM exposes one; alternatively, tightening the guard slightly reduces the blast radius.
```suggestion
except RuntimeError as stream_exc:
if not any(
phrase in str(stream_exc)
for phrase in ("after shutdown", "after interpreter shutdown")
):
raise
```
Reviews (6): Last reviewed commit: "Use function-tool schema for non-reasoni..." | Re-trigger Greptile |
normalize_model_name expands `claude-*` and `gemini-*` shorthands into `litellm/anthropic/...` and `litellm/gemini/...` at routing time, but the mirror helper was looking at the raw pre-normalization name — bare shorthands had no `/` and hit the early return, so ANTHROPIC_API_KEY / GEMINI_API_KEY were never populated for those users. Run the same normalization inside the mirror helper so the provider prefix is consistent with what LiteLLM actually sees downstream.
Naively uppercasing the routing prefix breaks for providers whose LiteLLM env var name doesn't match the prefix verbatim: together_ai/... needs TOGETHERAI_API_KEY (no underscore) perplexity/... needs PERPLEXITYAI_API_KEY Ask LiteLLM directly via litellm.validate_environment(model=...) which env vars it consults for the chosen provider, then setdefault each one to LLM_API_KEY. This is the SDK-blessed lookup and stays correct for every provider LiteLLM supports without a hand-maintained name map. Lowercase the routed model name before lookup so mixed-case user input (e.g. Together_AI/...) still resolves.
A bare model name without a provider prefix routes through the SDK's default OpenAI provider, so configuring STRIX_LLM=deepseek-v4-pro with LLM_API_KEY=<deepseek key> sends that key to api.openai.com and surfaces a confusing "Incorrect API key" error pointing at the OpenAI dashboard. When warm-up fails with an OpenAI-shaped error AND the configured model is still unprefixed after normalize_model_name, append a hint that points the user at the '<provider>/<model>' form with concrete examples.
When the user opts into reasoning_effort but the configured model
isn't in litellm.model_cost at all (private SKUs, fresh releases the
registry hasn't picked up — e.g. deepseek/deepseek-v4-pro), we can't
confirm thinking support and were sending tool_choice="required",
which thinking-mode endpoints reject ("Thinking mode does not support
this tool_choice").
Add model_known_to_registry() and split the decision: when the user
wants reasoning AND the model is either confirmed-reasoning OR
unknown-to-registry, drop tool_choice. The Reasoning(effort=...) param
still only attaches for confirmed-reasoning models, so we don't send
reasoning hints to known non-reasoning models.
Known non-reasoning models (gpt-4o, registry-confirmed) keep
tool_choice="required" unchanged.
Users had to type STRIX_LLM=litellm/deepseek/deepseek-chat — the litellm/ wrapper was Strix-internal plumbing surfacing in user config. Add StrixProvider, a MultiProvider subclass that routes any non-OpenAI prefix (deepseek/, anthropic/, groq/, xai/, mistral/, openrouter/, …) through LitellmProvider with the prefix preserved. normalize_model_name no longer adds litellm/ to anything; bare claude-* / gemini-* shorthands expand to anthropic/<model> / gemini/<model> instead of the wrapped form. Wire StrixProvider into warm_up_llm and RunConfig.model_provider. litellm/<provider>/<model> and any-llm/<provider>/<model> still resolve unchanged for users on older config. Refresh stale model names in the env-validation messages and the warm-up hint (gpt-5.4, claude-opus-4-7, deepseek-reasoner). Verified 24-case end-to-end matrix: OpenAI direct vs. LitellmProvider routing, env-var mirroring via validate_environment, supports_reasoning detection, and tool_choice gating all behave correctly across modern providers including the user's unknown DeepSeek SKU.
Drop every hand-rolled provider table and per-model gating that had
accumulated in the model-handling layer:
* normalize_model_name no longer auto-prefixes bare claude-* / gemini-*
names. Users supply the full <provider>/<model> form. The function
became literally model_name.strip(), so callers now inline that and
the function is removed.
* tool_choice="required" is gone everywhere. Thinking-mode endpoints
(Anthropic, DeepSeek /beta) reject it; modern reasoning models don't
need it; non-interactive runs already have
_append_noninteractive_tool_required_message as the convergence
backstop. model_supports_reasoning, model_known_to_registry, and
_model_cost_entry were only used to gate this and follow it out.
* Reasoning(effort=...) is now attached whenever
STRIX_REASONING_EFFORT is non-none. litellm.drop_params=True absorbs
it for non-reasoning models.
* Warm-up's bare-name OpenAI 401 hint is removed (false-positive prone,
relied on substring matching).
* reset_tool_choice on SandboxAgent is no-op now (no tool_choice gets
set) and is removed.
* report/dedupe.py was still routing through stock MultiProvider, so
non-OpenAI configs failed the dedupe LLM pass; switch it to
StrixProvider.
Verified end-to-end against modern provider strings (openai/gpt-5.4,
anthropic/claude-opus-4-7, deepseek/deepseek-reasoner,
gemini/gemini-2.5-pro, groq/, xai/, mistral/, together_ai/, perplexity/,
openrouter/, litellm/ legacy form, and whitespace-padded input): 18/18
cases route correctly, env vars mirror via litellm.validate_environment,
and ModelSettings carries no tool_choice. mypy strict passes.
Warn on bare unknown model names before warm-up. is_known_openai_bare_model consults litellm.model_cost and matches only entries whose litellm_provider == "openai". When the configured STRIX_LLM has no provider prefix, isn't a known OpenAI model, and no LLM_API_BASE is set, show a clear panel pointing the user at the <provider>/<model> form and exit before issuing the doomed request — no more chasing an "Incorrect API key" 401 from OpenAI when the user actually meant deepseek/, anthropic/, etc. Custom-base configs are still allowed through unconfirmed. Disable LiteLLM's message-logging and streaming-logging knobs to cut noise and skip one of the two end-of-stream submit paths. The other path at streaming_handler.py:2206 schedules work on a global ThreadPoolExecutor that loses to atexit shutdown when the interpreter is winding down; the SDK's stream consumer surfaces that as a fatal "cannot schedule new futures after shutdown" RuntimeError even though the actual stream content was already delivered. Catch and swallow that specific RuntimeError in _run_cycle so the scan isn't killed by an upstream end-of-stream logging race.
litellm.suppress_debug_info silences two unsolicited print() calls in LiteLLM core: the "Provider List: https://docs.litellm.ai/docs/providers" banner emitted by get_llm_provider_logic and the "Give Feedback / Get Help" + "If you need to debug this error, use litellm._turn_on_debug()" pair emitted by exception_mapping_utils on every LiteLLM exception. Both are unconditional print() calls, not logger output, so log-level config can't catch them. LiteLLM's own router and proxy_server set the same flag for the same reason.
OpenAI's Responses API rejects tools[i].type="custom" on non-reasoning models like gpt-4o (400 with code=unknown_parameter, param=tools). Strix's SDK-native Filesystem capability registers CustomTool entries by default, so a bare STRIX_LLM=gpt-4o run failed at the first tool invocation even though warm-up (a tool-less call) succeeded. uses_chat_completions_tool_schema now consults litellm.model_cost[<name>].supports_reasoning for OpenAI routes and flips to the chat-completions function-tool schema for models that don't carry the reasoning flag. Same registry-lookup pattern as is_known_openai_bare_model. Non-OpenAI prefixes and configs with LLM_API_BASE are unchanged (still function tools).
The two-bucket design (estimated + observed) shipped in #524 was over-engineered for what the user actually wants (one accurate total) and had a real sum-vs-total inconsistency flagged by Greptile when a provider populates input_tokens/output_tokens but leaves total_tokens unset. Collapse LLMUsageLedger to a single _total_cost. record() still skips the tokens-times-registry estimate for LiteLLM-routed models so we do not double-count the callback, but everything lands in one bucket. to_record() apportions the total to agents by token share, using _resolve_total_tokens (which falls back to input + output when total_tokens is 0). The denominator and per-agent numerators read from the same map, so per-agent shares now sum exactly to the total. The cost_source label is removed. litellm_cost_callback and its private extractor move from a separate report/cost_capture.py module into report/state.py, collapsed into a single self-contained adapter at the bottom of that file. cost_capture is deleted. ReportState gains a public record_observed_llm_cost(cost) method so the callback no longer reaches into report_state._llm_usage as a private attribute. hydrate() loads the persisted cost back into _total_cost with no attempt to bucket it as estimated vs observed, which removes the resume drift where a previously-observed run picked up a "mixed" label on resume.
Closes #504.
Strips the LLM-config layer to a thin shim over the OpenAI Agents SDK + LiteLLM. No hand-rolled provider tables, no shorthand expansion, no per-provider gating.
What changes
LLM_API_KEYis mirrored into the provider-specific env var (DEEPSEEK_API_KEY,ANTHROPIC_API_KEY,TOGETHERAI_API_KEY, ...) vialitellm.validate_environment. No local provider→env-var map.litellm/prefix in user config.StrixProvider(a 5-lineMultiProvidersubclass) routes any non-OpenAI prefix through LiteLLM. Users typedeepseek/deepseek-v4-pro. Oldlitellm/...configs still work.tool_choice="required". Modern thinking models (Anthropic extended thinking, DeepSeek/beta, Bedrock newer Claude) reject it; convergence relies on_finish_tool_use_behavior+_append_noninteractive_tool_required_message(already in place).<provider>/<model>. Bare unknown names route to the SDK's OpenAI default; pre-warm-up panel warns when the bare name isn't a known OpenAI model and exits before the doomed call.Side stability fixes
RuntimeError: cannot schedule new futures after shutdownrace (backgroundsuccess_handler.submitlosing to atexit during interpreter teardown — stream content was already delivered).suppress_debug_info,disable_streaming_logging,turn_off_message_logging.StrixProvider(was stockMultiProvider, broke non-OpenAI configs).