fix: Fix empty content from the DeepSeek V4 Pro reasoning model #218

Open
Windelly wants to merge 5 commits into lintsinghua:v3.0.0 from Windelly:fix/reasoning-content-fallback

@Windelly Windelly commented May 4, 2026

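Root cause analysis

Reasoning models such as DeepSeek V4 Pro put their reasoning process in the reasoning_content field, and the content field may be an empty string. The existing LLM adapter reads only content, so it returns an empty result and the Agent gets no output at all.

Changes

1. New LLMResponse fields (types.py)

  • reasoning_content: Optional[str] - stores the reasoning content
  • reasoning_tokens: int - estimated reasoning token count

2. Non-streaming fallback (litellm_adapter.py)

  • Extract reasoning_content (falling back to __dict__ for compatibility)
  • When content is empty and reasoning_content is non-empty, fall back to reasoning_content

3. Streaming fallback (litellm_adapter.py)

  • Accumulate reasoning_content (extracted from each delta)
  • done event: fall back when content is empty
  • Abnormal end (no finish_reason): same fallback
  • The done event carries reasoning_content and reasoning_tokens

4. Agent base statistics logging (base.py)

  • In the done event handling of stream_llm_call, log reasoning statistics

Compatibility

  • No effect on non-reasoning models (such as GPT-4, Claude): reasoning_content is empty by default, so the fallback does not trigger
  • The new LLMResponse fields have default values, so existing callers are not broken
Root cause:
Reasoning models such as DeepSeek V4 Pro put all of their output in the reasoning_content field,
while the content field is an empty string. The existing code reads only content, so it returns an empty result.

Fix:
- Add reasoning_content and reasoning_tokens fields to LLMResponse
- Non-streaming path: fall back to reasoning_content when content is empty
- Streaming done event: same fallback logic
- Streaming abnormal end (no finish_reason): same fallback logic
- Log reasoning statistics in stream_llm_call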
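
The compatibility claim above can be illustrated with a minimal sketch. It assumes LLMResponse is a dataclass and shows only two of its pre-existing fields; the real class in types.py may be defined differently, and the model names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMResponse:
    # Illustrative subset of the pre-existing fields (assumption).
    content: str
    model: str = ""
    # Fields added by this PR; the defaults keep older call sites working.
    reasoning_content: Optional[str] = None
    reasoning_tokens: int = 0


# A call site written before the change constructs the object unchanged.
legacy = LLMResponse(content="hello", model="some-non-reasoning-model")

# A reasoning-model response can carry the extra data.
with_reasoning = LLMResponse(
    content="",
    model="some-reasoning-model",
    reasoning_content="step 1, then step 2",
    reasoning_tokens=5,
)
```

Because both new fields come after the existing ones and have defaults, positional and keyword construction by existing callers is unaffected in this sketch.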

vercel Bot commented May 4, 2026

@Windelly is attempting to deploy a commit to the tsinghuaiiilove-2257's projects Team on Vercel.

A member of the Team first needs to authorize it.

qodo-free-for-open-source-projects Bot commented May 4, 2026

Review Summary by Qodo

(Agentic_describe updated until commit 3c0f8c0)

Fix DeepSeek V4 Pro reasoning model empty content issue

🐞 Bug fix ✨ Enhancement

Walkthroughs

Description
• Add reasoning_content and reasoning_tokens fields to LLMResponse for reasoning models
• Implement fallback logic to use reasoning_content when content is empty
• Apply fallback in both non-streaming and streaming paths
• Add keepalive events and logging for reasoning model diagnostics
• Reduce log verbosity in code analysis service
Diagram
flowchart LR
  A["LLM Response"] --> B{"Has content?"}
  B -->|Yes| C["Use content"]
  B -->|No| D{"Has reasoning_content?"}
  D -->|Yes| E["Fallback to reasoning_content"]
  D -->|No| F["Return empty"]
  C --> G["LLMResponse with reasoning fields"]
  E --> G
  F --> G
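
The decision in the flowchart can be sketched as a small pure function. The name resolve_content is hypothetical; the whitespace-only check mirrors the `.strip()` tests quoted later in the review.

```python
def resolve_content(content, reasoning_content):
    """Pick the text to expose as the response content (hypothetical helper)."""
    content = content or ""
    reasoning_content = reasoning_content or ""
    if content.strip():
        return content            # normal case: the model produced content
    if reasoning_content.strip():
        return reasoning_content  # fallback for reasoning-only responses
    return content                # both empty: return the empty content as-is
```

Keeping the choice in one function would let the non-streaming path, the streaming done event, and the stream-ended-without-finish_reason path share the same rule.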

File Changes

1. backend/app/services/llm/types.py ✨ Enhancement +2/-0

Add reasoning fields to LLMResponse

• Add reasoning_content: Optional[str] field to store reasoning process
• Add reasoning_tokens: int field with default value 0 for token estimation

2. backend/app/services/llm/adapters/litellm_adapter.py 🐞 Bug fix +61/-5

Implement reasoning content fallback in LLM adapter

• Extract reasoning_content from response message with fallback to __dict__
• Estimate tokens for reasoning content using estimate_tokens utility
• Implement fallback logic: use reasoning_content when content is empty in non-streaming path
• Accumulate reasoning content in streaming path and yield keepalive events
• Apply fallback logic in streaming done event and stream-ended-without-finish_reason cases
• Include reasoning_content and reasoning_tokens in done event chunks
• Add debug logging for reasoning content extraction and fallback events

3. backend/app/services/agent/agents/base.py ✨ Enhancement +9/-0

Add reasoning logging in agent stream handler

• Handle keepalive chunk type to track stream activity
• Extract and log reasoning_content and reasoning_tokens from done events
• Add debug logging for reasoning statistics in stream_llm_call

4. backend/app/services/llm/service.py ✨ Enhancement +2/-2

Reduce log verbosity in code analysis

• Change LLM response logging from info to debug level
• Truncate long responses to first 500 characters in logs
• Reduce log noise for code analysis service


qodo-free-for-open-source-projects Bot commented May 4, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0)

Action required

1. Empty keepalive token spam 🐞 Bug ➹ Performance
Description
LiteLLMAdapter.stream_complete yields a token chunk with empty content for every reasoning-only delta,
and AgentBase.stream_llm_call treats it as a token and emits a thinking_token event. The SSE layer forces a 10 ms sleep
for each thinking_token, so reasoning-only streaming output is slowed significantly when the chunk count is high and may even hit timeouts or stalls.
Code

backend/app/services/llm/adapters/litellm_adapter.py[R398-408]

+                if reasoning:
+                    accumulated_reasoning += reasoning
+                    # Yield a keepalive chunk so downstream knows the stream is active
+                    # (reasoning-only models would otherwise cause a first-token timeout)
+                    if not content:
+                        yield {
+                            "type": "token",
+                            "content": "",
+                            "accumulated": accumulated_content,
+                        }
+
Evidence
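In the reasoning-only case the adapter produces one empty token per chunk; the Agent emits a thinking_token for any token chunk; and the SSE output
sleeps an extra 10 ms per thinking_token (to split packets), so the number of empty tokens converts linearly into added latency.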

backend/app/services/llm/adapters/litellm_adapter.py[392-408]
backend/app/services/agent/agents/base.py[1016-1045]
backend/app/services/agent/agents/base.py[700-709]
backend/app/api/v1/endpoints/agent_tasks.py[1929-1936]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LiteLLMAdapter.stream_complete()` uses `yield {"type": "token", "content": ""}` as a keepalive on reasoning-only chunks. That empty token triggers `emit_thinking_token()` in `AgentBase.stream_llm_call()`, and the SSE layer calls `sleep(0.01)` for every `thinking_token`, so streaming performance drops sharply when there are many reasoning-only chunks.
## Issue Context
The keepalive exists to keep the upstream caller from timing out while waiting after the adapter drops reasoning-only deltas, but each reasoning delta should not be mapped to a `thinking_token` SSE event.
## Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[392-408]
- backend/app/services/agent/agents/base.py[1016-1045]
## Suggested fix approach
- Option A (recommended): make the keepalive chunk its own event type (for example `{"type": "keepalive"}` or `{"type": "token", "keepalive": true}`), and have `stream_llm_call` recognize it and only update `first_token_received/last_activity` without calling `emit_thinking_token`.
- Option B: throttle keepalives (by time interval, for example at most once per 1s) so that a high chunk count cannot produce a flood of empty thinking_token events; keep the interval below `llm_stream_timeout` to avoid timeouts.
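
A dedicated keepalive chunk type (the first suggested approach) might look like the sketch below. Both functions are simplified stand-ins, not the project's `stream_complete` or `stream_llm_call`; the delta tuples and counters are assumptions made for illustration.

```python
import asyncio


async def fake_adapter_stream(deltas):
    """Stand-in adapter: turns (content, reasoning) pairs into chunks."""
    accumulated = ""
    for content, reasoning in deltas:
        if content:
            accumulated += content
            yield {"type": "token", "content": content, "accumulated": accumulated}
        elif reasoning:
            # Separate type: signals activity without carrying a token.
            yield {"type": "keepalive"}
    yield {"type": "done", "content": accumulated}


async def consume(stream):
    """Stand-in agent loop: records UI tokens and activity updates."""
    emitted_tokens = []
    activity_updates = 0
    async for chunk in stream:
        activity_updates += 1  # every chunk refreshes the activity timestamp
        if chunk["type"] == "keepalive":
            continue  # no thinking_token event, so no per-event SSE sleep
        if chunk["type"] == "token":
            emitted_tokens.append(chunk["content"])
    return emitted_tokens, activity_updates


deltas = [("", "think"), ("", "more"), ("Hi", ""), ("!", "")]
tokens, activity = asyncio.run(consume(fake_adapter_stream(deltas)))
```

In this sketch the two reasoning-only deltas still count as activity, but only the two real tokens would reach the UI event path.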



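2. Streaming reasoning causes timeout 🐞 Bug ☼ Reliability
Description
LiteLLMAdapter.stream_complete accumulates delta.reasoning_content but yields a token chunk only when delta.content is non-empty.
For reasoning models that stream only reasoning_content, the downstream stream_llm_call keeps awaiting __anext__ and hits the
first-token/stream timeout even though the upstream is actively producing output.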
Code

backend/app/services/llm/adapters/litellm_adapter.py[R392-403]

            delta = chunk.choices[0].delta
            content = getattr(delta, "content", "") or ""
+                # 🔥 Extract reasoning_content (reasoning models such as DeepSeek V4 Pro)
+                reasoning = getattr(delta, "reasoning_content", "") or ""
            finish_reason = chunk.choices[0].finish_reason
+                if reasoning:
+                    accumulated_reasoning += reasoning
+
            if content:
                accumulated_content += content
                yield {
Evidence
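stream_complete yields only when content is non-empty; reasoning is accumulated but no chunk is emitted for it. The Agent wraps the async generator's
__anext__ in asyncio.wait_for and counts the first token as arrived only on a chunk with type==token, so it still times out while the model keeps streaming reasoning.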

backend/app/services/llm/adapters/litellm_adapter.py[392-407]
backend/app/services/agent/agents/base.py[1013-1025]
backend/app/services/agent/agents/base.py[1019-1024]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LiteLLMAdapter.stream_complete()` accumulates `delta.reasoning_content` but does not `yield` anything unless `delta.content` is non-empty. For reasoning models that stream only `reasoning_content` (with empty `content`), downstream consumers (e.g. `Agent.stream_llm_call`) will block on `__anext__()` and can hit first-token/stream timeouts even though the upstream stream is active.
### Issue Context
`Agent.stream_llm_call` wraps `iterator.__anext__()` in `asyncio.wait_for()` with a shorter timeout before the first token arrives, and it only flips `first_token_received=True` on chunks with `type == "token"`.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[392-407]
- backend/app/services/agent/agents/base.py[1013-1056]
### What to change
Implement one of the following (preferred order):
1) **Emit reasoning deltas as streaming chunks**: when `reasoning` is non-empty, `yield` a chunk so downstream timeouts don’t fire. If you don’t want to expose reasoning to the UI, still emit a heartbeat-like chunk (e.g. `type: "token"` with empty `content` but some marker like `"reasoning_delta"`) and update the agent to treat it as activity without rendering.
2) **Add a new chunk type** (e.g. `"type": "reasoning"`) yielded on reasoning deltas, and update `stream_llm_call` to treat it as activity and set `first_token_received=True` (or otherwise bypass first-token timeout).
Ensure that any yielded chunk updates `last_activity` on the consumer side and prevents `asyncio.TimeoutError` during long reasoning phases.
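
The timeout mechanism described above can be reproduced in isolation. This is a sketch under stated assumptions: the generator, the `"reasoning"` chunk type, and the single 0.2 s timeout are illustrative, whereas the real agent uses separate first-token and stream timeouts.

```python
import asyncio


async def reasoning_then_answer(emit_reasoning_chunks):
    """Hypothetical upstream: three reasoning deltas, then one content token."""
    for _ in range(3):
        await asyncio.sleep(0.1)  # the model is busy reasoning
        if emit_reasoning_chunks:
            yield {"type": "reasoning"}
    yield {"type": "token", "content": "answer"}


async def read_stream(stream, timeout=0.2):
    """Wrap each __anext__() in asyncio.wait_for, as the review describes."""
    iterator = stream.__aiter__()
    seen = []
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout)
        except StopAsyncIteration:
            return seen, False  # stream finished normally
        except asyncio.TimeoutError:
            return seen, True   # no chunk arrived within the timeout
        seen.append(chunk["type"])


silent_seen, silent_timed_out = asyncio.run(read_stream(reasoning_then_answer(False)))
active_seen, active_timed_out = asyncio.run(read_stream(reasoning_then_answer(True)))
```

With reasoning chunks suppressed, the first `__anext__()` needs about 0.3 s and exceeds the timeout; with them emitted, each wait is about 0.1 s and the stream completes.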



3. Wrong token estimate order 🐞 Bug ≡ Correctness
Description
In LiteLLMAdapter.stream_complete(), completion_tokens are estimated from accumulated_content before
potentially falling back to accumulated_reasoning, so reasoning-only responses can be returned with
completion_tokens incorrectly estimated as 0. This breaks token accounting/metrics (and any
budgeting based on them) for streaming responses when the provider doesn’t send usage.
Code

backend/app/services/llm/adapters/litellm_adapter.py[R422-428]

                 }
                 logger.debug(f"Estimated usage: {final_usage}")
-                    # 🔥 ENHANCED: log a warning if the accumulated content is empty but finish_reason is set
-                    if not accumulated_content:
+                    # 🔥 FALLBACK: reasoning models may put all output in reasoning_content, leaving content empty
+                    if not accumulated_content.strip() and accumulated_reasoning.strip():
+                        logger.warning(f"[reasoning-fallback-stream] model={self.config.model}, content empty after {chunk_count} chunks, falling back to reasoning_content ({len(accumulated_reasoning)} chars)")
+                        accumulated_content = accumulated_reasoning
Evidence
The code estimates final_usage using accumulated_content, then later overwrites accumulated_content
via the reasoning fallback. If accumulated_content was empty (reasoning-only model) and final_usage
was missing, output_tokens_estimate becomes 0 even though the returned done.content becomes
non-empty after fallback.

backend/app/services/llm/adapters/litellm_adapter.py[411-444]
backend/app/services/llm/adapters/litellm_adapter.py[447-473]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`stream_complete()` estimates `final_usage` (completion tokens) using `accumulated_content` **before** applying the new reasoning fallback. When `content` is empty but `reasoning_content` exists and usage is not provided by the API, the estimated completion tokens become `0` even though the adapter returns non-empty content (reasoning) in the final `done` chunk.
### Issue Context
This happens in the `if finish_reason:` path: usage estimation precedes the fallback that mutates `accumulated_content`.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[411-444]
### Implementation notes
- Apply the fallback (decide the final text) *before* calling `estimate_tokens()`.
- Alternatively compute `text_for_estimate = accumulated_content if accumulated_content.strip() else accumulated_reasoning` and estimate based on that.
- Ensure `total_tokens` remains consistent with the chosen completion token estimate.
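
The ordering fix in the implementation notes can be sketched as follows. `estimate_tokens` here is a stand-in with an assumed 4-characters-per-token heuristic, not the adapter's actual utility, and `build_usage` is a hypothetical helper name.

```python
def estimate_tokens(text, model=None):
    """Stand-in estimator (assumption): roughly 4 characters per token."""
    return (len(text) + 3) // 4


def build_usage(prompt_tokens, accumulated_content, accumulated_reasoning, model=None):
    # Decide the final text first...
    if accumulated_content.strip():
        final_text = accumulated_content
    else:
        final_text = accumulated_reasoning
    # ...then estimate from it, so reasoning-only responses are not counted as 0.
    completion_tokens = estimate_tokens(final_text, model)
    usage = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return final_text, usage


final_text, usage = build_usage(10, "", "a" * 40)
```

Deriving `total_tokens` from the same `completion_tokens` value keeps the two figures consistent, which is the last point in the notes.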




Remediation recommended

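4. Info log noise reasoning stats 🐞 Bug ◔ Observability
Description
AgentBase.stream_llm_call logs the reasoning_content length and estimated tokens at INFO level in the done event,
which produces steady log noise and added cost on every reasoning-model call. The information duplicates the DEBUG reasoning logs already in litellm_adapter; downgrade it to DEBUG or put it behind a configurable switch.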
Code

backend/app/services/agent/agents/base.py[R1051-1055]

+                        # 🔥 Log reasoning statistics for reasoning models
+                        reasoning = chunk.get("reasoning_content", "")
+                        reasoning_tokens_est = chunk.get("reasoning_tokens", 0)
+                        if reasoning:
+                            logger.info(f"[{self.name}] reasoning_content: {len(reasoning)} chars, ~{reasoning_tokens_est} tokens")
Evidence
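The reasoning statistics log added to the done branch uses INFO level, while the adapter-side reasoning statistics in the same PR are all DEBUG; the levels are inconsistent and this inflates production log volume.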

backend/app/services/agent/agents/base.py[1047-1056]
backend/app/services/llm/adapters/litellm_adapter.py[303-311]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`AgentBase.stream_llm_call()` logs reasoning statistics (reasoning chars / tokens) at INFO level when it receives the done chunk. This produces steady log noise on every reasoning-model call and duplicates the adapter-side DEBUG logs.
## Issue Context
In the same PR, `litellm_adapter` logs reasoning statistics with `logger.debug`, so the base layer should stay consistent.
## Fix Focus Areas
- backend/app/services/agent/agents/base.py[1047-1056]
## Suggested fix approach
- Downgrade `logger.info(...)` to `logger.debug(...)`, or add a configuration switch (for example, output only in debug/trace mode).



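5. Reasoning leaks via INFO logs 🐞 Bug ⛨ Security
Description
The non-streaming fallback replaces response.content with reasoning_content, and LLMService.analyze_code prints the whole
response.content at INFO level. With reasoning models this writes the longer reasoning output to the logs, significantly increasing log volume and widening the exposure of sensitive output.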
Code

backend/app/services/llm/adapters/litellm_adapter.py[R307-315]

+        # 🔥 FALLBACK: reasoning models (DeepSeek V4 Pro, etc.) may put all output in reasoning_content, leaving content empty
+        final_content = choice.message.content or ""
+        if not final_content.strip() and reasoning_content.strip():
+            logger.info(f"[reasoning-fallback] model={response.model}, content empty, falling back to reasoning_content ({len(reasoning_content)} chars)")
+            final_content = reasoning_content
+
    return LLMResponse(
-            content=choice.message.content or "",
+            content=final_content,
        model=response.model,
Evidence
When content is empty the adapter sets final_content to reasoning_content, so LLMResponse.content becomes the reasoning text; analyze_code
unconditionally prints the full content at INFO, so the reasoning text ends up in the logs.

backend/app/services/llm/adapters/litellm_adapter.py[307-320]
backend/app/services/llm/service.py[389-395]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When `content` is blank, the adapter falls back to `reasoning_content` and assigns it to `LLMResponse.content`. `LLMService.analyze_code()` logs the full `content` at INFO, which will now include verbose reasoning output for reasoning models.
### Issue Context
This can dramatically increase log volume and potentially record sensitive model outputs.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[307-320]
- backend/app/services/llm/service.py[392-395]
### What to change
- In `LLMService.analyze_code`, change the full-content log to DEBUG or truncate/redact (e.g., log only length + first N chars).
- Optionally add a guard to avoid logging `reasoning_content`-derived output at INFO (e.g., if `response.reasoning_content` is present and `response.content == response.reasoning_content`, don’t print full text at INFO).
- Keep existing length-only logs if needed for diagnostics.
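
One way to apply the "length + first N chars" suggestion is sketched below. The helper names and logger name are hypothetical; the 500-character limit follows the truncation mentioned in the file-changes summary for service.py.

```python
import logging

logger = logging.getLogger("llm_service_sketch")  # hypothetical logger name


def summarize_for_log(text, limit=500):
    """Return the length plus a bounded preview instead of the full text."""
    text = text or ""
    preview = text[:limit]
    suffix = "..." if len(text) > limit else ""
    return f"{len(text)} chars: {preview}{suffix}"


def log_response(content):
    # DEBUG rather than INFO, so reasoning-derived text stays out of default logs.
    logger.debug("LLM response (%s)", summarize_for_log(content))
```

Passing the summary as a `%s` argument leaves string interpolation to the logging module, which skips it when DEBUG is disabled (the summary itself is still computed at the call site).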



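6. Crude reasoning token estimate 🐞 Bug ⚙ Maintainability
Description
reasoning_tokens uses a hard-coded rough estimate of len(text)//2, while other token estimates in the same file use estimate_tokens(model).
This makes reasoning_tokens inconsistent with usage and the other estimates, so the statistics are unreliable.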
Code

backend/app/services/llm/adapters/litellm_adapter.py[R303-306]

+        reasoning_tokens = len(reasoning_content) // 2 if reasoning_content else 0  # rough estimate
+        if reasoning_content:
+            logger.info(f"[reasoning] model={response.model}, reasoning_chars={len(reasoning_content)}, est_tokens={reasoning_tokens}")
+
Evidence
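Both the non-streaming path and the streaming done event estimate reasoning tokens with len(... )//2, but the streaming usage estimate uses
estimate_tokens(accumulated_content, model), so the same module has two estimation methods.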

backend/app/services/llm/adapters/litellm_adapter.py[296-306]
backend/app/services/llm/adapters/litellm_adapter.py[421-430]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`reasoning_tokens` is currently estimated via `len(reasoning_content) // 2`, which is inconsistent with the existing `estimate_tokens(text, model)` used elsewhere in the adapter.
### Issue Context
This makes `reasoning_tokens` systematically inaccurate and not comparable to other token stats.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[296-306]
- backend/app/services/llm/adapters/litellm_adapter.py[433-436]
- backend/app/services/llm/adapters/litellm_adapter.py[465-467]
### What to change
- Replace `len(reasoning_content)//2` and `len(accumulated_reasoning)//2` with `estimate_tokens(reasoning_content, self.config.model)` / `estimate_tokens(accumulated_reasoning, self.config.model)`.
- Keep the `0` fallback when reasoning is empty.
- Ensure the estimate happens after any fallback that changes which text is considered final.




7. Noisy fallback warning logs 🐞 Bug ◔ Observability
Description
The adapter logs WARNING on the normal successful reasoning-model path (content empty → fallback to
reasoning_content), which can flood logs and reduce warning signal-to-noise. This is likely to
happen frequently for models that routinely return empty content with non-empty reasoning_content.
Code

backend/app/services/llm/adapters/litellm_adapter.py[R303-311]

+        reasoning_tokens = len(reasoning_content) // 2 if reasoning_content else 0  # rough estimate
+        if reasoning_content:
+            logger.info(f"[reasoning] model={response.model}, reasoning_chars={len(reasoning_content)}, est_tokens={reasoning_tokens}")
+
+        # 🔥 FALLBACK: reasoning models (DeepSeek V4 Pro, etc.) may put all output in reasoning_content, leaving content empty
+        final_content = choice.message.content or ""
+        if not final_content.strip() and reasoning_content.strip():
+            logger.warning(f"[reasoning-fallback] model={response.model}, content empty, falling back to reasoning_content ({len(reasoning_content)} chars)")
+            final_content = reasoning_content
Evidence
New logs emit WARNING whenever fallback triggers in both non-stream and stream paths, even though
the request completes successfully and produces output; additionally INFO logs are emitted whenever
reasoning_content exists. These paths are expected to be hit repeatedly for reasoning models that
place output in reasoning_content.

backend/app/services/llm/adapters/litellm_adapter.py[296-311]
backend/app/services/llm/adapters/litellm_adapter.py[425-436]
backend/app/services/agent/agents/base.py[1047-1056]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Fallback from empty `content` to `reasoning_content` is currently logged at WARNING (and reasoning stats at INFO) on successful responses. For reasoning models where this is common, this will generate high-volume WARNING logs and make real warnings harder to spot.
### Issue Context
This affects both non-stream and stream fallback paths, and base agent logging of reasoning stats.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[296-311]
- backend/app/services/llm/adapters/litellm_adapter.py[425-436]
- backend/app/services/agent/agents/base.py[1047-1056]
### Implementation notes
- Change fallback logs to `logger.info` (or `debug`) when it’s an expected model behavior; reserve WARNING for truly anomalous cases (e.g., both `content` and `reasoning_content` empty).
- Consider sampling/rate-limiting reasoning stats logs (`[reasoning]`, `[reasoning-stream]`) to avoid high log volume.
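
The level policy in the implementation notes could be expressed as a small helper. This is a sketch with a hypothetical function name; choosing DEBUG for the expected fallback is one of the two options the note allows (`info` or `debug`).

```python
import logging


def fallback_log_level(content, reasoning_content):
    """Return the level to log at, or None when no fallback happens."""
    has_content = bool((content or "").strip())
    has_reasoning = bool((reasoning_content or "").strip())
    if has_content:
        return None             # normal response, nothing to report
    if has_reasoning:
        return logging.DEBUG    # expected behavior for reasoning models
    return logging.WARNING      # both empty: the anomalous case
```

A caller would then use `logger.log(level, ...)` when the result is not None, which reserves WARNING for responses that produced no usable text.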



Grey Divider

Previous review results

Review updated until commit 3c0f8c0

Results up to commit N/A


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)


Action required
1. Empty keepalive token spam🐞 Bug ➹ Performance
Description
LiteLLMAdapter.stream_complete 在每个 reasoning-only delta 上都会 yield 一个 content 为空的 token
chunk,AgentBase.stream_llm_call 会将其当作 token 并发出 thinking_token 事件。SSE 层对每个 thinking_token 强制 sleep
10ms,导致 reasoning-only 流式输出在高 chunk 数时被显著降速甚至触发超时/卡顿。
Code

backend/app/services/llm/adapters/litellm_adapter.py[R398-408]

+                if reasoning:
+                    accumulated_reasoning += reasoning
+                    # Yield a keepalive chunk so downstream knows the stream is active
+                    # (reasoning-only models would otherwise cause a first-token timeout)
+                    if not content:
+                        yield {
+                            "type": "token",
+                            "content": "",
+                            "accumulated": accumulated_content,
+                        }
+
Evidence
适配器在 reasoning-only 场景会为每个 chunk 产生一个空 token;Agent 侧对任何 token chunk 都会 emit thinking_token;SSE 输出对
thinking_token 额外 sleep 10ms(用于拆包),因此空 token 的数量会线性转化为额外延迟。

backend/app/services/llm/adapters/litellm_adapter.py[392-408]
backend/app/services/agent/agents/base.py[1016-1045]
backend/app/services/agent/agents/base.py[700-709]
backend/app/api/v1/endpoints/agent_tasks.py[1929-1936]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LiteLLMAdapter.stream_complete()` 在 reasoning-only chunk 上通过 `yield {"type": "token", "content": ""}` 做 keepalive。该空 token 会在 `AgentBase.stream_llm_call()` 中触发 `emit_thinking_token()`,而 SSE 层对每个 `thinking_token` 都会 `sleep(0.01)`,导致大量 reasoning-only chunk 时流式性能急剧下降。
## Issue Context
Keepalive 的目的,是防止 adapter 丢弃 reasoning-only delta 后导致上游等待超时;但不应把每个 reasoning delta 都映射成一个 `thinking_token` SSE 事件。
## Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[392-408]
- backend/app/services/agent/agents/base.py[1016-1045]
## Suggested fix approach
- 方案 A(推荐):把 keepalive chunk 改成独立事件类型(例如 `{"type": "keepalive"}` 或 `{"type": "token", "keepalive": true}`),并在 `stream_llm_call` 中识别后仅更新 `first_token_received/last_activity`,不要调用 `emit_thinking_token`。
- 方案 B:对 keepalive 做节流(按时间间隔,例如每 1s 最多一次),确保不会因 chunk 数量过多产生大量空 thinking_token;同时保持间隔小于 `llm_stream_timeout` 以避免超时。

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Streaming reasoning causes timeout🐞 Bug ☼ Reliability
Description
LiteLLMAdapter.stream_complete 会累积 delta.reasoning_content,但只有在 delta.content 非空时才 yield token
chunk。对于仅流式输出 reasoning_content 的推理模型,下游 stream_llm_call 会一直 await __anext__ 并触发首
token/流式超时,即使上游实际在持续输出。
Code

backend/app/services/llm/adapters/litellm_adapter.py[R392-403]

             delta = chunk.choices[0].delta
             content = getattr(delta, "content", "") or ""
+                # 🔥 提取 reasoning_content(推理模型如 DeepSeek V4 Pro)
+                reasoning = getattr(delta, "reasoning_content", "") or ""
             finish_reason = chunk.choices[0].finish_reason
+                if reasoning:
+                    accumulated_reasoning += reasoning
+
             if content:
                 accumulated_content += content
                 yield {
Evidence
stream_complete 中仅在 content 非空时 yield,reasoning 只会被累积不对外发出任何 chunk;而 Agent 侧对 async generator 的
__anext__ 使用 asyncio.wait_for 并依赖收到 type==token 才算“首 token 到达”,因此会在模型持续输出 reasoning 时仍然超时。

backend/app/services/llm/adapters/litellm_adapter.py[392-407]
backend/app/services/agent/agents/base.py[1013-1025]
backend/app/services/agent/agents/base.py[1019-1024]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`LiteLLMAdapter.stream_complete()` accumulates `delta.reasoning_content` but does not `yield` anything unless `delta.content` is non-empty. For reasoning models that stream only `reasoning_content` (with empty `content`), downstream consumers (e.g. `Agent.stream_llm_call`) will block on `__anext__()` and can hit first-token/stream timeouts even though the upstream stream is active.
### Issue Context
`Agent.stream_llm_call` wraps `iterator.__anext__()` in `asyncio.wait_for()` with a shorter timeout before the first token arrives, and it only flips `first_token_received=True` on chunks with `type == "token"`.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[392-407]
- backend/app/services/agent/agents/base.py[1013-1056]
### What to change
Implement one of the following (preferred order):
1) **Emit reasoning deltas as streaming chunks**: when `reasoning` is non-empty, `yield` a chunk so downstream timeouts don’t fire. If you don’t want to expose reasoning to the UI, still emit a heartbeat-like chunk (e.g. `type: "token"` with empty `content` but some marker like `"reasoning_delta"`) and update the agent to treat it as activity without rendering.
2) **Add a new chunk type** (e.g. `"type": "reasoning"`) yielded on reasoning deltas, and update `stream_llm_call` to treat it as activity and set `first_token_received=True` (or otherwise bypass first-token timeout).
Ensure that any yielded chunk updates `last_activity` on the consumer side and prevents `asyncio.TimeoutError` during long reasoning phases.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Wrong token estimate order🐞 Bug ≡ Correctness
Description
In LiteLLMAdapter.stream_complete(), completion_tokens are estimated from accumulated_content before
potentially falling back to accumulated_reasoning, so reasoning-only responses can be returned with
completion_tokens incorrectly estimated as 0. This breaks token accounting/metrics (and any
budgeting based on them) for streaming responses when the provider doesn’t send usage.
Code

backend/app/services/llm/adapters/litellm_adapter.py[R422-428]

                  }
                  logger.debug(f"Estimated usage: {final_usage}")
-                    # 🔥 ENHANCED: 如果累积内容为空但有 finish_reason,记录警告
-                    if not accumulated_content:
+                    # 🔥 FALLBACK: 推理模型可能将全部输出放在 reasoning_content,content 为空
+                    if not accumulated_content.strip() and accumulated_reasoning.strip():
+                        logger.warning(f"[reasoning-fallback-stream] model={self.config.model}, content empty after {chunk_count} chunks, falling back to reasoning_content ({len(accumulated_reasoning)} chars)")
+                        accumulated_content = accumulated_reasoning
Evidence
The code estimates final_usage using accumulated_content, then later overwrites accumulated_content
via the reasoning fallback. If accumulated_content was empty (reasoning-only model) and final_usage
was missing, output_tokens_estimate becomes 0 even though the returned done.content becomes
non-empty after fallback.

backend/app/services/llm/adapters/litellm_adapter.py[411-444]
backend/app/services/llm/adapters/litellm_adapter.py[447-473]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`stream_complete()` estimates `final_usage` (completion tokens) using `accumulated_content` **before** applying the new reasoning fallback. When `content` is empty but `reasoning_content` exists and usage is not provided by the API, the estimated completion tokens become `0` even though the adapter returns non-empty content (reasoning) in the final `done` chunk.
### Issue Context
This happens in the `if finish_reason:` path: usage estimation precedes the fallback that mutates `accumulated_content`.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[411-444]
### Implementation notes
- Apply the fallback (decide the final text) *before* calling `estimate_tokens()`.
- Alternatively compute `text_for_estimate = accumulated_content if accumulated_content.strip() else accumulated_reasoning` and estimate based on that.
- Ensure `total_tokens` remains consistent with the chosen completion token estimate.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended
4. Info log noise reasoning stats🐞 Bug ◔ Observability
Description
AgentBase.stream_llm_call 在 done 事件里用 INFO 级别记录 reasoning_content 长度与估算
tokens,会在所有推理模型调用中产生稳定日志噪音并增加成本。该信息与 litellm_adapter 内已有的 DEBUG reasoning 日志重复,建��降级为 DEBUG 或可配置开关。
Code

backend/app/services/agent/agents/base.py[R1051-1055]

+                        # 🔥 记录推理模型的 reasoning 统计
+                        reasoning = chunk.get("reasoning_content", "")
+                        reasoning_tokens_est = chunk.get("reasoning_tokens", 0)
+                        if reasoning:
+                            logger.info(f"[{self.name}] reasoning_content: {len(reasoning)} chars, ~{reasoning_tokens_est} tokens")
Evidence
done 分支新增的 reasoning 统计日志使用 INFO 级别;而同 PR 中适配器侧 reasoning 统计均为 DEBUG,存在级别不一致且会放大生产日志量。

backend/app/services/agent/agents/base.py[1047-1056]
backend/app/services/llm/adapters/litellm_adapter.py[303-311]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`AgentBase.stream_llm_call()` 在收到 done chunk 时以 INFO 级别输出推理统计信息(reasoning chars / tokens)。这会对推理模型的每次调用都产生稳定日志噪音,并与 adapter 侧 DEBUG 日志重复。
## Issue Context
同一个 PR 中 `litellm_adapter` 对 reasoning 的统计使用的是 `logger.debug`,因此 base 层建议保持一致。
## Fix Focus Areas
- backend/app/services/agent/agents/base.py[1047-1056]
## Suggested fix approach
- 将 `logger.info(...)` 降级为 `logger.debug(...)`;或增加配置开关(例如仅在 debug/trace 模式输出)。

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Reasoning leaks via INFO logs🐞 Bug ⛨ Security
Description
非流式 fallback 会把 response.content 替换为 reasoning_content;而 LLMService.analyze_code 会在 INFO 级别整段打印
response.content。使用推理模型时这会把更长的推理输出写入日志,显著放大日志体积并扩大敏感输出暴露面。
Code

backend/app/services/llm/adapters/litellm_adapter.py[R307-315]

+        # 🔥 FALLBACK: 推理模型(DeepSeek V4 Pro 等)可能将全部输出放在 reasoning_content,content 为空
+        final_content = choice.message.content or ""
+        if not final_content.strip() and reasoning_content.strip():
+            logger.info(f"[reasoning-fallback] model={response.model}, content empty, falling back to reasoning_content ({len(reasoning_content)} chars)")
+            final_content = reasoning_content
+
     return LLMResponse(
-            content=choice.message.content or "",
+            content=final_content,
         model=response.model,
Evidence
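When content is empty, the adapter sets final_content to reasoning_content, so LLMResponse.content becomes the reasoning text. Meanwhile, analyze_code
unconditionally prints the full content at INFO, which writes the reasoning text to the logs.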

backend/app/services/llm/adapters/litellm_adapter.py[307-320]
backend/app/services/llm/service.py[389-395]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
When `content` is blank, the adapter falls back to `reasoning_content` and assigns it to `LLMResponse.content`. `LLMService.analyze_code()` logs the full `content` at INFO, which will now include verbose reasoning output for reasoning models.
### Issue Context
This can dramatically increase log volume and potentially record sensitive model outputs.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[307-320]
- backend/app/services/llm/service.py[392-395]
### What to change
- In `LLMService.analyze_code`, change the full-content log to DEBUG or truncate/redact (e.g., log only length + first N chars).
- Optionally add a guard to avoid logging `reasoning_content`-derived output at INFO (e.g., if `response.reasoning_content` is present and `response.content == response.reasoning_content`, don’t print full text at INFO).
- Keep existing length-only logs if needed for diagnostics.
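
The truncate/redact option could look roughly like the sketch below. `summarize_for_log` and `PREVIEW_CHARS` are hypothetical names introduced for illustration, and the 200-character preview length is an assumption, not a value from the PR.

```python
from typing import Optional

PREVIEW_CHARS = 200  # assumed preview length for INFO-level logs


def summarize_for_log(
    content: str,
    reasoning_content: Optional[str] = None,
    preview_chars: int = PREVIEW_CHARS,
) -> str:
    """Return a short, log-safe summary instead of the full response text."""
    # If content was produced by the reasoning fallback, log the length only.
    from_reasoning = bool(reasoning_content) and content == reasoning_content
    if from_reasoning:
        return f"length={len(content)} (reasoning fallback, text omitted)"
    preview = content[:preview_chars]
    suffix = "..." if len(content) > preview_chars else ""
    return f"length={len(content)} preview={preview!r}{suffix}"
```

A caller would then log `summarize_for_log(response.content, response.reasoning_content)` at INFO and keep the full text for DEBUG only.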



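6. Crude reasoning token estimate 🐞 Bug ⚙ Maintainability
Description
reasoning_tokens uses a hard-coded rough estimate of len(text)//2, while other token estimates in the same file use estimate_tokens(text, model). This makes
reasoning_tokens inconsistent with usage and the other estimates, so the statistics are unreliable.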
Code

backend/app/services/llm/adapters/litellm_adapter.py[R303-306]

+        reasoning_tokens = len(reasoning_content) // 2 if reasoning_content else 0  # rough estimate
+        if reasoning_content:
+            logger.info(f"[reasoning] model={response.model}, reasoning_chars={len(reasoning_content)}, est_tokens={reasoning_tokens}")
+
Evidence
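Both the non-streaming path and the streaming done event estimate reasoning tokens with len(...)//2, whereas the streaming usage estimate uses
estimate_tokens(accumulated_content, model), so the same module has two inconsistent estimation methods.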

backend/app/services/llm/adapters/litellm_adapter.py[296-306]
backend/app/services/llm/adapters/litellm_adapter.py[421-430]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`reasoning_tokens` is currently estimated via `len(reasoning_content) // 2`, which is inconsistent with the existing `estimate_tokens(text, model)` used elsewhere in the adapter.
### Issue Context
This makes `reasoning_tokens` systematically inaccurate and not comparable to other token stats.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[296-306]
- backend/app/services/llm/adapters/litellm_adapter.py[433-436]
- backend/app/services/llm/adapters/litellm_adapter.py[465-467]
### What to change
- Replace `len(reasoning_content)//2` and `len(accumulated_reasoning)//2` with `estimate_tokens(reasoning_content, self.config.model)` / `estimate_tokens(accumulated_reasoning, self.config.model)`.
- Keep the `0` fallback when reasoning is empty.
- Ensure the estimate happens after any fallback that changes which text is considered final.
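
As a rough illustration of the replacement, the sketch below routes the reasoning estimate through a single estimator and keeps the `0` fallback. `estimate_tokens_stub` is a hypothetical stand-in for the adapter's `estimate_tokens(text, model)`; its 4-characters-per-token heuristic is an assumption, and the real function may behave differently.

```python
from typing import Callable


def estimate_tokens_stub(text: str, model: str) -> int:
    """Placeholder estimator (assumption): about 4 characters per token."""
    return (len(text) + 3) // 4  # ceiling of len(text) / 4


def reasoning_token_estimate(
    reasoning_content: str,
    model: str,
    estimator: Callable[[str, str], int] = estimate_tokens_stub,
) -> int:
    """Estimate reasoning tokens with the same estimator used for usage stats."""
    if not reasoning_content:
        return 0  # keep the 0 fallback when there is no reasoning text
    return estimator(reasoning_content, model)
```

Passing the project's own estimator as `estimator` keeps reasoning and usage statistics on one method, so the numbers stay comparable.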




7. Noisy fallback warning logs 🐞 Bug ◔ Observability
Description
The adapter logs WARNING on the normal successful reasoning-model path (content empty → fallback to
reasoning_content), which can flood logs and reduce warning signal-to-noise. This is likely to
happen frequently for models that routinely return empty content with non-empty reasoning_content.
Code

backend/app/services/llm/adapters/litellm_adapter.py[R303-311]

+        reasoning_tokens = len(reasoning_content) // 2 if reasoning_content else 0  # rough estimate
+        if reasoning_content:
+            logger.info(f"[reasoning] model={response.model}, reasoning_chars={len(reasoning_content)}, est_tokens={reasoning_tokens}")
+
+        # 🔥 FALLBACK: reasoning models (DeepSeek V4 Pro, etc.) may put all output in reasoning_content and leave content empty
+        final_content = choice.message.content or ""
+        if not final_content.strip() and reasoning_content.strip():
+            logger.warning(f"[reasoning-fallback] model={response.model}, content empty, falling back to reasoning_content ({len(reasoning_content)} chars)")
+            final_content = reasoning_content
Evidence
New logs emit WARNING whenever fallback triggers in both non-stream and stream paths, even though
the request completes successfully and produces output; additionally INFO logs are emitted whenever
reasoning_content exists. These paths are expected to be hit repeatedly for reasoning models that
place output in reasoning_content.

backend/app/services/llm/adapters/litellm_adapter.py[296-311]
backend/app/services/llm/adapters/litellm_adapter.py[425-436]
backend/app/services/agent/agents/base.py[1047-1056]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Fallback from empty `content` to `reasoning_content` is currently logged at WARNING (and reasoning stats at INFO) on successful responses. For reasoning models where this is common, this will generate high-volume WARNING logs and make real warnings harder to spot.
### Issue Context
This affects both non-stream and stream fallback paths, and base agent logging of reasoning stats.
### Fix Focus Areas
- backend/app/services/llm/adapters/litellm_adapter.py[296-311]
- backend/app/services/llm/adapters/litellm_adapter.py[425-436]
- backend/app/services/agent/agents/base.py[1047-1056]
### Implementation notes
- Change fallback logs to `logger.info` (or `debug`) when it’s an expected model behavior; reserve WARNING for truly anomalous cases (e.g., both `content` and `reasoning_content` empty).
- Consider sampling/rate-limiting reasoning stats logs (`[reasoning]`, `[reasoning-stream]`) to avoid high log volume.
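
One way to read the first note is sketched below: the expected fallback logs at INFO, and WARNING is reserved for the case where both fields are empty. `resolve_final_content` and the logger name are hypothetical, and returning the log level alongside the text is only a convenience for illustration.

```python
import logging
from typing import Optional, Tuple

logger = logging.getLogger("llm.adapter")  # assumed logger name


def resolve_final_content(
    content: Optional[str],
    reasoning_content: Optional[str],
    model: str,
) -> Tuple[str, int]:
    """Return (final_content, level_logged); level 0 means nothing was logged."""
    content = content or ""
    reasoning_content = reasoning_content or ""
    if content.strip():
        return content, 0
    if reasoning_content.strip():
        # Expected behaviour for reasoning models: not a warning.
        logger.info(
            "[reasoning-fallback] model=%s, content empty, using reasoning_content (%d chars)",
            model,
            len(reasoning_content),
        )
        return reasoning_content, logging.INFO
    # Nothing usable came back: this is the anomalous case.
    logger.warning(
        "[reasoning-fallback] model=%s, both content and reasoning_content are empty",
        model,
    )
    return "", logging.WARNING
```

Token estimation would then run on the returned text, so the estimate reflects the content after any fallback.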




Comment thread backend/app/services/llm/adapters/litellm_adapter.py Outdated
1. Token estimation order: move reasoning fallback before token estimation
   so estimates are based on final content (after fallback)
2. Reduce log noise: change 3 reasoning-fallback logger.warning to logger.info
@Windelly

Windelly commented May 5, 2026

Author

@CodiumAI-Agent review

@Windelly

Windelly commented May 5, 2026

Author

Closing and reopening to trigger Qodo re-review after addressing feedback.

@Windelly Windelly closed this May 5, 2026
@Windelly Windelly reopened this May 5, 2026
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 5, 2026


Persistent review updated to latest commit daf0fad

Comment thread backend/app/services/llm/adapters/litellm_adapter.py
@Windelly

Windelly commented May 5, 2026

Author

Round 2 fixes pushed. Reopening for Qodo re-review.

@Windelly Windelly closed this May 5, 2026
@Windelly Windelly reopened this May 5, 2026
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 5, 2026


Persistent review updated to latest commit da283e1

Comment thread backend/app/services/llm/adapters/litellm_adapter.py
@Windelly

Windelly commented May 5, 2026

Author

Round 3 fixes pushed. Reopening for Qodo.

@Windelly Windelly closed this May 5, 2026
@Windelly Windelly reopened this May 5, 2026
@qodo-free-for-open-source-projects

qodo-free-for-open-source-projects Bot commented May 5, 2026


Persistent review updated to latest commit 3c0f8c0
