feat(server): support ROCM for /api/usage endpoint #9773
All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
Pull request overview
This PR extends the server /api/usage endpoint GPU reporting to support AMD GPUs via ROCm (rocm-smi), alongside the existing NVIDIA (nvidia-smi) implementation, so the frontend can show GPU memory usage on ROCm systems.
Changes:
- Detects whether NVIDIA or ROCm tooling is available and selects the appropriate stats collector.
- Refactors GPU collection into dedicated parsing helpers for nvidia-smi and rocm-smi.
- Adds a ROCm command definition for fetching VRAM stats via CSV output.
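The detection step described in the list above can be sketched roughly as follows. This is a minimal illustration, assuming the tools are looked up on `PATH` with `shutil.which`. The function name `detect_gpu_tool` and its injectable `which` parameter are hypothetical, and the PR's own `_is_gpu_available()` may detect the tools differently.

```python
import shutil
from typing import Callable, Optional


def detect_gpu_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return "nvidia", "rocm", or None depending on which CLI is on PATH.

    Hypothetical helper: NVIDIA is checked first, so a machine with both
    tools installed reports "nvidia".
    """
    if which("nvidia-smi") is not None:
        return "nvidia"
    if which("rocm-smi") is not None:
        return "rocm"
    return None
```

Passing `which` as a parameter keeps the helper testable without either tool installed; a caller can then branch on the returned string to pick the matching parser.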
used_str = "0"
total = int(total_str)
used = int(used_str)
free = total - used
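The fragment above converts string fields to integers and derives free memory from total and used. A minimal sketch of that pattern for a single CSV line is shown here, assuming four comma-separated fields in the order index, name, total, used, and assuming `[N/A]` marks an unavailable memory value. The function name `parse_gpu_csv_line` and the output dict keys are hypothetical, not taken from the PR.

```python
from typing import Dict, Union

GpuStats = Dict[str, Union[int, str]]


def parse_gpu_csv_line(line: str) -> GpuStats:
    """Parse one "index, name, total, used" CSV line into a stats dict.

    Assumes the device name itself contains no commas.
    """
    index_str, name, total_str, used_str = [p.strip() for p in line.split(",")]
    # Some devices report "[N/A]" for memory fields; treat those as 0.
    if total_str == "[N/A]":
        total_str = "0"
    if used_str == "[N/A]":
        used_str = "0"
    total = int(total_str)
    used = int(used_str)
    free = total - used
    return {
        "index": int(index_str),
        "name": name,
        "total": total,
        "used": used,
        "free": free,
    }
```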
gpu_available = _is_gpu_available()
if gpu_available == "nvidia":
    gpu_stats = _parse_nvidia_smi_stats()
elif gpu_available == "rocm":
    gpu_stats = _parse_rocm_smi_stats()
4547db8 to 702d180
assert response.json()["gpu"] == []


def test_usage_rocm_gpu(client: TestClient) -> None:
I would greatly appreciate example nvidia-smi output so I can add a test for it.
@mscolnick I have started the AI code review. It will take a few minutes to complete.
1 issue found across 2 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="marimo/_server/api/endpoints/health.py">
<violation number="1" location="marimo/_server/api/endpoints/health.py:452">
P2: ROCm GPU stats silently report all-zero memory when expected JSON keys are missing</violation>
</file>
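One way to address the issue flagged above is to treat a missing key as "no data" instead of defaulting to zero. The sketch below is illustrative only: the key names `VRAM Total Memory (B)` and `VRAM Total Used Memory (B)` are assumptions about the rocm-smi JSON output and should be checked against real output, and the function name `card_memory` is hypothetical.

```python
from typing import Dict, Optional

# Assumed rocm-smi JSON key names; verify against real output.
TOTAL_KEY = "VRAM Total Memory (B)"
USED_KEY = "VRAM Total Used Memory (B)"


def card_memory(info: Dict[str, str]) -> Optional[Dict[str, int]]:
    """Return memory stats for one card, or None if data is missing or invalid."""
    try:
        total = int(info[TOTAL_KEY])
        used = int(info[USED_KEY])
    except (KeyError, TypeError, ValueError):
        # Missing key or non-numeric value: report nothing rather than zeros.
        return None
    return {"total": total, "used": used, "free": total - used}
```

A caller could skip cards for which this returns `None` (and log a warning), so the frontend never displays a GPU with 0 bytes of memory because of a renamed key.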
Architecture diagram
sequenceDiagram
participant Client as Client
participant HealthAPI as /api/usage endpoint
participant GpuDetect as _is_gpu_available()
participant NvidiaSMI as nvidia-smi process
participant RocmSMI as rocm-smi process
participant GpuParser as GPU stats parser
Note over Client,GpuParser: GET /api/usage with ROCM GPU support
Client->>HealthAPI: GET /api/usage
HealthAPI->>HealthAPI: Collect CPU, memory, network stats
HealthAPI->>GpuDetect: Check GPU availability
alt No GPU tools found
GpuDetect-->>HealthAPI: return False
HealthAPI-->>Client: GPU stats = []
else NVIDIA GPU detected
GpuDetect->>NvidiaSMI: subprocess.run(_NVIDIA_GPU_STATS_CMD)
NvidiaSMI-->>GpuDetect: CSV stdout
GpuDetect-->>HealthAPI: return "nvidia"
HealthAPI->>GpuParser: _parse_nvidia_smi_stats()
GpuParser->>GpuParser: Parse CSV lines, handle [N/A]
GpuParser-->>HealthAPI: list of GPU dicts
else AMD ROCM GPU detected
GpuDetect->>RocmSMI: subprocess.run(_AMD_GPU_STATS_CMD)
RocmSMI-->>GpuDetect: JSON stdout (with possible WARNING prefix)
GpuDetect-->>HealthAPI: return "rocm"
HealthAPI->>GpuParser: _parse_rocm_smi_stats()
GpuParser->>GpuParser: Strip warning lines, parse JSON
Note over GpuParser: Extract card#, Card Series, VRAM bytes
alt JSON parse error
GpuParser-->>HealthAPI: return []
else Success
GpuParser-->>HealthAPI: list of GPU dicts
end
end
alt GPU process failure
HealthAPI->>HealthAPI: Log warning, continue
end
HealthAPI-->>Client: JSON response with GPU stats
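The "Strip warning lines, parse JSON" step in the diagram can be sketched as follows. This is a minimal illustration, assuming any warning text appears on its own lines before the JSON object and that the JSON starts on a line beginning with `{`. The function name `load_json_after_warnings` is hypothetical and not taken from the PR.

```python
import json
from typing import Any, Dict


def load_json_after_warnings(stdout: str) -> Dict[str, Any]:
    """Parse a JSON object from output that may start with warning lines.

    Returns an empty dict when no JSON object can be parsed.
    """
    lines = stdout.splitlines()
    # Find the first line that looks like the start of a JSON object.
    start = next(
        (i for i, line in enumerate(lines) if line.lstrip().startswith("{")),
        None,
    )
    if start is None:
        return {}
    try:
        data = json.loads("\n".join(lines[start:]))
    except json.JSONDecodeError:
        return {}
    return data if isinstance(data, dict) else {}
```

An empty result here maps naturally onto the diagram's "JSON parse error" branch, where the parser returns an empty GPU list.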
@Set27 looks good. Some CI tests are failing.
1eba709 to 3c94bcc
I can't reproduce the CLI failure when running the same command locally; I guess I ran into a transient failure, so I rebased onto the latest main.
Thank you for this feature, @Set27!
📝 Summary
Add AMD GPU stats support.
#9237
📋 Pre-Review Checklist
- [ ] For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).
✅ Merge Checklist