
feat(server): support ROCM for /api/usage endpoint#9773

Merged
mscolnick merged 5 commits into
marimo-team:mainfrom
Set27:add-support-to-show-rocm-stats
Jun 11, 2026

Conversation

@Set27

@Set27 Set27 commented Jun 3, 2026

Contributor

I have read the CLA Document and I hereby sign the CLA

📝 Summary

Add support for AMD GPU stats
#9237

📋 Pre-Review Checklist

- [ ] For large changes, or changes that affect the public API: this change was discussed or approved through an issue, on Discord, or the community discussions (Please provide a link if applicable).

  • Any AI generated code has been reviewed line-by-line by the human PR author, who stands by it.
  • Video or media evidence is provided for any visual changes (optional).
image

✅ Merge Checklist

  • I have read the contributor guidelines.
  • Tests have been added for the changes made.
  • [not sure if any] Documentation has been updated where applicable, including docstrings for API changes.
@vercel

vercel Bot commented Jun 3, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Jun 11, 2026 3:21pm


@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@Set27

Set27 commented Jun 3, 2026

Contributor Author

I have read the CLA Document and I hereby sign the CLA

@Set27 Set27 marked this pull request as draft June 3, 2026 09:52
@mscolnick mscolnick requested a review from Copilot June 3, 2026 12:59

Copilot AI left a comment

Contributor


Pull request overview

This PR extends the server /api/usage endpoint GPU reporting to support AMD GPUs via ROCm (rocm-smi), alongside the existing NVIDIA (nvidia-smi) implementation, so the frontend can show GPU memory usage on ROCm systems.

Changes:

  • Detects whether NVIDIA or ROCm tooling is available and selects the appropriate stats collector.
  • Refactors GPU collection into dedicated parsing helpers for nvidia-smi and rocm-smi.
  • Adds a ROCm command definition for fetching VRAM stats via CSV output.
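The detection step summarized above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the function name `detect_gpu_tool` and the use of `shutil.which` are assumptions made here, and the actual `_is_gpu_available()` helper may probe the tools differently (for example by running them).

```python
import shutil
from typing import Callable, Optional


def detect_gpu_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return "nvidia", "rocm", or None based on which CLI is on PATH.

    Hypothetical helper: `which` is injectable so the lookup can be
    replaced in tests without touching the real PATH.
    """
    # Prefer NVIDIA tooling when both are installed (an arbitrary choice here).
    if which("nvidia-smi") is not None:
        return "nvidia"
    if which("rocm-smi") is not None:
        return "rocm"
    return None
```

Passing the lookup function as a parameter keeps the sketch testable on machines that have neither tool installed.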
Comment thread marimo/_server/api/endpoints/health.py
Comment thread marimo/_server/api/endpoints/health.py Outdated
Comment thread marimo/_server/api/endpoints/health.py Outdated
used_str = "0"
total = int(total_str)
used = int(used_str)
free = total - used
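For context, the fallback in the snippet above (treating an unavailable "used" value as `"0"`) might look like this when isolated into a small helper. This is a sketch under assumptions: the name `parse_memory_fields` is hypothetical, and it assumes `nvidia-smi` reports unsupported fields as the literal string `[N/A]`, as the review discussion suggests.

```python
from typing import Optional


def parse_memory_fields(total_str: str, used_str: str) -> Optional[dict[str, int]]:
    """Convert two raw memory fields into total/used/free integers.

    Hypothetical helper: returns None when the total is unavailable,
    and falls back to used = 0 when only the used value is unavailable.
    """

    def to_int(value: str) -> Optional[int]:
        value = value.strip()
        if value == "[N/A]":
            return None
        try:
            return int(value)
        except ValueError:
            return None

    total = to_int(total_str)
    if total is None:
        # Without a total there is nothing meaningful to report.
        return None
    used = to_int(used_str)
    if used is None:
        used = 0  # mirrors the used_str = "0" fallback
    return {"total": total, "used": used, "free": total - used}
```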
Comment on lines +279 to +283
gpu_available = _is_gpu_available()
if gpu_available == "nvidia":
    gpu_stats = _parse_nvidia_smi_stats()
elif gpu_available == "rocm":
    gpu_stats = _parse_rocm_smi_stats()
@Set27

Set27 commented Jun 3, 2026

Contributor Author
  • Rewrite using json output instead of csv
  • Fix current test
  • Add new test
assert response.json()["gpu"] == []


def test_usage_rocm_gpu(client: TestClient) -> None:

Contributor Author


I would highly appreciate an example of nvidia-smi output so that I can add a test for it.

@mscolnick

Contributor
@cubic-dev-ai

cubic-dev-ai Bot commented Jun 10, 2026

Contributor

@cubic-dev-ai

@mscolnick I have started the AI code review. It will take a few minutes to complete.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


1 issue found across 2 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="marimo/_server/api/endpoints/health.py">

<violation number="1" location="marimo/_server/api/endpoints/health.py:452">
P2: ROCm GPU stats silently report all-zero memory when expected JSON keys are missing</violation>
</file>
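One way to address the issue flagged above is to skip a card whose memory keys are missing rather than default them to zero. The sketch below is illustrative only: `parse_rocm_json` is a hypothetical name, and the two key strings are assumptions about the `rocm-smi` JSON output that may differ between ROCm versions.

```python
import json
from typing import Any

# Assumed key names; verify against the rocm-smi version in use.
TOTAL_KEY = "VRAM Total Memory (B)"
USED_KEY = "VRAM Total Used Memory (B)"


def parse_rocm_json(payload: str) -> list[dict[str, Any]]:
    """Parse per-card VRAM stats, skipping cards with missing or bad values."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []

    gpus: list[dict[str, Any]] = []
    for card, info in data.items():
        # Skip entries that lack the expected keys instead of reporting zeros.
        if not isinstance(info, dict) or TOTAL_KEY not in info or USED_KEY not in info:
            continue
        try:
            total = int(info[TOTAL_KEY])
            used = int(info[USED_KEY])
        except (TypeError, ValueError):
            continue
        gpus.append({"name": card, "total": total, "used": used, "free": total - used})
    return gpus
```

Dropping the entry makes a schema mismatch visible as a missing GPU rather than as a GPU with zero memory; logging a warning at the `continue` points would be a reasonable addition.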
Architecture diagram
sequenceDiagram
    participant Client as Client
    participant HealthAPI as /api/usage endpoint
    participant GpuDetect as _is_gpu_available()  
    participant NvidiaSMI as nvidia-smi process
    participant RocmSMI as rocm-smi process
    participant GpuParser as GPU stats parser
    
    Note over Client,GpuParser: GET /api/usage with ROCM GPU support
    
    Client->>HealthAPI: GET /api/usage
    HealthAPI->>HealthAPI: Collect CPU, memory, network stats
    
    HealthAPI->>GpuDetect: Check GPU availability
    alt No GPU tools found
        GpuDetect-->>HealthAPI: return False
        HealthAPI-->>Client: GPU stats = []
    else NVIDIA GPU detected
        GpuDetect->>NvidiaSMI: subprocess.run(_NVIDIA_GPU_STATS_CMD)
        NvidiaSMI-->>GpuDetect: CSV stdout
        GpuDetect-->>HealthAPI: return "nvidia"
        HealthAPI->>GpuParser: _parse_nvidia_smi_stats()
        GpuParser->>GpuParser: Parse CSV lines, handle [N/A]
        GpuParser-->>HealthAPI: list of GPU dicts
    else AMD ROCM GPU detected
        GpuDetect->>RocmSMI: subprocess.run(_AMD_GPU_STATS_CMD)
        RocmSMI-->>GpuDetect: JSON stdout (with possible WARNING prefix)
        GpuDetect-->>HealthAPI: return "rocm"
        HealthAPI->>GpuParser: _parse_rocm_smi_stats()
        GpuParser->>GpuParser: Strip warning lines, parse JSON
        Note over GpuParser: Extract card#, Card Series, VRAM bytes
        alt JSON parse error
            GpuParser-->>HealthAPI: return []
        else Success
            GpuParser-->>HealthAPI: list of GPU dicts
        end
    end
    
    alt GPU process failure
        HealthAPI->>HealthAPI: Log warning, continue
    end
    
    HealthAPI-->>Client: JSON response with GPU stats
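The "Strip warning lines, parse JSON" step in the diagram could be sketched as below. This is a minimal illustration, not the PR's implementation: `load_json_after_warnings` is a hypothetical name, and it assumes the JSON document is an object that starts on its own line after any warning text.

```python
import json
from typing import Any


def load_json_after_warnings(stdout: str) -> Any:
    """Parse the JSON object that follows any leading non-JSON lines.

    Assumes the tool prints a JSON object (starting with "{") after
    zero or more warning lines. Raises ValueError if none is found.
    """
    lines = stdout.splitlines()
    for index, line in enumerate(lines):
        if line.lstrip().startswith("{"):
            # Everything from the first "{" line onward is treated as JSON.
            return json.loads("\n".join(lines[index:]))
    raise ValueError("no JSON object found in output")
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` alone to handle both the missing-object and malformed-JSON cases.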

Comment thread marimo/_server/api/endpoints/health.py
@mscolnick mscolnick added the enhancement New feature or request label Jun 10, 2026
@mscolnick

Contributor

@Set27 looks good. some failing CI tests

@Set27 Set27 force-pushed the add-support-to-show-rocm-stats branch from 1eba709 to 3c94bcc Compare June 11, 2026 15:12
@Set27

Set27 commented Jun 11, 2026

Contributor Author

@Set27 looks good. some failing CI tests

I can't reproduce the CI failure when running the same command locally; I guess I ran into a transient failure, so I rebased on the latest main.
Could you start the workflows one more time?

@mscolnick mscolnick merged commit ad5cd89 into marimo-team:main Jun 11, 2026
36 of 39 checks passed
@Set27 Set27 deleted the add-support-to-show-rocm-stats branch June 11, 2026 15:44
@mscolnick

Contributor

thank you for this feature @Set27 !


Labels

enhancement New feature or request

3 participants