Add CUDA process checkpointing helpers #1983
Conversation
```python
from cuda import cuda as _driver
# ...
ProcessStateT = _Literal["running", "locked", "checkpointed", "failed"]
```
Expose this to cuda.core.typing and then add it to api_private.rst so it is rendered by Sphinx.
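For illustration, a minimal sketch of that re-export, assuming `ProcessStateT` stays defined in the `checkpoint` module and `cuda.core.typing` simply re-exports it (the exact module layout is an assumption); the `api_private.rst` entry would then reference `cuda.core.typing.ProcessStateT` so Sphinx picks it up:

```python
# cuda/core/typing.py -- sketch of the suggested re-export; module layout assumed
from cuda.core.checkpoint import ProcessStateT

__all__ = ["ProcessStateT"]
```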
| 0: "running", | ||
| 1: "locked", | ||
| 2: "checkpointed", | ||
| 3: "failed", |
nit: use the actual enumerators as keys instead of plain Python ints.
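A sketch of what that could look like, assuming the bindings expose the driver's process-state enumerators under a `CUprocessState` enum (the enumerator names below are assumptions for illustration):

```python
from cuda.bindings import driver

# Keyed by driver enumerators rather than plain ints; enumerator names assumed.
_PROCESS_STATES = {
    driver.CUprocessState.CU_PROCESS_STATE_RUNNING: "running",
    driver.CUprocessState.CU_PROCESS_STATE_LOCKED: "locked",
    driver.CUprocessState.CU_PROCESS_STATE_CHECKPOINTED: "checkpointed",
    driver.CUprocessState.CU_PROCESS_STATE_FAILED: "failed",
}
```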
```python
from cuda.core import checkpoint
# ...
process = checkpoint.Process(pid)
```
Q: Should we teach users how to get pid in Python?
Maybe we should call os.getpid() here?
This would allow checkpointing a non-current process, so I don't think using os.getpid() is appropriate?
It might be worth pointing out in api.rst that this is typically used to checkpoint a different process.
> This would allow checkpointing a non-current process

> this is typically used to checkpoint a different process.
I think os.getpid() allows for checkpointing self, which is useful as demo'd in the linked code.
I do not believe all PIDs are allowed. I assume only processes owned by the current user can be checkpointed (either limited by the Linux kernel or the CUDA driver).
In any case, the example snippet in api.rst isn't very clear with the current 4 lines of code (lock -> checkpoint -> restore -> unlock). It is not the full story; there are lots of things that need to happen behind the scenes. That was the main reason I started digging into all of this myself without relying on AI.
FWIW, my understanding is that checkpointing is something you'd typically do to a process, analogous to sending a signal or attaching a debugger. Linux handles the permissions, and checkpointing requires CAP_SYS_PTRACE, the same permissions needed to attach a debugger or run strace against another user's process. One might expect a system admin to run it with sudo privileges.
The main purpose of CUDA checkpoint is to ensure everything managed by CUDA resides in CPU user space so that a tool such as CRIU can capture a complete process image. Without this, CRIU would miss the GPU state.
Use cases:
- Migrate a GPU workload to a different system.
- Periodically checkpoint a long-running job so it can be quickly resumed after a potential system failure.
- Preempt GPU resources to favor a job with higher priority.
These fit naturally into a system-admin role. It looks like CUDA allows a process to checkpoint itself, but it seems to me the use cases would be niche.
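To make the four-step flow concrete, here is a hedged sketch of the self-checkpoint case discussed above, assuming the `checkpoint.Process` API from this PR (with `lock` taking a `timeout_ms` and `restore` defaulting to the original devices when no `gpu_mapping` is given):

```python
import os

from cuda.core import checkpoint

# Self-checkpointing is the niche case; typically `pid` would belong to another
# process and the caller would need CAP_SYS_PTRACE-level permissions.
process = checkpoint.Process(os.getpid())

process.lock(timeout_ms=1000)   # block further CUDA work while state is captured
process.checkpoint()            # move CUDA state off the GPUs into CPU memory
# ... a tool such as CRIU could snapshot the full process image here ...
process.restore()               # move the captured state back onto the GPU(s)
process.unlock()                # resume normal CUDA operation
```

The CRIU step is where the complete process image would actually be captured; as noted above, the four calls alone are not the full story.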
```python
pairs = []
for old_uuid, new_uuid in gpu_mapping.items():
    pair = driver.CUcheckpointGpuPair()
    pair.oldUuid = old_uuid
    pair.newUuid = new_uuid
    pairs.append(pair)

if not pairs:
    return None

args = driver.CUcheckpointRestoreArgs()
args.gpuPairs = pairs
args.gpuPairsCount = len(pairs)
```
(The CUDA docs are pretty lacking, unfortunately...)
Sorry, but I find this test suite very problematic. Why are we mocking all of the tests? This only works while the checkpoint module is implemented in pure Python; once we lower it to Cython/C++, it won't work. Plus, across cuda-core we never, ever mock the tests; we always require GPU machines to test cuda-core functionality. This seems like agentic laziness to avoid writing/running GPU tests from within a sandbox!
I had my agent mock the test suite because CUDA checkpointing requires more than just interacting with the CUDA driver and other libraries. It requires CRIU, which needs a whole bunch of kernel capabilities, along with a harness for process management. It felt like it would be fragile to implement reliable real tests because of this, so I opted for mocking the driver given that the API surface is quite small.
Happy to give it a shot at writing actual tests that checkpoint and restore a process instead of mocking things, if you think that would be fruitful.
If we write the tests using Andy’s min-2-GPU decorator fixture, we can test the GPU migration capability without CRIU. The idea is that we shuffle each GPU’s state to the next one (and wrap around).
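A rough sketch of such a test, assuming a hypothetical `min_gpus` marker standing in for that decorator fixture, plus fixtures that provide a spawned CUDA worker's PID and the visible GPU UUIDs (none of these names are real fixtures in the repository):

```python
import pytest

from cuda.core import checkpoint


@pytest.mark.min_gpus(2)  # hypothetical stand-in for the min-2-GPU decorator fixture
def test_restore_onto_shuffled_gpus(cuda_worker_pid, gpu_uuids):
    process = checkpoint.Process(cuda_worker_pid)

    process.lock(timeout_ms=5000)
    process.checkpoint()

    # Rotate each checkpointed GPU's state onto the next device (wrapping around),
    # exercising the gpu_mapping path without involving CRIU.
    mapping = {
        uuid: gpu_uuids[(i + 1) % len(gpu_uuids)] for i, uuid in enumerate(gpu_uuids)
    }
    process.restore(gpu_mapping=mapping)
    process.unlock()

    assert process.state == "running"
```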
```
gpu_mapping : mapping, optional
    GPU UUID remapping from each checkpointed GPU UUID to the GPU UUID
    to restore onto. If provided, the mapping must contain every
    checkpointed GPU UUID.
```
I think this is why a real test, instead of a mocked one, is a MUST. Apparently, the API doc and the example code diverge here: the latter requires that "all devices visible to CUDA" appear in the mapping, not just those participating in checkpointing (as the former indicates). We should get clarification on this (even better, find a way to test it).
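Until that is clarified, one conservative way to satisfy the stricter reading (every device visible to CUDA appears in the mapping) is to start from an identity mapping and overlay only the actual migrations. This is a plain-Python sketch that makes no assumption about the driver behavior; how the visible UUIDs are obtained is left open:

```python
def build_restore_mapping(visible_uuids, migrations):
    """Build a gpu_mapping covering every visible device.

    visible_uuids: UUIDs of all GPUs visible to CUDA (obtained elsewhere).
    migrations: mapping of only the UUIDs actually being moved.
    """
    # Identity-map every visible device, then apply the explicit migrations.
    mapping = {uuid: uuid for uuid in visible_uuids}
    mapping.update(migrations)
    return mapping
```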
Summary
- New `cuda.core.checkpoint` module for CUDA process checkpointing APIs, exposed through `checkpoint.Process`
- `Process.state` now returns typed string states (`"running"`, `"locked"`, `"checkpointed"`, or `"failed"`) instead of a public enum
- `checkpoint.Process(pid)` provides `state`, `restore_thread_id`, `lock`, `checkpoint`, `restore`, and `unlock`
- GPU UUID remapping is accepted via `Process.restore(gpu_mapping=...)` and converted to the driver `CUcheckpointGpuPair`/`CUcheckpointRestoreArgs` structures internally
- The checkpoint module is kept separate from `cuda.core.system`, which remains focused on CUDA system and NVML capabilities
- Availability checks validate the `cuda-bindings` version, required binding symbols, and CUDA driver version

Closes #1343
Testing
- `pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py`
- `pixi run ruff format cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py`
- `pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py` (25 passed)
- `SPHINX_CUDA_CORE_VER=0.7.1.dev63 BUILD_LATEST=1 pixi run --manifest-path cuda_core -e docs sphinx-build -b html -W --keep-going -j 4 cuda_core/docs/source /tmp/cuda_core_docs_checkpoint_verify_2`
- `pixi run --manifest-path cuda_core test` (2817 passed, 346 skipped, 2 failed in local NVML/system tests; checkpoint tests passed)
- `git diff --check`

The current checkpoint tests are implemented as focused unit tests in `cuda_core/tests/test_checkpoint.py`. They use a small mock CUDA driver surface and monkeypatch `checkpoint._get_driver()` so the behavioral tests do not require a live checkpoint-capable driver or process. The mock driver records each driver call and provides minimal stand-ins for `CUcheckpointLockArgs`, `CUcheckpointRestoreArgs`, `CUcheckpointGpuPair`, `CUresult`, and process states.

The tests cover public symbol exposure, string process state mapping, restore thread queries, lock timeout argument construction, checkpoint/unlock null argument behavior, restore GPU UUID mapping conversion, empty restore mappings, input validation for `pid`, `timeout_ms`, and `gpu_mapping`, unsupported-driver error translation, missing runtime checkpoint symbol translation, cached availability checks, unsupported `cuda-bindings` versions, missing binding symbols, and unsupported driver versions.

The two local full-suite failures are the existing NVML/system environment-sensitive failures we are ignoring for this PR:

- `tests/system/test_system_device.py::test_get_inforom_version` returns an empty InfoROM board part number locally.
- `tests/system/test_system_system.py::test_get_process_name` hits an NVML UTF-8 decode error locally.
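For reference, the monkeypatch pattern described above roughly follows the usual pytest approach. This sketch is illustrative only and is not the actual fixture in `test_checkpoint.py`; the recorded-call shape and the driver return convention are assumptions:

```python
import pytest

from cuda.core import checkpoint


class _MockDriver:
    """Minimal stand-in that records which checkpoint entry points were called."""

    def __init__(self):
        self.calls = []

    def cuCheckpointProcessLock(self, pid, args):
        self.calls.append(("lock", pid, args))
        return (0,)  # CUDA_SUCCESS-style result; exact return shape assumed


@pytest.fixture
def mock_driver(monkeypatch):
    driver = _MockDriver()
    # Behavioral tests never touch a real driver: _get_driver() is monkeypatched.
    monkeypatch.setattr(checkpoint, "_get_driver", lambda: driver)
    return driver
```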