Add CUDA process checkpointing helpers #1983
Conversation
```python
from cuda import cuda as _driver
# ...
ProcessStateT = _Literal["running", "locked", "checkpointed", "failed"]
```
Expose this to cuda.core.typing and then add it to api_private.rst so it is rendered by Sphinx.
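For illustration, a minimal sketch of that re-export, assuming `ProcessStateT` stays defined in the `checkpoint` module and `cuda.core.typing` simply re-exports it (the exact module layout is an assumption); the `api_private.rst` entry would then reference `cuda.core.typing.ProcessStateT` so Sphinx picks it up:

```python
# cuda/core/typing.py -- sketch of the suggested re-export; module layout assumed
from cuda.core.checkpoint import ProcessStateT

__all__ = ["ProcessStateT"]
```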
| 0: "running", | ||
| 1: "locked", | ||
| 2: "checkpointed", | ||
| 3: "failed", |
nit: use the actual enumerators as keys instead of plain Python ints.
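A sketch of what that could look like, assuming the bindings expose the driver's process-state enumerators under a `CUprocessState` enum (the enumerator names below are assumptions for illustration):

```python
from cuda.bindings import driver

# Keyed by driver enumerators rather than plain ints; enumerator names assumed.
_PROCESS_STATES = {
    driver.CUprocessState.CU_PROCESS_STATE_RUNNING: "running",
    driver.CUprocessState.CU_PROCESS_STATE_LOCKED: "locked",
    driver.CUprocessState.CU_PROCESS_STATE_CHECKPOINTED: "checkpointed",
    driver.CUprocessState.CU_PROCESS_STATE_FAILED: "failed",
}
```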
```python
from cuda.core import checkpoint
# ...
process = checkpoint.Process(pid)
```
Q: Should we teach users how to get pid in Python?
Maybe we should call os.getpid() here?
This would allow checkpointing a non-current process, so I don't think using os.getpid() is appropriate?
It might be worth pointing out in api.rst that this is typically used to checkpoint a different process.
> This would allow checkpointing a non-current process

> this is typically used to checkpoint a different process.
I think os.getpid() allows for checkpointing self, which is useful as demo'd in the linked code.
I do not believe all PIDs are allowed. I assume only processes owned by the current user can be checkpointed (either limited by the Linux kernel or the CUDA driver).
In any case, the example snippet in api.rst isn't very clear with the current 4 lines of code (lock -> checkpoint -> restore -> unlock). It is not the full story; there are lots of things that need to happen behind the scenes. That was the main reason I started digging into all of this myself without relying on AI.
FWIW, my understanding is that checkpointing is something you'd typically do to a process, analogous to sending a signal or attaching a debugger. Linux handles the permissions, and checkpointing requires CAP_SYS_PTRACE, the same permissions needed to attach a debugger or run strace against another user's process. One might expect a system admin to run it with sudo privileges.
The main purpose of CUDA checkpoint is to ensure everything managed by CUDA resides in CPU user space so that a tool such as CRIU can capture a complete process image. Without this, CRIU would miss the GPU state.
Use cases:
- Migrate a GPU workload to a different system.
- Periodically checkpoint a long-running job so it can be quickly resumed after a potential system failure.
- Preempt GPU resources to favor a job with higher priority.
These fit naturally into a system-admin role. It looks like CUDA allows a process to checkpoint itself, but it seems to me the use cases would be niche.
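To make the four-step flow concrete, here is a hedged sketch of the self-checkpoint case discussed above, assuming the `checkpoint.Process` API from this PR (with `lock` taking a `timeout_ms` and `restore` defaulting to the original devices when no `gpu_mapping` is given):

```python
import os

from cuda.core import checkpoint

# Self-checkpointing is the niche case; typically `pid` would belong to another
# process and the caller would need CAP_SYS_PTRACE-level permissions.
process = checkpoint.Process(os.getpid())

process.lock(timeout_ms=1000)   # block further CUDA work while state is captured
process.checkpoint()            # move CUDA state off the GPUs into CPU memory
# ... a tool such as CRIU could snapshot the full process image here ...
process.restore()               # move the captured state back onto the GPU(s)
process.unlock()                # resume normal CUDA operation
```

The CRIU step is where the complete process image would actually be captured; as noted above, the four calls alone are not the full story.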
```python
pairs = []
for old_uuid, new_uuid in gpu_mapping.items():
    pair = driver.CUcheckpointGpuPair()
    pair.oldUuid = old_uuid
    pair.newUuid = new_uuid
    pairs.append(pair)

if not pairs:
    return None

args = driver.CUcheckpointRestoreArgs()
args.gpuPairs = pairs
args.gpuPairsCount = len(pairs)
```
(The CUDA docs are pretty lacking, unfortunately...)
Sorry, but I find this test suite very problematic. Why are we mocking all of the tests? This only works while the checkpoint module is implemented in pure Python; once we lower it to Cython/C++, it won't work. Plus, across cuda-core we never, ever mock the tests; we always require GPU machines to test cuda-core functionality. This seems like agentic laziness to avoid writing/running GPU tests from within a sandbox!
I had my agent mock the test suite because CUDA checkpointing requires more than just interacting with the CUDA driver and other libraries. It requires CRIU, which needs a whole bunch of kernel capabilities, along with a harness for process management. It felt like it would be fragile to implement reliable real tests because of this, so I opted for mocking the driver given that the API surface is quite small.
Happy to give it a shot at writing actual tests that checkpoint and restore a process instead of mocking things, if you think that would be fruitful.
If we write the tests using Andy’s min-2-GPU decorator fixture, we can test the GPU migration capability without CRIU. The idea is that we shuffle each GPU’s state to the next one (and wrap around).
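A rough sketch of such a test, assuming a hypothetical `min_gpus` marker standing in for that decorator fixture, plus fixtures that provide a spawned CUDA worker's PID and the visible GPU UUIDs (none of these names are real fixtures in the repository):

```python
import pytest

from cuda.core import checkpoint


@pytest.mark.min_gpus(2)  # hypothetical stand-in for the min-2-GPU decorator fixture
def test_restore_onto_shuffled_gpus(cuda_worker_pid, gpu_uuids):
    process = checkpoint.Process(cuda_worker_pid)

    process.lock(timeout_ms=5000)
    process.checkpoint()

    # Rotate each checkpointed GPU's state onto the next device (wrapping around),
    # exercising the gpu_mapping path without involving CRIU.
    mapping = {
        uuid: gpu_uuids[(i + 1) % len(gpu_uuids)] for i, uuid in enumerate(gpu_uuids)
    }
    process.restore(gpu_mapping=mapping)
    process.unlock()

    assert process.state == "running"
```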
```
gpu_mapping : mapping, optional
    GPU UUID remapping from each checkpointed GPU UUID to the GPU UUID
    to restore onto. If provided, the mapping must contain every
    checkpointed GPU UUID.
```
I think this is why a real test, instead of a mocked one, is a MUST. Apparently, the API doc and the example code diverge here: the latter requires that "all devices visible to CUDA" appear in the mapping, not just those participating in checkpointing (as the former indicates). We should get clarification on this (even better, find a way to test it).
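Until that is clarified, one conservative way to satisfy the stricter reading (every device visible to CUDA appears in the mapping) is to start from an identity mapping and overlay only the actual migrations. This is a plain-Python sketch that makes no assumption about the driver behavior; how the visible UUIDs are obtained is left open:

```python
def build_restore_mapping(visible_uuids, migrations):
    """Build a gpu_mapping covering every visible device.

    visible_uuids: UUIDs of all GPUs visible to CUDA (obtained elsewhere).
    migrations: mapping of only the UUIDs actually being moved.
    """
    # Identity-map every visible device, then apply the explicit migrations.
    mapping = {uuid: uuid for uuid in visible_uuids}
    mapping.update(migrations)
    return mapping
```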
Summary
- New `cuda.core.checkpoint` module for CUDA process checkpointing APIs, exposed through `checkpoint.Process`
- `Process.state` now returns typed string states (`"running"`, `"locked"`, `"checkpointed"`, or `"failed"`) instead of a public enum
- `checkpoint.Process(pid)` provides `state`, `restore_thread_id`, `lock`, `checkpoint`, `restore`, and `unlock`
- GPU UUID remapping is accepted via `Process.restore(gpu_mapping=...)` and converted to the driver `CUcheckpointGpuPair`/`CUcheckpointRestoreArgs` structures internally
- The checkpoint module is kept separate from `cuda.core.system`, which remains focused on CUDA system and NVML capabilities
- Availability checks validate the `cuda-bindings` version, required binding symbols, and CUDA driver version

Closes #1343
Testing
- `pixi run ruff check cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py`
- `pixi run ruff format cuda_core/cuda/core/checkpoint.py cuda_core/tests/test_checkpoint.py`
- `pixi run --manifest-path cuda_core pytest cuda_core/tests/test_checkpoint.py` (25 passed)
- `SPHINX_CUDA_CORE_VER=0.7.1.dev63 BUILD_LATEST=1 pixi run --manifest-path cuda_core -e docs sphinx-build -b html -W --keep-going -j 4 cuda_core/docs/source /tmp/cuda_core_docs_checkpoint_verify_2`
- `pixi run --manifest-path cuda_core test` (2817 passed, 346 skipped, 2 failed in local NVML/system tests; checkpoint tests passed)
- `git diff --check`

The current checkpoint tests are implemented as focused unit tests in `cuda_core/tests/test_checkpoint.py`. They use a small mock CUDA driver surface and monkeypatch `checkpoint._get_driver()` so the behavioral tests do not require a live checkpoint-capable driver or process. The mock driver records each driver call and provides minimal stand-ins for `CUcheckpointLockArgs`, `CUcheckpointRestoreArgs`, `CUcheckpointGpuPair`, `CUresult`, and process states.

The tests cover public symbol exposure, string process state mapping, restore thread queries, lock timeout argument construction, checkpoint/unlock null argument behavior, restore GPU UUID mapping conversion, empty restore mappings, input validation for `pid`, `timeout_ms`, and `gpu_mapping`, unsupported-driver error translation, missing runtime checkpoint symbol translation, cached availability checks, unsupported `cuda-bindings` versions, missing binding symbols, and unsupported driver versions.

The two local full-suite failures are the existing NVML/system environment-sensitive failures we are ignoring for this PR:

- `tests/system/test_system_device.py::test_get_inforom_version` returns an empty InfoROM board part number locally.
- `tests/system/test_system_system.py::test_get_process_name` hits an NVML UTF-8 decode error locally.
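For reference, the monkeypatch pattern described above roughly follows the usual pytest approach. This sketch is illustrative only and is not the actual fixture in `test_checkpoint.py`; the recorded-call shape and the driver return convention are assumptions:

```python
import pytest

from cuda.core import checkpoint


class _MockDriver:
    """Minimal stand-in that records which checkpoint entry points were called."""

    def __init__(self):
        self.calls = []

    def cuCheckpointProcessLock(self, pid, args):
        self.calls.append(("lock", pid, args))
        return (0,)  # CUDA_SUCCESS-style result; exact return shape assumed


@pytest.fixture
def mock_driver(monkeypatch):
    driver = _MockDriver()
    # Behavioral tests never touch a real driver: _get_driver() is monkeypatched.
    monkeypatch.setattr(checkpoint, "_get_driver", lambda: driver)
    return driver
```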