Add managed-memory advise, prefetch, and discard-prefetch free functions #1775

Open
rparolin wants to merge 62 commits into NVIDIA:main from rparolin:rparolin/managed_mem_advise_prefetch

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Mar 17, 2026

Summary

Adds managed-memory range operations to cuda.core:

  • Free functions in cuda.core.utils: advise, prefetch, discard, discard_prefetch. Each accepts either a single Buffer or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding cuMem*BatchAsync (CUDA 13+).
  • Host — new top-level class symmetric to Device. Host() (any host), Host(numa_id=N), Host.numa_current(). Used together with Device to express managed-memory locations.
  • ManagedBuffer — Buffer subclass returned by ManagedMemoryResource.allocate. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with Buffer.from_handle(...) (now a @classmethod, so ManagedBuffer.from_handle(...) returns a ManagedBuffer).
  • *Options dataclasses — AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions. Frozen dataclasses reserved for future per-call flags; the current ABI has no flags worth surfacing, but the dataclasses establish the contract so future flags land without an API break.

Closes #1332. Addresses the managed-memory portion of #1333 (P1: cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync). The P0 cuMemcpyBatchAsync from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming cuMemcpyBatchAsync work can mirror it.

Public API

ManagedBuffer — property-style advice on managed allocations

ManagedMemoryResource.allocate returns a ManagedBuffer (a Buffer subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

from cuda.core import Device, Host, ManagedMemoryResource

mr = ManagedMemoryResource()
buf = mr.allocate(size)                # ManagedBuffer

# Driver-backed properties — getter queries the driver, setter calls cuMemAdvise.
buf.read_mostly = True
buf.preferred_location = Device(0)     # or Host(), or Host(numa_id=N)
buf.preferred_location = None          # unset

# Live set-like view of `set_accessed_by` advice.
buf.accessed_by.add(Device(1))
buf.accessed_by.discard(Device(1))
buf.accessed_by = {Device(0), Device(1)}   # diff vs current; advise only deltas

# Instance methods delegate to the matching free functions.
buf.prefetch(Device(0), stream=stream)
buf.discard(stream=stream)
buf.discard_prefetch(Device(0), stream=stream)

Note: the legacy cuMemRangeGetAttribute query path returns integer device ordinals, so Host(numa_id=...) collapses to a generic Host() on read-back. Setters preserve full NUMA information when issuing advice.

Free functions — advise / prefetch / discard / discard_prefetch

Each accepts a Buffer (or ManagedBuffer) or a sequence of them. Locations are expressed via Device, Host, or int (-1 → host, >=0 → device ordinal).

from cuda.core import Device, Host
from cuda.core.utils import advise, prefetch, discard, discard_prefetch

# Stage to GPU, kernel, bring back to host
prefetch(buf, Device(0), stream=stream)
launch_my_kernel(buf, stream=stream)
prefetch(buf, Host(), stream=stream)
stream.sync()
result = bytes(buf)

# int shorthand: -1 = host, >=0 = device ordinal
prefetch(buf, -1, stream=stream)

# Advice
advise(weights, "set_read_mostly")
advise(activations, "set_preferred_location", Device(0))
advise(scratch, "set_accessed_by", Device(0))

# Discard / discard+prefetch (CUDA 13+)
discard(scratch, stream=stream)
for step in range(num_steps):
    discard_prefetch(activations, Device(0), stream=stream)
    run_forward(activations, stream=stream)

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding cuMem*BatchAsync. Sequence locations are paired by index; a scalar location broadcasts to every target.

# Pair-by-index: output → GPU 0, log_metrics → host
prefetch(
    [output, log_metrics],
    [Device(0), Host()],
    stream=stream,
)

# Scalar broadcast: every shard moves to GPU 0
prefetch([shard_a, shard_b, shard_c], Device(0), stream=stream)

Mismatched sequence lengths raise ValueError. On a CUDA 12 build of cuda.core, N>1 raises NotImplementedError (the *BatchAsync entry points are CUDA 13+); N==1 works on every supported toolkit.
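The dispatch and broadcast rules above can be sketched in plain Python (a simplified model with stand-in Buf, _prefetch_one, and _prefetch_batch names — not the real cuda.core implementation):

```python
from collections.abc import Sequence

class Buf:
    """Stand-in for a cuda.core Buffer in this sketch."""
    def __init__(self, name):
        self.name = name

calls = []

def _prefetch_one(buf, loc, stream):
    # stand-in for the per-range driver path (cuMemPrefetchAsync)
    calls.append(("one", buf.name, loc))

def _prefetch_batch(bufs, locs, stream):
    # stand-in for the batched driver path (cuMemPrefetchBatchAsync, CUDA 13+)
    calls.append(("batch", [b.name for b in bufs], list(locs)))

def prefetch(buffers, locations, *, stream=None):
    if isinstance(buffers, Buf):                     # single buffer → per-range path
        return _prefetch_one(buffers, locations, stream)
    buffers = list(buffers)
    if not isinstance(locations, Sequence):          # scalar location broadcasts
        locations = [locations] * len(buffers)
    if len(buffers) != len(locations):
        raise ValueError("buffers and locations must have the same length")
    if len(buffers) == 1:                            # N==1 still uses the per-range path
        return _prefetch_one(buffers[0], locations[0], stream)
    return _prefetch_batch(buffers, locations, stream)   # N>1 → batch entry point

prefetch(Buf("weights"), 0)                # single buffer, device ordinal 0
prefetch([Buf("a"), Buf("b")], 0)          # scalar broadcast → batch path
```

The same shape covers pair-by-index locations: passing a sequence of locations skips the broadcast step and only the length check applies.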

Putting it together

weights = mr.allocate(weights_size)    # ManagedBuffer
inputs  = mr.allocate(inputs_size)
output  = mr.allocate(output_size)

# One-time hints (property API on ManagedBuffer)
weights.read_mostly = True
weights.preferred_location = Device(0)
output.preferred_location = Device(0)

# Per inference
inputs.prefetch(Device(0), stream=stream)
run_inference(weights, inputs, output, stream=stream)
output.prefetch(Host(), stream=stream)
stream.sync()

Implementation notes

  • Cython implementation in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx uses cimport cydriver for direct C-level driver calls.
  • The CUDA 12 / 13 ABI split for cuMemAdvise and cuMemPrefetchAsync is handled at compile time with IF CUDA_CORE_BUILD_MAJOR >= 13: / ELSE:.
  • Batched entry points (cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise NotImplementedError; single-buffer calls work everywhere.
  • ManagedBuffer is a pure-Python subclass of the Cython Buffer cdef class. Buffer.from_handle is now a @classmethod (was @staticmethod) so MyBufferSubclass.from_handle(...) returns the typed instance via cls._init. Buffer_from_deviceptr_handle and _MP_allocate thread an optional cls parameter so ManagedMemoryResource.allocate materializes a ManagedBuffer.
  • Internal _LocSpec (in _managed_location.py) carries the (kind, id) discriminator that the Cython layer maps to CUmemLocation (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only Device / Host / int; _coerce_location produces the internal record.
  • _buffer.pyx collapses out.is_managed = (is_managed != 0) to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by CU_POINTER_ATTRIBUTE_IS_MANAGED.
  • The defensive cuInit retry in _query_memory_attrs was removed; we don't auto-init CUDA elsewhere.

Tests

Managed-memory tests live in cuda_core/tests/memory/test_managed_ops.py: free-function dispatch (single + batched + mismatch + non-managed rejection), Host constructors and frozen-dataclass semantics, internal _coerce_location for Device | Host | int | None, full ManagedBuffer property roundtrips (read_mostly, preferred_location, accessed_by add/discard/assignment), and instance methods. The broader memory-tests reorg (buffer / managed_resource / pinned / vmm "siblings") is tracked as a separate cleanup PR.

Deferred follow-ups

  • HMM/ATS-aware is_managed semantics — flagged as a TODO in _buffer.pyx, tracked alongside the broader HMM/ATS work.
  • cuMemcpyBatchAsync (P0 of Support batched memory movement #1333) — different family, separate PR; will mirror the contract in #issuecomment-4355502334.
  • Concrete fields on the *Options dataclasses — they're empty today; concrete options land when CUDA introduces per-call flags worth surfacing.
  • CUDA 13 split-attribute read-back for preferred_location / accessed_by — currently uses the legacy combined attribute (Python binding limitation), which loses NUMA fidelity on round-trip. Setters preserve full NUMA info.
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@rparolin rparolin requested a review from Andy-Jost March 17, 2026 00:41
@rparolin rparolin self-assigned this Mar 17, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:45
@rparolin rparolin marked this pull request as draft March 17, 2026 00:45
@rparolin rparolin changed the title wip Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:57
@rparolin
Collaborator Author

/ok to test

@jrhemstad

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

@rparolin
Collaborator Author

rparolin commented Mar 17, 2026

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

I'm moving this back into draft. We discussed this in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives, and free functions sound like a good one to explore.

@rparolin rparolin marked this pull request as draft March 17, 2026 19:35
@rparolin rparolin marked this pull request as ready for review March 17, 2026 23:46
rparolin and others added 7 commits March 17, 2026 17:30
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from
  _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a
  single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag,
  discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch
  for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in
  api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path

The _V2_BINDINGS cache in _buffer.pyx persists across tests, so
monkeypatching get_binding_version alone is insufficient when earlier
tests have already populated the cache with the v2 value. Promote
_V2_BINDINGS from cdef int to a Python-level variable so tests can
monkeypatch it directly via monkeypatch.setattr, and reset it to -1
in both legacy-signature tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware

These three tests call cuMemAdvise on real CUDA devices and verify
memory range attributes. On devices without concurrent_managed_access
(e.g. Windows/WDDM), set_read_mostly silently no-ops and
set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the
stricter _skip_if_managed_location_ops_unsupported guard, matching the
pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support

Reorder checks in discard_prefetch so _normalize_managed_target_range
runs before _require_managed_discard_prefetch_support. This ensures
non-managed buffers raise ValueError before the RuntimeError for missing
cuMemDiscardAndPrefetchBatchAsync support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module

Move advise, prefetch, and discard_prefetch functions and their helpers
out of _buffer.pyx into a new _managed_memory_ops Cython module to
improve separation of concerns. Expose _init_mem_attrs and
_query_memory_attrs as non-inline cdef functions in _buffer.pxd so the
new module can reuse them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Buffer.from_handle is now a classmethod that dispatches via cls._init,
  so subclasses inherit it: ManagedBuffer.from_handle(...) returns a
  ManagedBuffer with no override needed. Drop ManagedBuffer.from_handle.
- Hoist `advise / prefetch / discard / discard_prefetch` imports from
  per-method lazy imports to module-level (no circular import: they live
  in cuda.core._memory._managed_memory_ops, not cuda.core.utils).
- Cache the CUmem_advise and CUmem_range_attribute enum lookups at
  module level and pass enum constants directly to advise() instead of
  re-resolving from string aliases on every property write.
- Extract _query_accessed_by as a module-level helper; AccessedBySet
  delegates and the accessed_by setter calls it directly instead of
  constructing a throwaway view.
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
@leofang leofang requested review from Andy-Jost and leofang April 30, 2026 22:24
Comment thread cuda_core/cuda/core/_memory/_managed_memory_options.py Outdated
Member

@leofang leofang left a comment


I need to run. Will try to revisit tonight. I haven't finished reviewing (too many lines).

Comment thread cuda_core/cuda/core/_host.py Outdated
from dataclasses import dataclass


@dataclass(frozen=True)
Member


I dunno if this is a collective agentic illusion or what, in recent PRs I've seen many data classes. Why do we need one here?

Collaborator Author


This is what I got when I asked Claude why it selected a dataclass. I'll remove that decorator.

"A. Reply only — defend the dataclass

The dataclass is doing real work, not decoration:

  • eq/hash are tested behaviors — tests/memory/test_managed_ops.py:323-327 asserts Host() == Host(), hash(Host(numa_id=1)) ==
    hash(Host(numa_id=1)), etc.
  • Host is used in set/list comparisons in _managed_buffer.py:49 ([Host() if v == -1 else Device(v) for v in raw...]) — needs hashability if
    it ever lands in a set.
  • frozen=True ensures users can't mutate a Host after stashing it on ManagedBuffer.preferred_location."
Member


My bot reviewed and raised this idea: Host should follow Device and be a singleton class. @Andy-Jost thoughts?

Comment thread cuda_core/cuda/core/_host.py Outdated
Comment thread cuda_core/cuda/core/_memory/_buffer.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_buffer.pyx
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py Outdated
Comment thread cuda_core/cuda/core/utils.py Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_options.py Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
Comment thread cuda_core/cuda/core/_memory/_managed_memory_resource.pyx
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Contributor

@Andy-Jost Andy-Jost left a comment


None of my comments are blocking. Looks ready to me.

rparolin and others added 5 commits April 30, 2026 17:26
Per Andy's review nit (PR NVIDIA#1775, _managed_memory_ops.pyx:207), replace
the manual PyMem_Malloc / PyMem_Free pattern in the three batch helpers
(_do_batch_discard, _do_batch_prefetch, _do_batch_discard_prefetch)
with libcpp.vector. RAII handles cleanup, eliminating the manual
try/finally and removing a leak window if _to_cumemlocation raised
mid-fill. Matches the precedent used in _program.pyx, _linker.pyx,
_kernel_arg_handler.pyx, _graph_node.pyx, and others.

Net change: 53 insertions, 85 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y_memory_attrs (R4)

Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:455), restore the auto-init
retry that was removed in 10de998. cuPointerGetAttributes is the
first driver call _query_memory_attrs makes, and a NOT_INITIALIZED
result here would otherwise propagate out of every is_managed /
is_host_accessible / is_device_accessible query before the user has
called any other Device API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_host.py:9), drop the @DataClass(frozen=True)
in favor of a hand-written class with property accessors. Matches Leo's
original sketch from the 2026-04-28 drive-by comment and aligns with
how Device is structured in this codebase.

Behavior preserved: Host(), Host(numa_id=N), and Host.numa_current()
all work identically. __eq__, __hash__, and immutability are
hand-rolled rather than dataclass-generated.

is_numa_current is no longer an __init__ kwarg — it's internal state
settable only via the Host.numa_current() classmethod. Two existing
TestHost cases updated:
  - test_numa_current_with_id_rejected → test_numa_current_only_via_classmethod
  - test_frozen → test_immutable (AttributeError instead of FrozenInstanceError)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
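As a rough Python model of the hand-rolled class this commit describes (illustrative stand-in code with assumed attribute names, not the actual _host.py):

```python
class Host:
    """Immutable host location: Host(), Host(numa_id=N), or Host.numa_current()."""
    __slots__ = ("_numa_id", "_is_numa_current")

    def __init__(self, numa_id=None):
        # __setattr__ is blocked below, so initialize via object.__setattr__
        object.__setattr__(self, "_numa_id", numa_id)
        object.__setattr__(self, "_is_numa_current", False)

    @classmethod
    def numa_current(cls):
        # is_numa_current is internal state, settable only via this classmethod
        self = cls()
        object.__setattr__(self, "_is_numa_current", True)
        return self

    @property
    def numa_id(self):
        return self._numa_id

    def __setattr__(self, name, value):
        # hand-rolled immutability: AttributeError instead of FrozenInstanceError
        raise AttributeError(f"Host is immutable; cannot set {name!r}")

    def __eq__(self, other):
        return (isinstance(other, Host)
                and self._numa_id == other._numa_id
                and self._is_numa_current == other._is_numa_current)

    def __hash__(self):
        return hash((Host, self._numa_id, self._is_numa_current))
```

This keeps the dataclass-era behaviors (equality, hashability, immutability) while dropping the decorator, matching the Device-style structure the commit mentions.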
… (R6, R8)

Per Leo's review on PR NVIDIA#1775 (_managed_buffer.py:165) and Andy's
parallel question (line 144), drop the `int` shorthand for
prefetch/discard_prefetch/advise locations. The previous design
accepted `Device | Host | int` where `int >= 0` meant a device ordinal
and `-1` magically meant host. With first-class `Device` and `Host`,
the int form was redundant and the `-1 → Host` magic was surprising.

Public API change:
  prefetch(buf, Device(0), stream=...)   # was: prefetch(buf, 0, stream=...)
  prefetch(buf, Host(),    stream=...)   # was: prefetch(buf, -1, stream=...)

This also resolves an inconsistency: ManagedBuffer.preferred_location
already accepted only Device | Host | None, but prefetch() and
discard_prefetch() accepted int. Now uniformly Device | Host.

Pre-1.0 breaking change. Anyone using the int shorthand should switch
to the explicit Device(N) / Host() form.

Files touched:
- _managed_location.py: drop the int branch from _coerce_location;
  TypeError now reads "Device, Host, or None"
- _managed_buffer.py: type signatures `Device | Host | int` → `Device | Host`
- _managed_memory_ops.pyx: docstring updates (3 occurrences)
- tests/memory/test_managed_ops.py: replace int call sites with
  Host()/Device(N); collapse three int-branch tests into one
  test_int_rejected
- 1.0.0-notes.rst: drop the "int values are also accepted" sentence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Andy's review on PR NVIDIA#1775 (_managed_buffer.py:52), document
`AccessedBySet` in the private API reference. It is returned by
`ManagedBuffer.accessed_by` but not directly instantiable by users —
matches the existing `_memory._ipc.*` entries in the same section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@leofang leofang left a comment


Code-level review focusing on DRY, verbosity, and test coverage — complementing the design-level comments already on the PR.

Summary of inline comments:

  1. _do_batch_prefetch / _do_batch_discard_prefetch are copy-pasted (~40 lines) — parameterize into one function
  2. Options isinstance check repeated 4x — extract a one-liner helper; also consolidate the near-identical prefetch / discard_prefetch preambles
  3. CUDA 12 batch fallback — loop over singles instead of NotImplementedError (the batch semantics are documented as equivalent to individual calls)
  4. _normalize_managed_advice over-engineered — the alias dict + lazy reverse dict can be replaced with a getattr on the naming convention (~15 lines saved)
  5. Test setup boilerplate — ~25 tests repeat the same 5-line preamble; a pytest fixture would save ~75 lines
  6. Test helper duplicated — _get_mem_range_attr in the test is identical to _get_int_attr in production code
  7. Test coverage gaps — CUDA 12 batch fallback, AccessedBySet iteration, stream=None, error message assertions
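Point 4's suggestion (replace the alias dict with a getattr on the naming convention) could look roughly like this; the enum here is a stub standing in for the real CUmem_advise binding:

```python
import enum

class CUmem_advise(enum.IntEnum):
    # stub mirroring the CUDA driver's CUmem_advise member names and values
    CU_MEM_ADVISE_SET_READ_MOSTLY = 1
    CU_MEM_ADVISE_UNSET_READ_MOSTLY = 2
    CU_MEM_ADVISE_SET_PREFERRED_LOCATION = 3
    CU_MEM_ADVISE_UNSET_PREFERRED_LOCATION = 4
    CU_MEM_ADVISE_SET_ACCESSED_BY = 5
    CU_MEM_ADVISE_UNSET_ACCESSED_BY = 6

def normalize_advice(advice):
    """Map 'set_read_mostly' → CU_MEM_ADVISE_SET_READ_MOSTLY via the naming convention."""
    if isinstance(advice, CUmem_advise):
        return advice
    member = getattr(CUmem_advise, f"CU_MEM_ADVISE_{advice.upper()}", None)
    if member is None:
        raise ValueError(f"unknown advice: {advice!r}")
    return member
```

The trade-off the thread later settles on: the convention-based lookup is shorter, while an explicit alias table is grep-friendly and keeps the short-string surface decoupled from the driver's naming.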
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx


def test_managed_memory_prefetch_supports_managed_pool_allocations(init_cuda):
device = Device()
Member


nit: Nearly every test function in this file repeats the same 5-line preamble:

device = Device()
_skip_if_managed_location_ops_unsupported(device)  # or variant
device.set_current()
mr = create_managed_memory_resource_or_skip()
buffer = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)

This appears in ~25 tests. A @pytest.fixture would eliminate this boilerplate and save ~75 lines:

@pytest.fixture
def managed_buffer(init_cuda):
    device = Device()
    _skip_if_managed_location_ops_unsupported(device)
    device.set_current()
    mr = create_managed_memory_resource_or_skip()
    buf = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)
    yield buf
    buf.close()
    mr.close()

Tests that need a different skip level (e.g. _skip_if_managed_discard_prefetch_unsupported) could use a second fixture or parametrize. The local _skip_if_* helpers also partially overlap with conftest's skip_if_managed_memory_unsupported — worth consolidating.

Comment thread cuda_core/tests/memory/test_managed_ops.py Outdated
Comment thread cuda_core/tests/memory/test_managed_ops.py
rparolin and others added 10 commits April 30, 2026 18:27
…red_location (R2, R7)

Per Leo's questions on PR NVIDIA#1775 (_host.py:26 and _managed_buffer.py:140):

R2 (Host numa_id): the dataclass surface is intentional. Three forms
already cover the use cases — Host() / Host(numa_id=N) /
Host.numa_current(). Auto-inferring numa_id at Host() construction
would conflict with the "generic host" semantic.

R7 (preferred_location getter): the underlying limitation is real but
upstream-blocked. The legacy CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION
returns only a single int (device id, -1 host, -2 none) — no NUMA. CUDA
13 added _PREFERRED_LOCATION_TYPE / _ID for full round-trip, and they
are exposed in cydriver, but cuda.bindings'
_HelperCUmem_range_attribute does not yet recognize them — calling
driver.cuMemRangeGetAttribute with the new attributes raises
"Unsupported attribute". Once cuda.bindings adds them, this getter can
query the v2 attributes and return Host(numa_id=N).

Add a docstring note documenting the limitation so users aren't
surprised by the lossy round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…12, R13)

Per Andy's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:102 and :118),
replace `isinstance(x, (list, tuple))` with `isinstance(x, Sequence)`
in `_coerce_buffer_targets` and `_broadcast_locations`. Matches the
existing precedent in `cuda.core._utils.cuda_utils.is_sequence()`.

The widened input set also accepts `str`, but neither `Buffer` nor
`Location` is stringly-typed, so a `str` input still raises — just
with a different message (Buffer cast error or Location TypeError
from `_coerce_location`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:135), make
Buffer.from_handle a @staticmethod that always returns Buffer.
Subclass-aware construction stays available via the private
@classmethod Buffer._init, which is what Leo asked for ("use a
private method for handling subclasses for now").

ManagedBuffer gains its own @classmethod from_handle that wraps
cls._init, so user-facing call sites like
ManagedBuffer.from_handle(ptr, size, owner=plain) continue to work
unchanged. The narrowly-scoped subclass factory is on the subclass
itself, not bolted onto Buffer's public surface.

This addresses R3's spirit: cuda.core's public APIs no longer
advertise generic subclass-construction support that conflicts
with the broader subclassing story tracked in NVIDIA#750 / NVIDIA#1989.

No test changes; behavior preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
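The split described here — public staticmethod on Buffer, subclass factory via a private classmethod — can be sketched as (plain-Python stand-ins, not the real Cython Buffer):

```python
class Buffer:
    def __init__(self, handle, size):
        self.handle, self.size = handle, size

    @classmethod
    def _init(cls, handle, size, owner=None):
        # private, subclass-aware constructor: cls() picks up the subclass
        obj = cls(handle, size)
        obj.owner = owner
        return obj

    @staticmethod
    def from_handle(handle, size):
        # public surface always returns a plain Buffer
        return Buffer._init(handle, size)

class ManagedBuffer(Buffer):
    @classmethod
    def from_handle(cls, handle, size, owner=None):
        # the subclass opts in explicitly by wrapping the private _init
        return cls._init(handle, size, owner=owner)
```

This keeps subclass construction off Buffer's public surface while ManagedBuffer.from_handle(...) still returns a typed ManagedBuffer.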
Per Leo's R11 ("if we prefer methods, don't expose free functions"):
each managed-memory operation now has exactly one public surface,
chosen by whether it acts on one buffer or many.

Single buffer (instance methods + properties on ManagedBuffer):
- buf.read_mostly = True
- buf.preferred_location = Device(0)
- buf.accessed_by.add(Device(1))
- buf.prefetch(Device(0), stream=stream)
- buf.discard(stream=stream)
- buf.discard_prefetch(Device(0), stream=stream)

Multiple buffers (free functions in cuda.core.utils, CUDA 13+ only):
- utils.prefetch_batch(buffers, locations, stream=stream)
- utils.discard_batch(buffers, stream=stream)
- utils.discard_prefetch_batch(buffers, locations, stream=stream)

Removed:
- cuda.core.utils.advise / prefetch / discard / discard_prefetch
  (single-buffer surfaces — replaced by ManagedBuffer methods/properties)
- cuda.core._memory._managed_memory_options module and its four empty
  AdviseOptions / PrefetchOptions / DiscardOptions /
  DiscardPrefetchOptions dataclasses (R9 from Leo, R10 from Andy:
  empty placeholders that didn't carry information)
- options=None parameter from every public surface
- The single-buffer fast path inside the now-batched-only free
  functions; they always hit cuMem*BatchAsync now

Internals:
- Public def advise() deleted; _advise_one (cdef) is the new internal
  single-buffer entry point used by ManagedBuffer property setters.
- Three new Python-level wrappers _do_single_prefetch_py /
  _do_single_discard_py / _do_single_discard_prefetch_py used by
  ManagedBuffer instance methods. These call the cdef _do_single_*
  helpers with the right Cython types after stream coercion.
- _coerce_buffer_targets renamed to _coerce_batch_buffers; rejects a
  single Buffer with a TypeError pointing at the ManagedBuffer method.

Tests:
- TestPrefetch / TestDiscard / TestDiscardPrefetch / TestAdvise
  rewritten as TestPrefetchBatch / TestDiscardBatch /
  TestDiscardPrefetchBatch (batched-only, since single-buffer is
  covered by ManagedBuffer's TestManagedBuffer class)
- Single-buffer external-allocation tests use
  ManagedBuffer.from_handle(plain.handle, plain.size, owner=plain)
  to wrap a DummyUnifiedMemoryResource buffer
- options-related tests deleted (no options surface to test)
- enum-value advise test deleted (property setters are typed; the
  string-alias / enum-value internal API isn't user-visible)

Release notes updated.

Closes R9, R10, R11.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ad (N4)

Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:23), drop the
lazy-init plumbing for the enum→alias reverse lookup table. The forward
table _MANAGED_ADVICE_ALIASES has six entries; building the inverse at
module load via a dict comprehension is the same data without the
mutable-global pattern, the `if None` check, or the `global` declaration
inside the function body.

Forward lookup table (_MANAGED_ADVICE_ALIASES) is preserved as the source
of truth — explicit alias→CUDA-name mapping, grep-friendly, no implicit
naming-convention coupling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d_prefetch} (N2)

Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:425), the two
batched-with-locations helpers were byte-for-byte identical except for
the driver function being called. Both:
- declare the same four std::vectors (ptrs, sizes, loc_arr, loc_indices)
- resize and fill them in the same loop
- release the GIL and call cuMem{Prefetch,DiscardAndPrefetch}BatchAsync
  with the same argument shape

Introduce a function-pointer typedef _BatchPrefetchFn (the two driver
calls share signature), parameterize the shared body as
_do_batch_prefetch_op, and have the two callers pass the appropriate
driver function. Both the typedef and the helper live inside the
IF CUDA_CORE_BUILD_MAJOR >= 13 block since they reference cu13-only
types.

Net: -28 lines duplication, +25 for the shared helper. No behavior
change; tests unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
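In Python terms the refactor amounts to parameterizing the shared body over the driver call (stub driver functions below; the real code uses a Cython function-pointer typedef inside the CUDA 13 block):

```python
def _do_batch_prefetch_op(driver_fn, ptrs, sizes, locs, stream):
    # shared body: validate and assemble the parallel arrays once, then
    # dispatch to whichever BatchAsync-style entry point was passed in
    assert len(ptrs) == len(sizes) == len(locs)
    return driver_fn(ptrs, sizes, locs, stream)

def _cuMemPrefetchBatchAsync(ptrs, sizes, locs, stream):            # stub
    return ("prefetch", len(ptrs))

def _cuMemDiscardAndPrefetchBatchAsync(ptrs, sizes, locs, stream):  # stub
    return ("discard_prefetch", len(ptrs))

def batch_prefetch(ptrs, sizes, locs, stream=None):
    return _do_batch_prefetch_op(_cuMemPrefetchBatchAsync, ptrs, sizes, locs, stream)

def batch_discard_prefetch(ptrs, sizes, locs, stream=None):
    return _do_batch_prefetch_op(_cuMemDiscardAndPrefetchBatchAsync,
                                 ptrs, sizes, locs, stream)
```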
…ts (N6)

Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:28), the test
file's _get_mem_range_attr / _get_int_mem_range_attr / the local
_MEM_RANGE_ATTRIBUTE_VALUE_SIZE constant are functionally identical
to the production _get_int_attr in _managed_buffer.py. Drop the
duplicates and import the production helper.

14 call sites updated. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:228), raising
NotImplementedError on cu12 forces users to write their own loop. The
CUDA driver semantics for cuMemPrefetchBatchAsync are equivalent to
per-range cuMemPrefetchAsync calls — just more efficient when batched
at the driver level.

On cu12 builds (where cuMemPrefetchBatchAsync is not exposed), fall
back to a Python-level loop calling cuMemPrefetchAsync per buffer.
The single-range path (_do_single_prefetch) already works on cu12
via the IF/ELSE split inside it.

Note this fallback applies only to prefetch_batch — discard_batch and
discard_prefetch_batch keep the cu12 NotImplementedError because the
driver has no single-range cuMemDiscard{,AndPrefetch}Async to fall
back to.

Test skips for cuMemPrefetchBatchAsync unavailability dropped from
TestPrefetchBatch.test_same_location and test_per_buffer_location;
the fallback path now runs on cu12 builds too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
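The shape of that fallback, sketched in Python (stub names; the real code gates on the compile-time CUDA major version via IF/ELSE in Cython):

```python
CUDA_MAJOR = 12   # pretend this is a cu12 build for the sketch

def _prefetch_single(buf, loc, stream):
    # per-range path (cuMemPrefetchAsync); works on every supported toolkit
    return ("single", buf, loc)

def _prefetch_batch_driver(bufs, locs, stream):
    # cuMemPrefetchBatchAsync; only exposed on CUDA 13+ builds
    return ("batch", len(bufs))

def prefetch_batch(bufs, locs, stream=None):
    if CUDA_MAJOR >= 13:
        return _prefetch_batch_driver(bufs, locs, stream)
    # cu12: batch semantics are documented as equivalent to per-range calls,
    # so loop over singles instead of raising NotImplementedError
    return [_prefetch_single(b, l, stream) for b, l in zip(bufs, locs)]
```

As the commit notes, this only works for prefetch; discard and discard+prefetch have no single-range driver call to loop over, so they keep the cu12 NotImplementedError.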
Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:1), add a test for
the read side of AccessedBySet: __iter__, __len__, __eq__, __repr__.
These are part of the public set-like API (alongside __contains__,
add(), discard(), and the setter, which are already covered) but
were untested.

The cu12 batch fallback path (Leo's other coverage point) is now
exercised by TestPrefetchBatch.test_same_location and
test_per_buffer_location running on cu12 CI — the
cuMemPrefetchBatchAsync skip was dropped in d75a7bd when the
fallback landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation (N8)

Per the self-promised reply on PR NVIDIA#1775's R7 thread, fulfill the
Host(numa_id=N) round-trip on CUDA 13 builds.

The blocker before was that cuda.bindings's Python-level
cuMemRangeGetAttribute wrapper rejects the new
CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION_TYPE / _ID attributes via
its allowlist. The workaround: call cydriver.cuMemRangeGetAttribute
directly from a new Cython helper _read_preferred_location_v2,
bypassing the Python wrapper.

The helper queries TYPE then ID, then decodes the (kind, id) pair into
Device | Host | Host(numa_id=N) | Host.numa_current() | None.

ManagedBuffer.preferred_location getter dispatches to the v2 path on
binding_version() >= (13, 0, 0); falls back to the legacy single-int
attribute on cu12 (no NUMA info available).
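The decode step can be sketched as follows. This is a hedged illustration of what `_read_preferred_location_v2` does with the queried pair; the enum values and the tuple-based "location objects" are stand-ins, not the real cydriver enums or cuda.core classes.

```python
# Stand-in location-type codes (the real ones are CUmemLocationType values).
DEVICE, HOST, HOST_NUMA, HOST_NUMA_CURRENT, INVALID = range(5)

def decode_preferred_location(kind, ident):
    """Map the queried (TYPE, ID) pair to a location-like result."""
    if kind == DEVICE:
        return ("Device", ident)        # Device(ident)
    if kind == HOST:
        return ("Host", None)           # Host() -- any host
    if kind == HOST_NUMA:
        return ("Host", ident)          # Host(numa_id=ident)
    if kind == HOST_NUMA_CURRENT:
        return ("Host", "current")      # Host.numa_current()
    return None                         # no preferred location set

assert decode_preferred_location(DEVICE, 0) == ("Device", 0)
assert decode_preferred_location(HOST_NUMA, 2) == ("Host", 2)
assert decode_preferred_location(INVALID, 0) is None
```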

Test:
- TestManagedBuffer.test_preferred_location_roundtrip already exercises
  the cu13 v2 path for Device(...) and Host() (no NUMA), which now
  passes through _read_preferred_location_v2.
- New test_preferred_location_roundtrip_host_numa exercises Host(numa_id=0)
  round-trip; skips on cu12, and also skips on cu13 hardware/drivers
  where set_preferred_location with HOST_NUMA is not preserved (e.g.
  single-NUMA test machines).

ManagedBuffer class docstring updated to note the cu12 limitation
(no NUMA information available).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rparolin
Collaborator Author

rparolin commented May 1, 2026

@leofang addressed all your feedback.

@rparolin
Collaborator Author

rparolin commented May 1, 2026

Code review

Found 2 issues:

  1. cuda_core/docs/source/api.rst documents 8 names that don't exist anywhere in the codebase: free functions advise, prefetch, discard, discard_prefetch and dataclasses AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions. The actual exports from cuda.core.utils are only prefetch_batch, discard_batch, discard_prefetch_batch (plus the pre-existing StridedMemoryView and args_viewable_as_strided_memory). Sphinx will fail to resolve these autosummary entries. This looks like a leftover from the earlier API surface that was removed under R9/R11; the release notes were updated, but api.rst was missed.

https://github.com/NVIDIA/cuda-python/blob/b0d1a216e3932468d3801da5d83449465b3f8faf/cuda_core/docs/source/api.rst#L248-L268

  2. The CUDA 12 ELSE fallbacks for the single-buffer advise and prefetch paths in _managed_memory_ops.pyx reference cydriver.cuMemAdvise and cydriver.cuMemPrefetchAsync with the legacy v1 4-argument int-device signature, but those public cydriver wrappers are only emitted when cuMemAdvise_v2 / cuMemPrefetchAsync_v2 are present in the toolkit headers (CUDA 13+). On a cu12 build the symbols don't exist (see cuda_bindings/cuda/bindings/cydriver.pxd.in:3840 and cydriver.pyx.in:1237-1247, both gated on 'cuMem*_v2' in found_functions), so the ELSE branch will fail to compile / import. The batched paths handle cu12 via a NotImplementedError + per-range loop, but the single-buffer paths assume a v1 wrapper that cuda-python never exposes.

            cu_loc.type = cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_HOST
            cu_loc.id = 0
        else:
            cu_loc = _to_cumemlocation(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemAdvise(cu_ptr, nbytes, advice_enum, cu_loc))
    ELSE:
        cdef int dev_int = -1 if loc is None else _to_legacy_device(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemAdvise(cu_ptr, nbytes, advice_enum, dev_int))


def prefetch_batch(buffers, locations, *, stream):
    """Prefetch a batch of managed-memory ranges to target locations.

    Requires CUDA 13+. For a single buffer, use
    :meth:`ManagedBuffer.prefetch` instead.

    Parameters
    ----------
    buffers : Sequence[:class:`Buffer`]
        Two or more managed allocations to operate on.
    locations : :class:`~cuda.core.Device` | :class:`~cuda.core.Host` | Sequence[...]
        Target location(s). A single location applies to all buffers; a
        sequence must match ``len(buffers)``.
    stream : :class:`~_stream.Stream` | :class:`~graph.GraphBuilder`
        Stream for the asynchronous prefetch (keyword-only).

    Notes
    -----
    On a CUDA 12 build, falls back to a Python-level loop calling
    ``cuMemPrefetchAsync`` per buffer (no batched driver entry point on
    CUDA 12). CUDA 13 builds use ``cuMemPrefetchBatchAsync`` directly.
    """
    cdef tuple bufs = _coerce_batch_buffers(buffers, "prefetch_batch")
    cdef Py_ssize_t n = len(bufs)
    cdef tuple locs = _broadcast_locations(locations, n, False, "prefetch_batch")
    cdef Stream s = Stream_accept(stream)
    cdef Buffer buf
    for buf in bufs:
        _require_managed_buffer(buf, "prefetch_batch")
    _do_batch_prefetch(bufs, locs, s)


def _do_single_prefetch_py(Buffer buf, location, stream):
    """Internal: single-buffer prefetch for ManagedBuffer.prefetch().

    Uses cuMemPrefetchAsync (works on CUDA 12 and 13).
    """
    _require_managed_buffer(buf, "prefetch")
    cdef object loc = _coerce_location(location, allow_none=False)
    cdef Stream s = Stream_accept(stream)
    _do_single_prefetch(buf, loc, s)


cdef void _do_single_prefetch(Buffer buf, object loc, Stream s):
    cdef cydriver.CUdeviceptr cu_ptr = as_cu(buf._h_ptr)
    cdef size_t nbytes = buf._size
    cdef cydriver.CUstream hstream = as_cu(s._h_stream)
    IF CUDA_CORE_BUILD_MAJOR >= 13:
        cdef cydriver.CUmemLocation cu_loc = _to_cumemlocation(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPrefetchAsync(cu_ptr, nbytes, cu_loc, 0, hstream))
    ELSE:
        cdef int dev_int = _to_legacy_device(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPrefetchAsync(cu_ptr, nbytes, dev_int, hstream))
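The location-broadcasting contract stated in the docstring (a single location fans out to every buffer; a sequence must match the buffer count) can be sketched in pure Python. This is a guess at the shape of `_broadcast_locations`, not the real helper: the Cython version also takes an allow-none flag, and this simplified sketch treats only lists and tuples as sequences.

```python
def broadcast_locations(locations, n, name="prefetch_batch"):
    """Fan a scalar location out to n buffers, or validate a sequence."""
    if isinstance(locations, (list, tuple)):
        if len(locations) != n:
            raise ValueError(
                f"{name}: expected {n} locations, got {len(locations)}"
            )
        return tuple(locations)
    # Scalar location: apply it to all n buffers.
    return (locations,) * n

assert broadcast_locations("gpu0", 3) == ("gpu0", "gpu0", "gpu0")
assert broadcast_locations(["gpu0", "host"], 2) == ("gpu0", "host")
```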

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.


Labels

cuda.core: Everything related to the cuda.core module
feature: New feature or request
P1: Medium priority - Should do

5 participants