Add managed-memory advise, prefetch, and discard-prefetch free functions #1775

Open
rparolin wants to merge 62 commits into NVIDIA:main from rparolin:rparolin/managed_mem_advise_prefetch

Conversation

@rparolin
Collaborator

@rparolin rparolin commented Mar 17, 2026

Summary

Adds managed-memory range operations to cuda.core:

  • Free functions in cuda.core.utils: advise, prefetch, discard, discard_prefetch. Each accepts either a single Buffer or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding cuMem*BatchAsync (CUDA 13+).
  • Host — new top-level class symmetric to Device. Host() (any host), Host(numa_id=N), Host.numa_current(). Used together with Device to express managed-memory locations.
  • ManagedBuffer — Buffer subclass returned by ManagedMemoryResource.allocate. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with Buffer.from_handle(...) (now a @classmethod, so ManagedBuffer.from_handle(...) returns a ManagedBuffer).
  • *Options dataclasses — AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions. Frozen dataclasses reserved for future per-call flags; the current ABI has no flags worth surfacing, but the dataclasses establish the contract so future flags land without an API break.

Closes #1332. Addresses the managed-memory portion of #1333 (P1: cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync). The P0 cuMemcpyBatchAsync from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming cuMemcpyBatchAsync work can mirror it.

Public API

ManagedBuffer — property-style advice on managed allocations

ManagedMemoryResource.allocate returns a ManagedBuffer (a Buffer subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

from cuda.core import Device, Host, ManagedMemoryResource

mr = ManagedMemoryResource()
buf = mr.allocate(size)                # ManagedBuffer

# Driver-backed properties — getter queries the driver, setter calls cuMemAdvise.
buf.read_mostly = True
buf.preferred_location = Device(0)     # or Host(), or Host(numa_id=N)
buf.preferred_location = None          # unset

# Live set-like view of `set_accessed_by` advice.
buf.accessed_by.add(Device(1))
buf.accessed_by.discard(Device(1))
buf.accessed_by = {Device(0), Device(1)}   # diff vs current; advise only deltas

# Instance methods delegate to the matching free functions.
buf.prefetch(Device(0), stream=stream)
buf.discard(stream=stream)
buf.discard_prefetch(Device(0), stream=stream)

Note: the legacy cuMemRangeGetAttribute query path returns integer device ordinals, so Host(numa_id=...) collapses to a generic Host() on read-back. Setters preserve full NUMA information when issuing advice.

Free functions — advise / prefetch / discard / discard_prefetch

Each accepts a Buffer (or ManagedBuffer) or a sequence of them. Locations are expressed via Device, Host, or int (-1 → host, >=0 → device ordinal).

from cuda.core import Device, Host
from cuda.core.utils import advise, prefetch, discard, discard_prefetch

# Stage to GPU, kernel, bring back to host
prefetch(buf, Device(0), stream=stream)
launch_my_kernel(buf, stream=stream)
prefetch(buf, Host(), stream=stream)
stream.sync()
result = bytes(buf)

# int shorthand: -1 = host, >=0 = device ordinal
prefetch(buf, -1, stream=stream)

# Advice
advise(weights, "set_read_mostly")
advise(activations, "set_preferred_location", Device(0))
advise(scratch, "set_accessed_by", Device(0))

# Discard / discard+prefetch (CUDA 13+)
discard(scratch, stream=stream)
for step in range(num_steps):
    discard_prefetch(activations, Device(0), stream=stream)
    run_forward(activations, stream=stream)

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding cuMem*BatchAsync. Sequence locations are paired by index; a scalar location broadcasts to every target.

# Pair-by-index: output → GPU 0, log_metrics → host
prefetch(
    [output, log_metrics],
    [Device(0), Host()],
    stream=stream,
)

# Scalar broadcast: every shard moves to GPU 0
prefetch([shard_a, shard_b, shard_c], Device(0), stream=stream)

Mismatched sequence lengths raise ValueError. On a CUDA 12 build of cuda.core, N>1 raises NotImplementedError (the *BatchAsync entry points are CUDA 13+); N==1 works on every supported toolkit.
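The dispatch and broadcast rules above can be sketched in plain Python (a simplified model with stand-in Buf, _prefetch_one, and _prefetch_batch names — not the real cuda.core implementation):

```python
from collections.abc import Sequence

class Buf:
    """Stand-in for a cuda.core Buffer in this sketch."""
    def __init__(self, name):
        self.name = name

calls = []

def _prefetch_one(buf, loc, stream):
    # stand-in for the per-range driver path (cuMemPrefetchAsync)
    calls.append(("one", buf.name, loc))

def _prefetch_batch(bufs, locs, stream):
    # stand-in for the batched driver path (cuMemPrefetchBatchAsync, CUDA 13+)
    calls.append(("batch", [b.name for b in bufs], list(locs)))

def prefetch(buffers, locations, *, stream=None):
    if isinstance(buffers, Buf):                     # single buffer → per-range path
        return _prefetch_one(buffers, locations, stream)
    buffers = list(buffers)
    if not isinstance(locations, Sequence):          # scalar location broadcasts
        locations = [locations] * len(buffers)
    if len(buffers) != len(locations):
        raise ValueError("buffers and locations must have the same length")
    if len(buffers) == 1:                            # N==1 still uses the per-range path
        return _prefetch_one(buffers[0], locations[0], stream)
    return _prefetch_batch(buffers, locations, stream)   # N>1 → batch entry point

prefetch(Buf("weights"), 0)                # single buffer, device ordinal 0
prefetch([Buf("a"), Buf("b")], 0)          # scalar broadcast → batch path
```

The same shape covers pair-by-index locations: passing a sequence of locations skips the broadcast step and only the length check applies.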

Putting it together

weights = mr.allocate(weights_size)    # ManagedBuffer
inputs  = mr.allocate(inputs_size)
output  = mr.allocate(output_size)

# One-time hints (property API on ManagedBuffer)
weights.read_mostly = True
weights.preferred_location = Device(0)
output.preferred_location = Device(0)

# Per inference
inputs.prefetch(Device(0), stream=stream)
run_inference(weights, inputs, output, stream=stream)
output.prefetch(Host(), stream=stream)
stream.sync()

Implementation notes

  • Cython implementation in cuda_core/cuda/core/_memory/_managed_memory_ops.pyx uses cimport cydriver for direct C-level driver calls.
  • The CUDA 12 / 13 ABI split for cuMemAdvise and cuMemPrefetchAsync is handled at compile time with IF CUDA_CORE_BUILD_MAJOR >= 13: / ELSE:.
  • Batched entry points (cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise NotImplementedError; single-buffer calls work everywhere.
  • ManagedBuffer is a pure-Python subclass of the Cython Buffer cdef class. Buffer.from_handle is now a @classmethod (was @staticmethod) so MyBufferSubclass.from_handle(...) returns the typed instance via cls._init. Buffer_from_deviceptr_handle and _MP_allocate thread an optional cls parameter so ManagedMemoryResource.allocate materializes a ManagedBuffer.
  • Internal _LocSpec (in _managed_location.py) carries the (kind, id) discriminator that the Cython layer maps to CUmemLocation (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only Device / Host / int; _coerce_location produces the internal record.
  • _buffer.pyx collapses out.is_managed = (is_managed != 0) to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by CU_POINTER_ATTRIBUTE_IS_MANAGED.
  • The defensive cuInit retry in _query_memory_attrs was removed; we don't auto-init CUDA elsewhere.

Tests

Managed-memory tests live in cuda_core/tests/memory/test_managed_ops.py: free-function dispatch (single + batched + mismatch + non-managed rejection), Host constructors and frozen-dataclass semantics, internal _coerce_location for Device | Host | int | None, full ManagedBuffer property roundtrips (read_mostly, preferred_location, accessed_by add/discard/assignment), and instance methods. The broader memory-tests reorg (buffer / managed_resource / pinned / vmm "siblings") is tracked as a separate cleanup PR.

Deferred follow-ups

  • HMM/ATS-aware is_managed semantics — flagged as a TODO in _buffer.pyx, tracked alongside the broader HMM/ATS work.
  • cuMemcpyBatchAsync (P0 of Support batched memory movement #1333) — different family, separate PR; will mirror the contract in #issuecomment-4355502334.
  • Concrete fields on the *Options dataclasses — they're empty today; concrete options land when CUDA introduces per-call flags worth surfacing.
  • CUDA 13 split-attribute read-back for preferred_location / accessed_by — currently uses the legacy combined attribute (Python binding limitation), which loses NUMA fidelity on round-trip. Setters preserve full NUMA info.
@copy-pr-bot
Contributor

copy-pr-bot Bot commented Mar 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@rparolin rparolin requested a review from Andy-Jost March 17, 2026 00:41
@rparolin rparolin self-assigned this Mar 17, 2026
@rparolin rparolin added this to the cuda.core v0.7.0 milestone Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:45
@rparolin rparolin marked this pull request as draft March 17, 2026 00:45
@rparolin rparolin changed the title wip Mar 17, 2026
@rparolin rparolin marked this pull request as ready for review March 17, 2026 00:57
@rparolin
Collaborator Author

/ok to test

@jrhemstad

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

@rparolin
Collaborator Author

rparolin commented Mar 17, 2026

question: Does making these member functions of the Buffer type preclude this functionality for allocations that weren't created through the Buffer type? Did we consider making these free functions instead of member functions on the Buffer type?

I'm moving this back into draft. We discussed this in our team meeting; I was already hesitant because Buffer is becoming a 'God object' with the functionality it is gaining. We were going to explore alternatives, and free functions sound like a good one to explore.

@rparolin rparolin marked this pull request as draft March 17, 2026 19:35
@rparolin rparolin marked this pull request as ready for review March 17, 2026 23:46
rparolin and others added 7 commits March 17, 2026 17:30
…ups, fix docs

- Remove duplicate long-form "cu_mem_advise_*" string aliases from
  _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly
- Replace 4 boolean allow_* params in _normalize_managed_location with a
  single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES
- Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag,
  discard_prefetch support, and advice enum-to-alias reverse map
- Collapse hasattr+getattr to single getattr in _managed_location_enum
- Move _require_managed_discard_prefetch_support to top of discard_prefetch
  for fail-fast behavior
- Fix docs build: reset Sphinx module scope after managed_memory section in
  api.rst so subsequent sections resolve under cuda.core
- Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path

The _V2_BINDINGS cache in _buffer.pyx persists across tests, so
monkeypatching get_binding_version alone is insufficient when earlier
tests have already populated the cache with the v2 value. Promote
_V2_BINDINGS from cdef int to a Python-level variable so tests can
monkeypatch it directly via monkeypatch.setattr, and reset it to -1
in both legacy-signature tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware

These three tests call cuMemAdvise on real CUDA devices and verify
memory range attributes. On devices without concurrent_managed_access
(e.g. Windows/WDDM), set_read_mostly silently no-ops and
set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the
stricter _skip_if_managed_location_ops_unsupported guard, matching the
pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support

Reorder checks in discard_prefetch so _normalize_managed_target_range
runs before _require_managed_discard_prefetch_support. This ensures
non-managed buffers raise ValueError before the RuntimeError for missing
cuMemDiscardAndPrefetchBatchAsync support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module

Move advise, prefetch, and discard_prefetch functions and their helpers
out of _buffer.pyx into a new _managed_memory_ops Cython module to
improve separation of concerns. Expose _init_mem_attrs and
_query_memory_attrs as non-inline cdef functions in _buffer.pxd so the
new module can reuse them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Buffer.from_handle is now a classmethod that dispatches via cls._init,
  so subclasses inherit it: ManagedBuffer.from_handle(...) returns a
  ManagedBuffer with no override needed. Drop ManagedBuffer.from_handle.
- Hoist `advise / prefetch / discard / discard_prefetch` imports from
  per-method lazy imports to module-level (no circular import: they live
  in cuda.core._memory._managed_memory_ops, not cuda.core.utils).
- Cache the CUmem_advise and CUmem_range_attribute enum lookups at
  module level and pass enum constants directly to advise() instead of
  re-resolving from string aliases on every property write.
- Extract _query_accessed_by as a module-level helper; AccessedBySet
  delegates and the accessed_by setter calls it directly instead of
  constructing a throwaway view.
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
@leofang leofang requested review from Andy-Jost and leofang April 30, 2026 22:24
Comment thread cuda_core/cuda/core/_memory/_managed_memory_options.py Outdated
Member

@leofang leofang left a comment


I need to run. Will try to revisit tonight. I haven't finished reviewing (too many lines).

Comment thread cuda_core/cuda/core/_host.py Outdated
from dataclasses import dataclass


@dataclass(frozen=True)
Member


I dunno if this is a collective agentic illusion or what, in recent PRs I've seen many data classes. Why do we need one here?

Collaborator Author


This is what I got when I asked Claude why it selected a dataclass. I'll remove that decorator.

"A. Reply only — defend the dataclass

The dataclass is doing real work, not decoration:

  • eq/hash are tested behaviors — tests/memory/test_managed_ops.py:323-327 asserts Host() == Host(), hash(Host(numa_id=1)) ==
    hash(Host(numa_id=1)), etc.
  • Host is used in set/list comparisons in _managed_buffer.py:49 ([Host() if v == -1 else Device(v) for v in raw...]) — needs hashability if
    it ever lands in a set.
  • frozen=True ensures users can't mutate a Host after stashing it on ManagedBuffer.preferred_location."
Member


My bot reviewed and raised this idea: Host should follow Device and be a singleton class. @Andy-Jost thoughts?

Comment thread cuda_core/cuda/core/_host.py Outdated
Comment thread cuda_core/cuda/core/_memory/_buffer.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_buffer.pyx
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py Outdated
Comment thread cuda_core/cuda/core/utils.py Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_options.py Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_buffer.py
Comment thread cuda_core/cuda/core/_memory/_managed_memory_resource.pyx
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Contributor

@Andy-Jost Andy-Jost left a comment


None of my comments are blocking. Looks ready to me.

rparolin and others added 5 commits April 30, 2026 17:26
Per Andy's review nit (PR NVIDIA#1775, _managed_memory_ops.pyx:207), replace
the manual PyMem_Malloc / PyMem_Free pattern in the three batch helpers
(_do_batch_discard, _do_batch_prefetch, _do_batch_discard_prefetch)
with libcpp.vector. RAII handles cleanup, eliminating the manual
try/finally and removing a leak window if _to_cumemlocation raised
mid-fill. Matches the precedent used in _program.pyx, _linker.pyx,
_kernel_arg_handler.pyx, _graph_node.pyx, and others.

Net change: 53 insertions, 85 deletions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y_memory_attrs (R4)

Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:455), restore the auto-init
retry that was removed in 10de998. cuPointerGetAttributes is the
first driver call _query_memory_attrs makes, and a NOT_INITIALIZED
result here would otherwise propagate out of every is_managed /
is_host_accessible / is_device_accessible query before the user has
called any other Device API.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_host.py:9), drop the @DataClass(frozen=True)
in favor of a hand-written class with property accessors. Matches Leo's
original sketch from the 2026-04-28 drive-by comment and aligns with
how Device is structured in this codebase.

Behavior preserved: Host(), Host(numa_id=N), and Host.numa_current()
all work identically. __eq__, __hash__, and immutability are
hand-rolled rather than dataclass-generated.

is_numa_current is no longer an __init__ kwarg — it's internal state
settable only via the Host.numa_current() classmethod. Two existing
TestHost cases updated:
  - test_numa_current_with_id_rejected → test_numa_current_only_via_classmethod
  - test_frozen → test_immutable (AttributeError instead of FrozenInstanceError)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
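As a rough Python model of the hand-rolled class this commit describes (illustrative stand-in code with assumed attribute names, not the actual _host.py):

```python
class Host:
    """Immutable host location: Host(), Host(numa_id=N), or Host.numa_current()."""
    __slots__ = ("_numa_id", "_is_numa_current")

    def __init__(self, numa_id=None):
        # __setattr__ is blocked below, so initialize via object.__setattr__
        object.__setattr__(self, "_numa_id", numa_id)
        object.__setattr__(self, "_is_numa_current", False)

    @classmethod
    def numa_current(cls):
        # is_numa_current is internal state, settable only via this classmethod
        self = cls()
        object.__setattr__(self, "_is_numa_current", True)
        return self

    @property
    def numa_id(self):
        return self._numa_id

    def __setattr__(self, name, value):
        # hand-rolled immutability: AttributeError instead of FrozenInstanceError
        raise AttributeError(f"Host is immutable; cannot set {name!r}")

    def __eq__(self, other):
        return (isinstance(other, Host)
                and self._numa_id == other._numa_id
                and self._is_numa_current == other._is_numa_current)

    def __hash__(self):
        return hash((Host, self._numa_id, self._is_numa_current))
```

This keeps the dataclass-era behaviors (equality, hashability, immutability) while dropping the decorator, matching the Device-style structure the commit mentions.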
… (R6, R8)

Per Leo's review on PR NVIDIA#1775 (_managed_buffer.py:165) and Andy's
parallel question (line 144), drop the `int` shorthand for
prefetch/discard_prefetch/advise locations. The previous design
accepted `Device | Host | int` where `int >= 0` meant a device ordinal
and `-1` magically meant host. With first-class `Device` and `Host`,
the int form was redundant and the `-1 → Host` magic was surprising.

Public API change:
  prefetch(buf, Device(0), stream=...)   # was: prefetch(buf, 0, stream=...)
  prefetch(buf, Host(),    stream=...)   # was: prefetch(buf, -1, stream=...)

This also resolves an inconsistency: ManagedBuffer.preferred_location
already accepted only Device | Host | None, but prefetch() and
discard_prefetch() accepted int. Now uniformly Device | Host.

Pre-1.0 breaking change. Anyone using the int shorthand should switch
to the explicit Device(N) / Host() form.

Files touched:
- _managed_location.py: drop the int branch from _coerce_location;
  TypeError now reads "Device, Host, or None"
- _managed_buffer.py: type signatures `Device | Host | int` → `Device | Host`
- _managed_memory_ops.pyx: docstring updates (3 occurrences)
- tests/memory/test_managed_ops.py: replace int call sites with
  Host()/Device(N); collapse three int-branch tests into one
  test_int_rejected
- 1.0.0-notes.rst: drop the "int values are also accepted" sentence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Andy's review on PR NVIDIA#1775 (_managed_buffer.py:52), document
`AccessedBySet` in the private API reference. It is returned by
`ManagedBuffer.accessed_by` but not directly instantiable by users —
matches the existing `_memory._ipc.*` entries in the same section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member

@leofang leofang left a comment


Code-level review focusing on DRY, verbosity, and test coverage — complementing the design-level comments already on the PR.

Summary of inline comments:

  1. _do_batch_prefetch / _do_batch_discard_prefetch are copy-pasted (~40 lines) — parameterize into one function
  2. Options isinstance check repeated 4x — extract a one-liner helper; also consolidate the near-identical prefetch / discard_prefetch preambles
  3. CUDA 12 batch fallback — loop over singles instead of NotImplementedError (the batch semantics are documented as equivalent to individual calls)
  4. _normalize_managed_advice over-engineered — the alias dict + lazy reverse dict can be replaced with a getattr on the naming convention (~15 lines saved)
  5. Test setup boilerplate — ~25 tests repeat the same 5-line preamble; a pytest fixture would save ~75 lines
  6. Test helper duplicated — _get_mem_range_attr in the test is identical to _get_int_attr in production code
  7. Test coverage gaps — CUDA 12 batch fallback, AccessedBySet iteration, stream=None, error message assertions
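Point 4's suggestion (replace the alias dict with a getattr on the naming convention) could look roughly like this; the enum here is a stub standing in for the real CUmem_advise binding:

```python
import enum

class CUmem_advise(enum.IntEnum):
    # stub mirroring the CUDA driver's CUmem_advise member names and values
    CU_MEM_ADVISE_SET_READ_MOSTLY = 1
    CU_MEM_ADVISE_UNSET_READ_MOSTLY = 2
    CU_MEM_ADVISE_SET_PREFERRED_LOCATION = 3
    CU_MEM_ADVISE_UNSET_PREFERRED_LOCATION = 4
    CU_MEM_ADVISE_SET_ACCESSED_BY = 5
    CU_MEM_ADVISE_UNSET_ACCESSED_BY = 6

def normalize_advice(advice):
    """Map 'set_read_mostly' → CU_MEM_ADVISE_SET_READ_MOSTLY via the naming convention."""
    if isinstance(advice, CUmem_advise):
        return advice
    member = getattr(CUmem_advise, f"CU_MEM_ADVISE_{advice.upper()}", None)
    if member is None:
        raise ValueError(f"unknown advice: {advice!r}")
    return member
```

The trade-off the thread later settles on: the convention-based lookup is shorter, while an explicit alias table is grep-friendly and keeps the short-string surface decoupled from the driver's naming.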
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx Outdated
Comment thread cuda_core/cuda/core/_memory/_managed_memory_ops.pyx


def test_managed_memory_prefetch_supports_managed_pool_allocations(init_cuda):
device = Device()
Member


nit: Nearly every test function in this file repeats the same 5-line preamble:

device = Device()
_skip_if_managed_location_ops_unsupported(device)  # or variant
device.set_current()
mr = create_managed_memory_resource_or_skip()
buffer = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)

This appears in ~25 tests. A @pytest.fixture would eliminate this boilerplate and save ~75 lines:

@pytest.fixture
def managed_buffer(init_cuda):
    device = Device()
    _skip_if_managed_location_ops_unsupported(device)
    device.set_current()
    mr = create_managed_memory_resource_or_skip()
    buf = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)
    yield buf
    buf.close()
    mr.close()

Tests that need a different skip level (e.g. _skip_if_managed_discard_prefetch_unsupported) could use a second fixture or parametrize. The local _skip_if_* helpers also partially overlap with conftest's skip_if_managed_memory_unsupported — worth consolidating.

Comment thread cuda_core/tests/memory/test_managed_ops.py Outdated
Comment thread cuda_core/tests/memory/test_managed_ops.py
rparolin and others added 10 commits April 30, 2026 18:27
…red_location (R2, R7)

Per Leo's questions on PR NVIDIA#1775 (_host.py:26 and _managed_buffer.py:140):

R2 (Host numa_id): the dataclass surface is intentional. Three forms
already cover the use cases — Host() / Host(numa_id=N) /
Host.numa_current(). Auto-inferring numa_id at Host() construction
would conflict with the "generic host" semantic.

R7 (preferred_location getter): the underlying limitation is real but
upstream-blocked. The legacy CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION
returns only a single int (device id, -1 host, -2 none) — no NUMA. CUDA
13 added _PREFERRED_LOCATION_TYPE / _ID for full round-trip, and they
are exposed in cydriver, but cuda.bindings'
_HelperCUmem_range_attribute does not yet recognize them — calling
driver.cuMemRangeGetAttribute with the new attributes raises
"Unsupported attribute". Once cuda.bindings adds them, this getter can
query the v2 attributes and return Host(numa_id=N).

Add a docstring note documenting the limitation so users aren't
surprised by the lossy round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…12, R13)

Per Andy's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:102 and :118),
replace `isinstance(x, (list, tuple))` with `isinstance(x, Sequence)`
in `_coerce_buffer_targets` and `_broadcast_locations`. Matches the
existing precedent in `cuda.core._utils.cuda_utils.is_sequence()`.

The widened input set also accepts `str`, but neither `Buffer` nor
`Location` is stringly-typed, so a `str` input still raises — just
with a different message (Buffer cast error or Location TypeError
from `_coerce_location`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:135), make
Buffer.from_handle a @staticmethod that always returns Buffer.
Subclass-aware construction stays available via the private
@classmethod Buffer._init, which is what Leo asked for ("use a
private method for handling subclasses for now").

ManagedBuffer gains its own @classmethod from_handle that wraps
cls._init, so user-facing call sites like
ManagedBuffer.from_handle(ptr, size, owner=plain) continue to work
unchanged. The narrowly-scoped subclass factory is on the subclass
itself, not bolted onto Buffer's public surface.

This addresses R3's spirit: cuda.core's public APIs no longer
advertise generic subclass-construction support that conflicts
with the broader subclassing story tracked in NVIDIA#750 / NVIDIA#1989.

No test changes; behavior preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
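The split described here — public staticmethod on Buffer, subclass factory via a private classmethod — can be sketched as (plain-Python stand-ins, not the real Cython Buffer):

```python
class Buffer:
    def __init__(self, handle, size):
        self.handle, self.size = handle, size

    @classmethod
    def _init(cls, handle, size, owner=None):
        # private, subclass-aware constructor: cls() picks up the subclass
        obj = cls(handle, size)
        obj.owner = owner
        return obj

    @staticmethod
    def from_handle(handle, size):
        # public surface always returns a plain Buffer
        return Buffer._init(handle, size)

class ManagedBuffer(Buffer):
    @classmethod
    def from_handle(cls, handle, size, owner=None):
        # the subclass opts in explicitly by wrapping the private _init
        return cls._init(handle, size, owner=owner)
```

This keeps subclass construction off Buffer's public surface while ManagedBuffer.from_handle(...) still returns a typed ManagedBuffer.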
Per Leo's R11 ("if we prefer methods, don't expose free functions"):
each managed-memory operation now has exactly one public surface,
chosen by whether it acts on one buffer or many.

Single buffer (instance methods + properties on ManagedBuffer):
- buf.read_mostly = True
- buf.preferred_location = Device(0)
- buf.accessed_by.add(Device(1))
- buf.prefetch(Device(0), stream=stream)
- buf.discard(stream=stream)
- buf.discard_prefetch(Device(0), stream=stream)

Multiple buffers (free functions in cuda.core.utils, CUDA 13+ only):
- utils.prefetch_batch(buffers, locations, stream=stream)
- utils.discard_batch(buffers, stream=stream)
- utils.discard_prefetch_batch(buffers, locations, stream=stream)

Removed:
- cuda.core.utils.advise / prefetch / discard / discard_prefetch
  (single-buffer surfaces — replaced by ManagedBuffer methods/properties)
- cuda.core._memory._managed_memory_options module and its four empty
  AdviseOptions / PrefetchOptions / DiscardOptions /
  DiscardPrefetchOptions dataclasses (R9 from Leo, R10 from Andy:
  empty placeholders that didn't carry information)
- options=None parameter from every public surface
- The single-buffer fast path inside the now-batched-only free
  functions; they always hit cuMem*BatchAsync now

Internals:
- Public def advise() deleted; _advise_one (cdef) is the new internal
  single-buffer entry point used by ManagedBuffer property setters.
- Three new Python-level wrappers _do_single_prefetch_py /
  _do_single_discard_py / _do_single_discard_prefetch_py used by
  ManagedBuffer instance methods. These call the cdef _do_single_*
  helpers with the right Cython types after stream coercion.
- _coerce_buffer_targets renamed to _coerce_batch_buffers; rejects a
  single Buffer with a TypeError pointing at the ManagedBuffer method.

Tests:
- TestPrefetch / TestDiscard / TestDiscardPrefetch / TestAdvise
  rewritten as TestPrefetchBatch / TestDiscardBatch /
  TestDiscardPrefetchBatch (batched-only, since single-buffer is
  covered by ManagedBuffer's TestManagedBuffer class)
- Single-buffer external-allocation tests use
  ManagedBuffer.from_handle(plain.handle, plain.size, owner=plain)
  to wrap a DummyUnifiedMemoryResource buffer
- options-related tests deleted (no options surface to test)
- enum-value advise test deleted (property setters are typed; the
  string-alias / enum-value internal API isn't user-visible)

Release notes updated.

Closes R9, R10, R11.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ad (N4)

Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:23), drop the
lazy-init plumbing for the enum→alias reverse lookup table. The forward
table _MANAGED_ADVICE_ALIASES has six entries; building the inverse at
module load via a dict comprehension is the same data without the
mutable-global pattern, the `if None` check, or the `global` declaration
inside the function body.

Forward lookup table (_MANAGED_ADVICE_ALIASES) is preserved as the source
of truth — explicit alias→CUDA-name mapping, grep-friendly, no implicit
naming-convention coupling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d_prefetch} (N2)

Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:425), the two
batched-with-locations helpers were byte-for-byte identical except for
the driver function being called. Both:
- declare the same four std::vectors (ptrs, sizes, loc_arr, loc_indices)
- resize and fill them in the same loop
- release the GIL and call cuMem{Prefetch,DiscardAndPrefetch}BatchAsync
  with the same argument shape

Introduce a function-pointer typedef _BatchPrefetchFn (the two driver
calls share signature), parameterize the shared body as
_do_batch_prefetch_op, and have the two callers pass the appropriate
driver function. Both the typedef and the helper live inside the
IF CUDA_CORE_BUILD_MAJOR >= 13 block since they reference cu13-only
types.

Net: -28 lines duplication, +25 for the shared helper. No behavior
change; tests unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
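In Python terms the refactor amounts to parameterizing the shared body over the driver call (stub driver functions below; the real code uses a Cython function-pointer typedef inside the CUDA 13 block):

```python
def _do_batch_prefetch_op(driver_fn, ptrs, sizes, locs, stream):
    # shared body: validate and assemble the parallel arrays once, then
    # dispatch to whichever BatchAsync-style entry point was passed in
    assert len(ptrs) == len(sizes) == len(locs)
    return driver_fn(ptrs, sizes, locs, stream)

def _cuMemPrefetchBatchAsync(ptrs, sizes, locs, stream):            # stub
    return ("prefetch", len(ptrs))

def _cuMemDiscardAndPrefetchBatchAsync(ptrs, sizes, locs, stream):  # stub
    return ("discard_prefetch", len(ptrs))

def batch_prefetch(ptrs, sizes, locs, stream=None):
    return _do_batch_prefetch_op(_cuMemPrefetchBatchAsync, ptrs, sizes, locs, stream)

def batch_discard_prefetch(ptrs, sizes, locs, stream=None):
    return _do_batch_prefetch_op(_cuMemDiscardAndPrefetchBatchAsync,
                                 ptrs, sizes, locs, stream)
```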
…ts (N6)

Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:28), the test
file's _get_mem_range_attr / _get_int_mem_range_attr / the local
_MEM_RANGE_ATTRIBUTE_VALUE_SIZE constant are functionally identical
to the production _get_int_attr in _managed_buffer.py. Drop the
duplicates and import the production helper.

14 call sites updated. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:228), raising
NotImplementedError on cu12 forces users to write their own loop. The
CUDA driver semantics for cuMemPrefetchBatchAsync are equivalent to
per-range cuMemPrefetchAsync calls — just more efficient when batched
at the driver level.

On cu12 builds (where cuMemPrefetchBatchAsync is not exposed), fall
back to a Python-level loop calling cuMemPrefetchAsync per buffer.
The single-range path (_do_single_prefetch) already works on cu12
via the IF/ELSE split inside it.

Note this fallback applies only to prefetch_batch — discard_batch and
discard_prefetch_batch keep the cu12 NotImplementedError because the
driver has no single-range cuMemDiscard{,AndPrefetch}Async to fall
back to.

Test skips for cuMemPrefetchBatchAsync unavailability dropped from
TestPrefetchBatch.test_same_location and test_per_buffer_location;
the fallback path now runs on cu12 builds too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
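The shape of that fallback, sketched in Python (stub names; the real code gates on the compile-time CUDA major version via IF/ELSE in Cython):

```python
CUDA_MAJOR = 12   # pretend this is a cu12 build for the sketch

def _prefetch_single(buf, loc, stream):
    # per-range path (cuMemPrefetchAsync); works on every supported toolkit
    return ("single", buf, loc)

def _prefetch_batch_driver(bufs, locs, stream):
    # cuMemPrefetchBatchAsync; only exposed on CUDA 13+ builds
    return ("batch", len(bufs))

def prefetch_batch(bufs, locs, stream=None):
    if CUDA_MAJOR >= 13:
        return _prefetch_batch_driver(bufs, locs, stream)
    # cu12: batch semantics are documented as equivalent to per-range calls,
    # so loop over singles instead of raising NotImplementedError
    return [_prefetch_single(b, l, stream) for b, l in zip(bufs, locs)]
```

As the commit notes, this only works for prefetch; discard and discard+prefetch have no single-range driver call to loop over, so they keep the cu12 NotImplementedError.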
Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:1), add a test for
the read side of AccessedBySet: __iter__, __len__, __eq__, __repr__.
These are part of the public set-like API (alongside __contains__,
add(), discard(), and the setter, which are already covered) but
were untested.

The cu12 batch fallback path (Leo's other coverage point) is now
exercised by TestPrefetchBatch.test_same_location and
test_per_buffer_location running on cu12 CI — the
cuMemPrefetchBatchAsync skip was dropped in d75a7bd when the
fallback landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation (N8)

Per the self-promised reply on PR NVIDIA#1775's R7 thread, fulfill the
Host(numa_id=N) round-trip on CUDA 13 builds.

The blocker before was that cuda.bindings's Python-level
cuMemRangeGetAttribute wrapper rejects the new
CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION_TYPE / _ID attributes via
its allowlist. The workaround: call cydriver.cuMemRangeGetAttribute
directly from a new Cython helper _read_preferred_location_v2,
bypassing the Python wrapper.

The helper queries TYPE then ID, then decodes the (kind, id) pair into
Device | Host | Host(numa_id=N) | Host.numa_current() | None.

ManagedBuffer.preferred_location getter dispatches to the v2 path on
binding_version() >= (13, 0, 0); falls back to the legacy single-int
attribute on cu12 (no NUMA info available).
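The decode step can be sketched as follows. This is a hedged illustration of what `_read_preferred_location_v2` does with the queried pair; the enum values and the tuple-based "location objects" are stand-ins, not the real cydriver enums or cuda.core classes.

```python
# Stand-in location-type codes (the real ones are CUmemLocationType values).
DEVICE, HOST, HOST_NUMA, HOST_NUMA_CURRENT, INVALID = range(5)

def decode_preferred_location(kind, ident):
    """Map the queried (TYPE, ID) pair to a location-like result."""
    if kind == DEVICE:
        return ("Device", ident)        # Device(ident)
    if kind == HOST:
        return ("Host", None)           # Host() -- any host
    if kind == HOST_NUMA:
        return ("Host", ident)          # Host(numa_id=ident)
    if kind == HOST_NUMA_CURRENT:
        return ("Host", "current")      # Host.numa_current()
    return None                         # no preferred location set

assert decode_preferred_location(DEVICE, 0) == ("Device", 0)
assert decode_preferred_location(HOST_NUMA, 2) == ("Host", 2)
assert decode_preferred_location(INVALID, 0) is None
```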

Test:
- TestManagedBuffer.test_preferred_location_roundtrip already exercises
  the cu13 v2 path for Device(...) and Host() (no NUMA), which now
  passes through _read_preferred_location_v2.
- New test_preferred_location_roundtrip_host_numa exercises Host(numa_id=0)
  round-trip; skips on cu12, and also skips on cu13 hardware/drivers
  where set_preferred_location with HOST_NUMA is not preserved (e.g.
  single-NUMA test machines).

ManagedBuffer class docstring updated to note the cu12 limitation
(no NUMA information available).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rparolin
Collaborator Author

rparolin commented May 1, 2026

@leofang addressed all your feedback.

@rparolin
Collaborator Author

rparolin commented May 1, 2026

Code review

Found 2 issues:

  1. cuda_core/docs/source/api.rst documents 8 names that don't exist anywhere in the codebase: free functions advise, prefetch, discard, discard_prefetch and dataclasses AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions. The actual exports from cuda.core.utils are only prefetch_batch, discard_batch, discard_prefetch_batch (plus the pre-existing StridedMemoryView and args_viewable_as_strided_memory). Sphinx will fail to resolve these autosummary entries. This looks like a leftover from the earlier API surface that was removed under R9/R11; the release notes were updated, but api.rst was missed.

https://github.com/NVIDIA/cuda-python/blob/b0d1a216e3932468d3801da5d83449465b3f8faf/cuda_core/docs/source/api.rst#L248-L268

  2. The CUDA 12 ELSE fallbacks for the single-buffer advise and prefetch paths in _managed_memory_ops.pyx reference cydriver.cuMemAdvise and cydriver.cuMemPrefetchAsync with the legacy v1 4-argument int-device signature, but those public cydriver wrappers are only emitted when cuMemAdvise_v2 / cuMemPrefetchAsync_v2 are present in the toolkit headers (CUDA 13+). On a cu12 build the symbols don't exist (see cuda_bindings/cuda/bindings/cydriver.pxd.in:3840 and cydriver.pyx.in:1237-1247, both gated on 'cuMem*_v2' in found_functions), so the ELSE branch will fail to compile / import. The batched paths handle cu12 via a NotImplementedError + per-range loop, but the single-buffer paths assume a v1 wrapper that cuda-python never exposes.

            cu_loc.type = cydriver.CUmemLocationType.CU_MEM_LOCATION_TYPE_HOST
            cu_loc.id = 0
        else:
            cu_loc = _to_cumemlocation(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemAdvise(cu_ptr, nbytes, advice_enum, cu_loc))
    ELSE:
        cdef int dev_int = -1 if loc is None else _to_legacy_device(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemAdvise(cu_ptr, nbytes, advice_enum, dev_int))


def prefetch_batch(buffers, locations, *, stream):
    """Prefetch a batch of managed-memory ranges to target locations.

    Requires CUDA 13+. For a single buffer, use
    :meth:`ManagedBuffer.prefetch` instead.

    Parameters
    ----------
    buffers : Sequence[:class:`Buffer`]
        Two or more managed allocations to operate on.
    locations : :class:`~cuda.core.Device` | :class:`~cuda.core.Host` | Sequence[...]
        Target location(s). A single location applies to all buffers; a
        sequence must match ``len(buffers)``.
    stream : :class:`~_stream.Stream` | :class:`~graph.GraphBuilder`
        Stream for the asynchronous prefetch (keyword-only).

    Notes
    -----
    On a CUDA 12 build, falls back to a Python-level loop calling
    ``cuMemPrefetchAsync`` per buffer (no batched driver entry point on
    CUDA 12). CUDA 13 builds use ``cuMemPrefetchBatchAsync`` directly.
    """
    cdef tuple bufs = _coerce_batch_buffers(buffers, "prefetch_batch")
    cdef Py_ssize_t n = len(bufs)
    cdef tuple locs = _broadcast_locations(locations, n, False, "prefetch_batch")
    cdef Stream s = Stream_accept(stream)
    cdef Buffer buf
    for buf in bufs:
        _require_managed_buffer(buf, "prefetch_batch")
    _do_batch_prefetch(bufs, locs, s)


def _do_single_prefetch_py(Buffer buf, location, stream):
    """Internal: single-buffer prefetch for ManagedBuffer.prefetch().

    Uses cuMemPrefetchAsync (works on CUDA 12 and 13).
    """
    _require_managed_buffer(buf, "prefetch")
    cdef object loc = _coerce_location(location, allow_none=False)
    cdef Stream s = Stream_accept(stream)
    _do_single_prefetch(buf, loc, s)


cdef void _do_single_prefetch(Buffer buf, object loc, Stream s):
    cdef cydriver.CUdeviceptr cu_ptr = as_cu(buf._h_ptr)
    cdef size_t nbytes = buf._size
    cdef cydriver.CUstream hstream = as_cu(s._h_stream)
    IF CUDA_CORE_BUILD_MAJOR >= 13:
        cdef cydriver.CUmemLocation cu_loc = _to_cumemlocation(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPrefetchAsync(cu_ptr, nbytes, cu_loc, 0, hstream))
    ELSE:
        cdef int dev_int = _to_legacy_device(loc)
        with nogil:
            HANDLE_RETURN(cydriver.cuMemPrefetchAsync(cu_ptr, nbytes, dev_int, hstream))
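The location-broadcasting contract stated in the docstring (a single location fans out to every buffer; a sequence must match the buffer count) can be sketched in pure Python. This is a guess at the shape of `_broadcast_locations`, not the real helper: the Cython version also takes an allow-none flag, and this simplified sketch treats only lists and tuples as sequences.

```python
def broadcast_locations(locations, n, name="prefetch_batch"):
    """Fan a scalar location out to n buffers, or validate a sequence."""
    if isinstance(locations, (list, tuple)):
        if len(locations) != n:
            raise ValueError(
                f"{name}: expected {n} locations, got {len(locations)}"
            )
        return tuple(locations)
    # Scalar location: apply it to all n buffers.
    return (locations,) * n

assert broadcast_locations("gpu0", 3) == ("gpu0", "gpu0", "gpu0")
assert broadcast_locations(["gpu0", "host"], 2) == ("gpu0", "host")
```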

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.


Labels

cuda.core: Everything related to the cuda.core module
feature: New feature or request
P1: Medium priority - Should do

5 participants