Add managed-memory advise, prefetch, and discard-prefetch free functions #1775
rparolin wants to merge 62 commits into NVIDIA:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

/ok to test
question: Does making these member functions of the
I'm moving this back into draft. We discussed it in our team meeting; I was already hesitant because Buffer is becoming a "God object" with the functionality it is gaining. We were going to explore alternatives, and free functions sound like a good one to explore.
…ns in the cuda.core.managed_memory namespace
…ups, fix docs - Remove duplicate long-form "cu_mem_advise_*" string aliases from _MANAGED_ADVICE_ALIASES; users pass short strings or the enum directly - Replace 4 boolean allow_* params in _normalize_managed_location with a single allowed_loctypes frozenset driven by _MANAGED_ADVICE_ALLOWED_LOCTYPES - Cache immutable runtime checks: CU_DEVICE_CPU, v2 bindings flag, discard_prefetch support, and advice enum-to-alias reverse map - Collapse hasattr+getattr to single getattr in _managed_location_enum - Move _require_managed_discard_prefetch_support to top of discard_prefetch for fail-fast behavior - Fix docs build: reset Sphinx module scope after managed_memory section in api.rst so subsequent sections resolve under cuda.core - Add discard_prefetch pool-allocation test and comment on _get_mem_range_attr Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e legacy path The _V2_BINDINGS cache in _buffer.pyx persists across tests, so monkeypatching get_binding_version alone is insufficient when earlier tests have already populated the cache with the v2 value. Promote _V2_BINDINGS from cdef int to a Python-level variable so tests can monkeypatch it directly via monkeypatch.setattr, and reset it to -1 in both legacy-signature tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t real hardware These three tests call cuMemAdvise on real CUDA devices and verify memory range attributes. On devices without concurrent_managed_access (e.g. Windows/WDDM), set_read_mostly silently no-ops and set_preferred_location fails with CUDA_ERROR_INVALID_DEVICE. Use the stricter _skip_if_managed_location_ops_unsupported guard, matching the pattern already used by test_managed_memory_functions_accept_raw_pointer_ranges. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…s support Reorder checks in discard_prefetch so _normalize_managed_target_range runs before _require_managed_discard_prefetch_support. This ensures non-managed buffers raise ValueError before the RuntimeError for missing cuMemDiscardAndPrefetchBatchAsync support. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps module Move advise, prefetch, and discard_prefetch functions and their helpers out of _buffer.pyx into a new _managed_memory_ops Cython module to improve separation of concerns. Expose _init_mem_attrs and _query_memory_attrs as non-inline cdef functions in _buffer.pxd so the new module can reuse them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Buffer.from_handle is now a classmethod that dispatches via cls._init, so subclasses inherit it: ManagedBuffer.from_handle(...) returns a ManagedBuffer with no override needed. Drop ManagedBuffer.from_handle. - Hoist `advise / prefetch / discard / discard_prefetch` imports from per-method lazy imports to module-level (no circular import: they live in cuda.core._memory._managed_memory_ops, not cuda.core.utils). - Cache the CUmem_advise and CUmem_range_attribute enum lookups at module level and pass enum constants directly to advise() instead of re-resolving from string aliases on every property write. - Extract _query_accessed_by as a module-level helper; AccessedBySet delegates and the accessed_by setter calls it directly instead of constructing a throwaway view.
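The classmethod dispatch this commit describes can be sketched in a few lines. This is a minimal illustration of the pattern, not the actual cuda.core Buffer; the attribute names inside `_init` are placeholders:

```python
class Buffer:
    """Sketch of subclass-aware construction via a classmethod factory."""

    @classmethod
    def from_handle(cls, ptr, size, owner=None):
        # Dispatching through cls means subclasses inherit a typed factory:
        # ManagedBuffer.from_handle(...) returns a ManagedBuffer.
        return cls._init(ptr, size, owner)

    @classmethod
    def _init(cls, ptr, size, owner):
        obj = object.__new__(cls)
        # Placeholder attributes -- the real Buffer stores driver state.
        obj.ptr, obj.size, obj.owner = ptr, size, owner
        return obj


class ManagedBuffer(Buffer):
    # No from_handle override needed: cls is ManagedBuffer at the call site.
    pass
```

Because `cls` binds to the subclass at the call site, dropping the `ManagedBuffer.from_handle` override falls out for free.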
leofang
left a comment
I need to run. Will try to revisit tonight. I haven't finished reviewing (too many lines).
from dataclasses import dataclass
…
@dataclass(frozen=True)
I dunno if this is a collective agentic illusion or what, in recent PRs I've seen many data classes. Why do we need one here?
When I asked Claude why it selected a dataclass... I'll remove that decorator.
"A. Reply only — defend the dataclass
The dataclass is doing real work, not decoration:
- eq/hash are tested behaviors — tests/memory/test_managed_ops.py:323-327 asserts Host() == Host(), hash(Host(numa_id=1)) == hash(Host(numa_id=1)), etc.
- Host is used in set/list comparisons in _managed_buffer.py:49 ([Host() if v == -1 else Device(v) for v in raw...]) — needs hashability if it ever lands in a set.
- frozen=True ensures users can't mutate a Host after stashing it on ManagedBuffer.preferred_location."
My bot reviewed and raised this idea: Host should follow Device and be a singleton class. @Andy-Jost thoughts?
Andy-Jost
left a comment
None of my comments are blocking. Looks ready to me.
Per Andy's review nit (PR NVIDIA#1775, _managed_memory_ops.pyx:207), replace the manual PyMem_Malloc / PyMem_Free pattern in the three batch helpers (_do_batch_discard, _do_batch_prefetch, _do_batch_discard_prefetch) with libcpp.vector. RAII handles cleanup, eliminating the manual try/finally and removing a leak window if _to_cumemlocation raised mid-fill. Matches the precedent used in _program.pyx, _linker.pyx, _kernel_arg_handler.pyx, _graph_node.pyx, and others. Net change: 53 insertions, 85 deletions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y_memory_attrs (R4) Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:455), restore the auto-init retry that was removed in 10de998. cuPointerGetAttributes is the first driver call _query_memory_attrs makes, and a NOT_INITIALIZED result here would otherwise propagate out of every is_managed / is_host_accessible / is_device_accessible query before the user has called any other Device API. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_host.py:9), drop the @DataClass(frozen=True) in favor of a hand-written class with property accessors. Matches Leo's original sketch from the 2026-04-28 drive-by comment and aligns with how Device is structured in this codebase. Behavior preserved: Host(), Host(numa_id=N), and Host.numa_current() all work identically. __eq__, __hash__, and immutability are hand-rolled rather than dataclass-generated. is_numa_current is no longer an __init__ kwarg — it's internal state settable only via the Host.numa_current() classmethod. Two existing TestHost cases updated: - test_numa_current_with_id_rejected → test_numa_current_only_via_classmethod - test_frozen → test_immutable (AttributeError instead of FrozenInstanceError) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
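The hand-rolled replacement might look roughly like this. A sketch of the technique only — the attribute names and internals are assumptions, not the actual _host.py:

```python
class Host:
    """Sketch of an immutable location class with hand-rolled
    __eq__/__hash__ instead of @dataclass(frozen=True)."""

    __slots__ = ("_numa_id", "_is_numa_current")

    def __init__(self, numa_id=None):
        # Bypass the blocking __setattr__ during construction.
        object.__setattr__(self, "_numa_id", numa_id)
        object.__setattr__(self, "_is_numa_current", False)

    @classmethod
    def numa_current(cls):
        # is_numa_current is internal state, settable only here.
        obj = cls()
        object.__setattr__(obj, "_is_numa_current", True)
        return obj

    @property
    def numa_id(self):
        return self._numa_id

    def __setattr__(self, name, value):
        # Hand-rolled immutability: AttributeError, not FrozenInstanceError.
        raise AttributeError(f"Host is immutable; cannot set {name!r}")

    def __eq__(self, other):
        if not isinstance(other, Host):
            return NotImplemented
        return (self._numa_id, self._is_numa_current) == (
            other._numa_id, other._is_numa_current)

    def __hash__(self):
        return hash((self._numa_id, self._is_numa_current))
```

This preserves the tested behaviors (value equality, hashability, immutability) while keeping `numa_current` out of the constructor surface.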
… (R6, R8) Per Leo's review on PR NVIDIA#1775 (_managed_buffer.py:165) and Andy's parallel question (line 144), drop the `int` shorthand for prefetch/discard_prefetch/advise locations. The previous design accepted `Device | Host | int` where `int >= 0` meant a device ordinal and `-1` magically meant host. With first-class `Device` and `Host`, the int form was redundant and the `-1 → Host` magic was surprising. Public API change: prefetch(buf, Device(0), stream=...) # was: prefetch(buf, 0, stream=...) prefetch(buf, Host(), stream=...) # was: prefetch(buf, -1, stream=...) This also resolves an inconsistency: ManagedBuffer.preferred_location already accepted only Device | Host | None, but prefetch() and discard_prefetch() accepted int. Now uniformly Device | Host. Pre-1.0 breaking change. Anyone using the int shorthand should switch to the explicit Device(N) / Host() form. Files touched: - _managed_location.py: drop the int branch from _coerce_location; TypeError now reads "Device, Host, or None" - _managed_buffer.py: type signatures `Device | Host | int` → `Device | Host` - _managed_memory_ops.pyx: docstring updates (3 occurrences) - tests/memory/test_managed_ops.py: replace int call sites with Host()/Device(N); collapse three int-branch tests into one test_int_rejected - 1.0.0-notes.rst: drop the "int values are also accepted" sentence Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Andy's review on PR NVIDIA#1775 (_managed_buffer.py:52), document `AccessedBySet` in the private API reference. It is returned by `ManagedBuffer.accessed_by` but not directly instantiable by users — matches the existing `_memory._ipc.*` entries in the same section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
leofang
left a comment
Code-level review focusing on DRY, verbosity, and test coverage — complementing the design-level comments already on the PR.
Summary of inline comments:
- _do_batch_prefetch / _do_batch_discard_prefetch are copy-pasted (~40 lines) — parameterize into one function
- Options isinstance check repeated 4x — extract a one-liner helper; also consolidate the near-identical prefetch / discard_prefetch preambles
- CUDA 12 batch fallback — loop over singles instead of NotImplementedError (the batch semantics are documented as equivalent to individual calls)
- _normalize_managed_advice is over-engineered — the alias dict + lazy reverse dict can be replaced with a getattr on the naming convention (~15 lines saved)
- Test setup boilerplate — ~25 tests repeat the same 5-line preamble; a pytest fixture would save ~75 lines
- Test helper duplicated — _get_mem_range_attr in the test is identical to _get_int_attr in production code
- Test coverage gaps — CUDA 12 batch fallback, AccessedBySet iteration, stream=None, error message assertions
def test_managed_memory_prefetch_supports_managed_pool_allocations(init_cuda):
    device = Device()
nit: Nearly every test function in this file repeats the same 5-line preamble:
device = Device()
_skip_if_managed_location_ops_unsupported(device) # or variant
device.set_current()
mr = create_managed_memory_resource_or_skip()
buffer = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)

This appears in ~25 tests. A @pytest.fixture would eliminate this boilerplate and save ~75 lines:
@pytest.fixture
def managed_buffer(init_cuda):
    device = Device()
    _skip_if_managed_location_ops_unsupported(device)
    device.set_current()
    mr = create_managed_memory_resource_or_skip()
    buf = mr.allocate(_MANAGED_TEST_ALLOCATION_SIZE)
    yield buf
    buf.close()
    mr.close()

Tests that need a different skip level (e.g. _skip_if_managed_discard_prefetch_unsupported) could use a second fixture or parametrize. The local _skip_if_* helpers also partially overlap with conftest's skip_if_managed_memory_unsupported — worth consolidating.
…red_location (R2, R7) Per Leo's questions on PR NVIDIA#1775 (_host.py:26 and _managed_buffer.py:140): R2 (Host numa_id): the dataclass surface is intentional. Three forms already cover the use cases — Host() / Host(numa_id=N) / Host.numa_current(). Auto-inferring numa_id at Host() construction would conflict with the "generic host" semantic. R7 (preferred_location getter): the underlying limitation is real but upstream-blocked. The legacy CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION returns only a single int (device id, -1 host, -2 none) — no NUMA. CUDA 13 added _PREFERRED_LOCATION_TYPE / _ID for full round-trip, and they are exposed in cydriver, but cuda.bindings' _HelperCUmem_range_attribute does not yet recognize them — calling driver.cuMemRangeGetAttribute with the new attributes raises "Unsupported attribute". Once cuda.bindings adds them, this getter can query the v2 attributes and return Host(numa_id=N). Add a docstring note documenting the limitation so users aren't surprised by the lossy round-trip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…12, R13) Per Andy's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:102 and :118), replace `isinstance(x, (list, tuple))` with `isinstance(x, Sequence)` in `_coerce_buffer_targets` and `_broadcast_locations`. Matches the existing precedent in `cuda.core._utils.cuda_utils.is_sequence()`. The widened input set also accepts `str`, but neither `Buffer` nor `Location` is stringly-typed, so a `str` input still raises — just with a different message (Buffer cast error or Location TypeError from `_coerce_location`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_buffer.pyx:135), make Buffer.from_handle a @staticmethod that always returns Buffer. Subclass-aware construction stays available via the private @classmethod Buffer._init, which is what Leo asked for ("use a private method for handling subclasses for now"). ManagedBuffer gains its own @classmethod from_handle that wraps cls._init, so user-facing call sites like ManagedBuffer.from_handle(ptr, size, owner=plain) continue to work unchanged. The narrowly-scoped subclass factory is on the subclass itself, not bolted onto Buffer's public surface. This addresses R3's spirit: cuda.core's public APIs no longer advertise generic subclass-construction support that conflicts with the broader subclassing story tracked in NVIDIA#750 / NVIDIA#1989. No test changes; behavior preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's R11 ("if we prefer methods, don't expose free functions"):
each managed-memory operation now has exactly one public surface,
chosen by whether it acts on one buffer or many.
Single buffer (instance methods + properties on ManagedBuffer):
- buf.read_mostly = True
- buf.preferred_location = Device(0)
- buf.accessed_by.add(Device(1))
- buf.prefetch(Device(0), stream=stream)
- buf.discard(stream=stream)
- buf.discard_prefetch(Device(0), stream=stream)
Multiple buffers (free functions in cuda.core.utils, CUDA 13+ only):
- utils.prefetch_batch(buffers, locations, stream=stream)
- utils.discard_batch(buffers, stream=stream)
- utils.discard_prefetch_batch(buffers, locations, stream=stream)
Removed:
- cuda.core.utils.advise / prefetch / discard / discard_prefetch
(single-buffer surfaces — replaced by ManagedBuffer methods/properties)
- cuda.core._memory._managed_memory_options module and its four empty
AdviseOptions / PrefetchOptions / DiscardOptions /
DiscardPrefetchOptions dataclasses (R9 from Leo, R10 from Andy:
empty placeholders that didn't carry information)
- options=None parameter from every public surface
- The single-buffer fast path inside the now-batched-only free
functions; they always hit cuMem*BatchAsync now
Internals:
- Public def advise() deleted; _advise_one (cdef) is the new internal
single-buffer entry point used by ManagedBuffer property setters.
- Three new Python-level wrappers _do_single_prefetch_py /
_do_single_discard_py / _do_single_discard_prefetch_py used by
ManagedBuffer instance methods. These call the cdef _do_single_*
helpers with the right Cython types after stream coercion.
- _coerce_buffer_targets renamed to _coerce_batch_buffers; rejects a
single Buffer with a TypeError pointing at the ManagedBuffer method.
Tests:
- TestPrefetch / TestDiscard / TestDiscardPrefetch / TestAdvise
rewritten as TestPrefetchBatch / TestDiscardBatch /
TestDiscardPrefetchBatch (batched-only, since single-buffer is
covered by ManagedBuffer's TestManagedBuffer class)
- Single-buffer external-allocation tests use
ManagedBuffer.from_handle(plain.handle, plain.size, owner=plain)
to wrap a DummyUnifiedMemoryResource buffer
- options-related tests deleted (no options surface to test)
- enum-value advise test deleted (property setters are typed; the
string-alias / enum-value internal API isn't user-visible)
Release notes updated.
Closes R9, R10, R11.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ad (N4) Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:23), drop the lazy-init plumbing for the enum→alias reverse lookup table. The forward table _MANAGED_ADVICE_ALIASES has six entries; building the inverse at module load via a dict comprehension is the same data without the mutable-global pattern, the `if None` check, or the `global` declaration inside the function body. Forward lookup table (_MANAGED_ADVICE_ALIASES) is preserved as the source of truth — explicit alias→CUDA-name mapping, grep-friendly, no implicit naming-convention coupling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d_prefetch} (N2) Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:425), the two batched-with-locations helpers were byte-for-byte identical except for the driver function being called. Both: - declare the same four std::vectors (ptrs, sizes, loc_arr, loc_indices) - resize and fill them in the same loop - release the GIL and call cuMem{Prefetch,DiscardAndPrefetch}BatchAsync with the same argument shape Introduce a function-pointer typedef _BatchPrefetchFn (the two driver calls share signature), parameterize the shared body as _do_batch_prefetch_op, and have the two callers pass the appropriate driver function. Both the typedef and the helper live inside the IF CUDA_CORE_BUILD_MAJOR >= 13 block since they reference cu13-only types. Net: -28 lines duplication, +25 for the shared helper. No behavior change; tests unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
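In pure Python, the same deduplication reads as passing the driver call in as a parameter. This is a sketch of the shape only — the real code uses a Cython function-pointer typedef and std::vector fills, and these helper names stand in for the actual ones:

```python
def _do_batch_prefetch_op(driver_fn, ptrs, sizes, locations, stream):
    # Shared body: validate and marshal once, then forward to whichever
    # driver entry point the caller selected. `driver_fn` plays the role
    # of the Cython function pointer.
    if not (len(ptrs) == len(sizes) == len(locations)):
        raise ValueError("mismatched batch arrays")
    return driver_fn(tuple(ptrs), tuple(sizes), tuple(locations), stream)


# The two former copy-pasted helpers collapse to thin wrappers that differ
# only in which driver call they forward to.
def do_batch_prefetch(ptrs, sizes, locs, stream, *, driver_prefetch):
    return _do_batch_prefetch_op(driver_prefetch, ptrs, sizes, locs, stream)


def do_batch_discard_prefetch(ptrs, sizes, locs, stream, *, driver_dp):
    return _do_batch_prefetch_op(driver_dp, ptrs, sizes, locs, stream)
```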
…ts (N6) Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:28), the test file's _get_mem_range_attr / _get_int_mem_range_attr / the local _MEM_RANGE_ATTRIBUTE_VALUE_SIZE constant are functionally identical to the production _get_int_attr in _managed_buffer.py. Drop the duplicates and import the production helper. 14 call sites updated. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Leo's review on PR NVIDIA#1775 (_managed_memory_ops.pyx:228), raising NotImplementedError on cu12 forces users to write their own loop. The CUDA driver semantics for cuMemPrefetchBatchAsync are equivalent to per-range cuMemPrefetchAsync calls — just more efficient when batched at the driver level. On cu12 builds (where cuMemPrefetchBatchAsync is not exposed), fall back to a Python-level loop calling cuMemPrefetchAsync per buffer. The single-range path (_do_single_prefetch) already works on cu12 via the IF/ELSE split inside it. Note this fallback applies only to prefetch_batch — discard_batch and discard_prefetch_batch keep the cu12 NotImplementedError because the driver has no single-range cuMemDiscard{,AndPrefetch}Async to fall back to. Test skips for cuMemPrefetchBatchAsync unavailability dropped from TestPrefetchBatch.test_same_location and test_per_buffer_location; the fallback path now runs on cu12 builds too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
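The fallback shape, sketched with injected stand-ins for the driver calls so it runs without CUDA — `batch_fn` and `single_fn` simulate cuMemPrefetchBatchAsync and cuMemPrefetchAsync; the real dispatch is a compile-time IF, not a runtime parameter:

```python
def prefetch_batch(buffers, locations, stream, *, batch_fn=None, single_fn=None):
    """Sketch of the cu12 fallback: use the batch entry point when the
    build exposes it, otherwise loop over per-range calls (the driver
    documents the batch as equivalent to individual prefetches)."""
    if len(buffers) != len(locations):
        raise ValueError("buffers and locations must have the same length")
    if batch_fn is not None:
        # CUDA 13+ build: one driver call for the whole batch.
        batch_fn(buffers, locations, stream)
    else:
        # CUDA 12 build: Python-level loop over single-range prefetches.
        for buf, loc in zip(buffers, locations):
            single_fn(buf, loc, stream)
```

Note the commit keeps NotImplementedError for discard_batch / discard_prefetch_batch, since those have no single-range driver call to loop over.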
Per Leo's review on PR NVIDIA#1775 (test_managed_ops.py:1), add a test for the read side of AccessedBySet: __iter__, __len__, __eq__, __repr__. These are part of the public set-like API (alongside __contains__, add(), discard(), and the setter, which are already covered) but were untested. The cu12 batch fallback path (Leo's other coverage point) is now exercised by TestPrefetchBatch.test_same_location and test_per_buffer_location running on cu12 CI — the cuMemPrefetchBatchAsync skip was dropped in d75a7bd when the fallback landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation (N8) Per the self-promised reply on PR NVIDIA#1775's R7 thread, fulfill the Host(numa_id=N) round-trip on CUDA 13 builds. The blocker before was that cuda.bindings's Python-level cuMemRangeGetAttribute wrapper rejects the new CU_MEM_RANGE_ATTRIBUTE_PREFERRED_LOCATION_TYPE / _ID attributes via its allowlist. The workaround: call cydriver.cuMemRangeGetAttribute directly from a new Cython helper _read_preferred_location_v2, bypassing the Python wrapper. The helper queries TYPE then ID, then decodes the (kind, id) pair into Device | Host | Host(numa_id=N) | Host.numa_current() | None. ManagedBuffer.preferred_location getter dispatches to the v2 path on binding_version() >= (13, 0, 0); falls back to the legacy single-int attribute on cu12 (no NUMA info available). Test: - TestManagedBuffer.test_preferred_location_roundtrip already exercises the cu13 v2 path for Device(...) and Host() (no NUMA), which now passes through _read_preferred_location_v2. - New test_preferred_location_roundtrip_host_numa exercises Host(numa_id=0) round-trip; skips on cu12, and also skips on cu13 hardware/drivers where set_preferred_location with HOST_NUMA is not preserved (e.g. single-NUMA test machines). ManagedBuffer class docstring updated to reflect the cu12-only limitation note. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@leofang addressed all your feedback.
Code review — found 2 issues:
cuda-python/cuda_core/cuda/core/_memory/_managed_memory_ops.pyx, lines 255 to 325 in b0d1a21
Summary
Adds managed-memory range operations to cuda.core:

- Free functions in cuda.core.utils: advise, prefetch, discard, discard_prefetch. Each accepts either a single Buffer or a sequence; N==1 dispatches to the per-range driver entry point and N>1 dispatches to the corresponding cuMem*BatchAsync (CUDA 13+).
- Host — new top-level class symmetric to Device: Host() (any host), Host(numa_id=N), Host.numa_current(). Used together with Device to express managed-memory locations.
- ManagedBuffer — Buffer subclass returned by ManagedMemoryResource.allocate. Exposes a Pythonic property-style advice API on top of the same free functions. Wrap an external managed pointer with Buffer.from_handle(...) (now a @classmethod, so ManagedBuffer.from_handle(...) returns a ManagedBuffer).
- *Options dataclasses — AdviseOptions, PrefetchOptions, DiscardOptions, DiscardPrefetchOptions. Frozen dataclasses reserved for future per-call flags; the current ABI has no flags worth surfacing, but the dataclasses establish the contract so future flags land without an API break.

Closes #1332. Addresses the managed-memory portion of #1333 (P1: cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync). The P0 cuMemcpyBatchAsync from #1333 is intentionally out of scope and tracked separately; the holistic batched-API contract this PR commits to is documented in #issuecomment-4355502334 so the upcoming cuMemcpyBatchAsync work can mirror it.

Public API
ManagedBuffer — property-style advice on managed allocations

ManagedMemoryResource.allocate returns a ManagedBuffer (a Buffer subclass). All ManagedBuffer-specific behavior is layered on top of the free functions, so the two surfaces stay consistent.

Free functions — advise / prefetch / discard / discard_prefetch

Each accepts a Buffer (or ManagedBuffer) or a sequence of them. Locations are expressed via Device, Host, or int (-1 → host, >= 0 → device ordinal).

Batched form — same function, sequence of targets

When N>1, dispatch goes to the corresponding cuMem*BatchAsync. Sequence locations are paired by index; a scalar location broadcasts to every target. Mismatched sequence lengths raise ValueError. On a CUDA 12 build of cuda.core, N>1 raises NotImplementedError (the *BatchAsync entry points are CUDA 13+); N==1 works on every supported toolkit.

Putting it together
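The pairing and broadcasting rules can be sketched as follows — a hypothetical pure-Python mirror of the documented behavior, not the actual cuda.core internals:

```python
from collections.abc import Sequence


def broadcast_locations(locations, n_targets):
    """Pair locations with batch targets: a sequence is matched by index,
    a scalar broadcasts to every target, and a length mismatch raises.
    Illustrative helper; the real coercion lives in Cython."""
    if isinstance(locations, Sequence) and not isinstance(locations, str):
        if len(locations) != n_targets:
            raise ValueError(
                f"got {len(locations)} locations for {n_targets} buffers"
            )
        return list(locations)
    # Scalar location: every target gets the same destination.
    return [locations] * n_targets
```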
Implementation notes

- cuda_core/cuda/core/_memory/_managed_memory_ops.pyx uses cimport cydriver for direct C-level driver calls. The CUDA 12 / CUDA 13 split for cuMemAdvise and cuMemPrefetchAsync is handled at compile time with IF CUDA_CORE_BUILD_MAJOR >= 13: / ELSE:.
- The batch entry points (cuMemPrefetchBatchAsync, cuMemDiscardBatchAsync, cuMemDiscardAndPrefetchBatchAsync) are CUDA 13+ only. On CUDA 12 builds, N>1 calls raise NotImplementedError; single-buffer calls work everywhere.
- ManagedBuffer is a pure-Python subclass of the Cython Buffer cdef class. Buffer.from_handle is now a @classmethod (was @staticmethod) so MyBufferSubclass.from_handle(...) returns the typed instance via cls._init. Buffer._from_deviceptr_handle and _MP_allocate thread an optional cls parameter so ManagedMemoryResource.allocate materializes a ManagedBuffer.
- _LocSpec (in _managed_location.py) carries the (kind, id) discriminator that the Cython layer maps to CUmemLocation (CUDA 13) or a legacy device ordinal (CUDA 12). Public callers see only Device / Host / int; _coerce_location produces the internal record.
- _buffer.pyx collapses out.is_managed = (is_managed != 0) to a single unconditional assignment and adds a TODO noting that HMM/ATS-mapped sysmem is not yet captured by CU_POINTER_ATTRIBUTE_IS_MANAGED. The cuInit retry in _query_memory_attrs was removed; we don't auto-init CUDA elsewhere.

Tests
Managed-memory tests live in cuda_core/tests/memory/test_managed_ops.py: free-function dispatch (single + batched + mismatch + non-managed rejection), Host constructors and frozen-dataclass semantics, internal _coerce_location for Device | Host | int | None, full ManagedBuffer property roundtrips (read_mostly, preferred_location, accessed_by add/discard/assignment), and instance methods. The broader memory-tests reorg (buffer / managed_resource / pinned / vmm "siblings") is tracked as a separate cleanup PR.

Deferred follow-ups

- is_managed semantics — flagged as a TODO in _buffer.pyx, tracked alongside the broader HMM/ATS work.
- cuMemcpyBatchAsync (P0 of "Support batched memory movement" #1333) — different family, separate PR; will mirror the contract in #issuecomment-4355502334.
- *Options dataclasses — empty today; concrete options land when CUDA introduces per-call flags worth surfacing.
- preferred_location / accessed_by — currently uses the legacy combined attribute (a Python binding limitation), which loses NUMA fidelity on round-trip. Setters preserve full NUMA info.