AsyncCudaResource — stream-ordered allocation backed by
cudarc’s CudaStream::alloc (which forwards to cuMemAllocAsync
when the context supports it).
Each DeviceMemoryResource::allocate call resolves the
caller-supplied StreamId to a live cudarc::driver::CudaStream
via the StreamPool, allocates against that stream, and stores
the resulting CudaSlice<u8> in the resource’s live map. On
deallocate, the slice is removed from the live map and dropped; that
drop invokes cuMemFreeAsync (when supported) on the same stream the
allocation was bound to.
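The allocate/deallocate flow described above can be sketched without a GPU. This is a minimal mock, not the crate's implementation: `MockSlice`, `MockAsyncResource`, the `u64` block id, and the `freed` log are hypothetical stand-ins (the real resource holds a `CudaSlice<u8>` and resolves the stream through the StreamPool), and the field layout of `StreamId` and `DeviceBlock` is assumed.

```rust
use std::collections::HashMap;

/// Assumed shape of the stream id; the real type lives in the crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct StreamId(pub u32);

/// Hypothetical stand-in for `CudaSlice<u8>`: remembers its bound stream.
pub struct MockSlice {
    pub len: usize,
    pub stream: StreamId,
}

/// Handle returned to the caller; carries the allocation stream.
#[derive(Debug, PartialEq, Eq)]
pub struct DeviceBlock {
    pub id: u64,
    pub len: usize,
    pub alloc_stream: StreamId,
}

#[derive(Default)]
pub struct MockAsyncResource {
    next_id: u64,
    /// Live map: block id -> owning slice.
    live: HashMap<u64, MockSlice>,
    /// Log of (stream, len) pairs standing in for queued stream-ordered frees.
    pub freed: Vec<(StreamId, usize)>,
}

impl MockAsyncResource {
    /// Allocate against `stream` and record the slice in the live map.
    pub fn allocate(&mut self, len: usize, stream: StreamId) -> DeviceBlock {
        let id = self.next_id;
        self.next_id += 1;
        self.live.insert(id, MockSlice { len, stream });
        DeviceBlock { id, len, alloc_stream: stream }
    }

    /// Remove the slice from the live map; the free is recorded against the
    /// stream the slice was bound to. Returns false for an unknown block.
    pub fn deallocate(&mut self, block: DeviceBlock) -> bool {
        match self.live.remove(&block.id) {
            Some(slice) => {
                self.freed.push((slice.stream, slice.len));
                true
            }
            None => false,
        }
    }

    pub fn live_count(&self) -> usize {
        self.live.len()
    }
}
```

In this sketch the stream recorded at free time comes from the stored slice rather than from the caller's block, which mirrors the idea that the free is bound to the allocation stream.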
This backend is the production candidate. It is not the
sanitizer/cert backend — pool/async behavior can hide byte-level
out-of-bounds patterns from Compute Sanitizer; the cert role
belongs to [DirectCudaResource] (subject to manual Compute Sanitizer
confirmation on a supported host).
§Stream-ordering contract enforced here
- allocate(.., stream, ..) is ordered on the resolved CudaStream. The returned DeviceBlock carries the same alloc_stream.
- deallocate(block) releases the underlying memory ordered on the block’s alloc_stream. Callers must have synchronized any work on a different stream before deallocation.
- Reuse of the underlying byte address by a future allocate is ordered after the previous deallocate by the CUDA driver’s stream-ordered memory allocator semantics. The stream-ordered allocation lifetime regression test encodes this.
§bytes_outstanding and pending-free accounting
The trait contract is “live + retired-but-not-yet-freed”. A queued
cuMemFreeAsync is “retired-but-not-yet-freed” until the host
synchronizes the stream the free was queued on. We therefore keep
two atomic counters:
- live_bytes: bytes for blocks currently in the live map.
- pending_bytes: bytes for blocks whose CudaSlice has been dropped (so a cuMemFreeAsync is queued on the alloc stream) but whose stream has not yet been synchronized by us.
bytes_outstanding() returns live_bytes + pending_bytes.
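As a rough illustration of the two-counter scheme, the sketch below tracks only the arithmetic. The `Accounting` type and its `on_*` method names are hypothetical, and the memory orderings and the add-before-subtract order in `on_deallocate` are choices made for this sketch, not something the source specifies.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical accounting helper; only the two counters come from the docs.
#[derive(Default)]
pub struct Accounting {
    live_bytes: AtomicU64,
    pending_bytes: AtomicU64,
}

impl Accounting {
    /// A block entered the live map.
    pub fn on_allocate(&self, bytes: u64) {
        self.live_bytes.fetch_add(bytes, Ordering::AcqRel);
    }

    /// A block left the live map and its free is now queued on its stream.
    /// Pending is incremented first so a concurrent reader may briefly
    /// over-count but never under-count (a choice made for this sketch).
    pub fn on_deallocate(&self, bytes: u64) {
        self.pending_bytes.fetch_add(bytes, Ordering::AcqRel);
        self.live_bytes.fetch_sub(bytes, Ordering::AcqRel);
    }

    /// The stream carrying the queued free has been synchronized.
    pub fn on_stream_synchronized(&self, bytes: u64) {
        self.pending_bytes.fetch_sub(bytes, Ordering::AcqRel);
    }

    /// "live + retired-but-not-yet-freed".
    pub fn bytes_outstanding(&self) -> u64 {
        self.live_bytes.load(Ordering::Acquire) + self.pending_bytes.load(Ordering::Acquire)
    }
}
```

Note that deallocation alone does not lower `bytes_outstanding()`; the bytes only move from the live counter to the pending counter until the stream is synchronized.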
reap_pending() drains the per-stream pending map under the
per-stream mutex, synchronizes each drained stream, and then
subtracts only the synchronized total from pending_bytes
via fetch_sub — it does not zero the counter. A
deallocate that races between reap’s drain and its fetch_sub
re-populates both the per-stream map and the global atomic
together (under the same mutex), so its bytes either land
entirely before the drain (reaped this round) or entirely after
(kept for the next reap), never split.
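The drain-then-subtract pattern can be sketched as follows. This is an assumption-laden mock: `Pending`, `record_free`, the `u32` stream key, and the `sync` callback (standing in for a host-side stream synchronize) are all hypothetical names, and the real code's locking granularity may differ.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical pending-free tracker keyed by a plain u32 stream id.
#[derive(Default)]
pub struct Pending {
    per_stream: Mutex<HashMap<u32, u64>>,
    pending_bytes: AtomicU64,
}

impl Pending {
    /// Called on deallocate: the map entry and the global counter are
    /// updated together while the mutex is held.
    pub fn record_free(&self, stream: u32, bytes: u64) {
        let mut map = self.per_stream.lock().unwrap();
        *map.entry(stream).or_insert(0) += bytes;
        self.pending_bytes.fetch_add(bytes, Ordering::AcqRel);
    }

    /// Drain under the mutex, synchronize each drained stream outside it,
    /// then subtract only what was drained. The counter is never zeroed.
    pub fn reap_pending(&self, mut sync: impl FnMut(u32)) -> u64 {
        let drained: Vec<(u32, u64)> = {
            let mut map = self.per_stream.lock().unwrap();
            map.drain().collect()
        };
        let mut reaped = 0u64;
        for (stream, bytes) in drained {
            sync(stream);
            reaped += bytes;
        }
        self.pending_bytes.fetch_sub(reaped, Ordering::AcqRel);
        reaped
    }

    pub fn pending_bytes(&self) -> u64 {
        self.pending_bytes.load(Ordering::Acquire)
    }
}
```

A free recorded while the reap is synchronizing streams lands after the drain, so it survives in both the map and the counter and is picked up by the next reap. Zeroing the counter instead of using `fetch_sub` would silently discard those bytes.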
On the first stream-sync failure, the failing entry and every
remaining un-iterated drained entry are restored into
pending_per_stream so a subsequent reap can retry them. Only
the bytes for streams that successfully synchronized are
decremented from pending_bytes. Without this recovery, a
transient driver error mid-reap would lose track of pending
bytes forever — the drained map would be gone, pending_bytes
would still count them, but no stream id would be queued for
a future reap. Callers that depend on this accounting (GlobalDeviceBudget
in production, and the final assertions of the stream-ordered allocation
lifetime tests) thus see a consistent bytes_outstanding() even on
transient sync failures.
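The recovery path can be sketched like this. It is a minimal mock under stated assumptions: `Pending`, `record_free`, `queued_streams`, the `u32` stream key, and the fallible `sync` callback are hypothetical, and a `BTreeMap` is used here only so the iteration order is deterministic for the example.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical tracker; only `pending_per_stream` and `pending_bytes`
/// are names taken from the docs.
#[derive(Default)]
pub struct Pending {
    pending_per_stream: Mutex<BTreeMap<u32, u64>>,
    pending_bytes: AtomicU64,
}

impl Pending {
    pub fn record_free(&self, stream: u32, bytes: u64) {
        let mut map = self.pending_per_stream.lock().unwrap();
        *map.entry(stream).or_insert(0) += bytes;
        self.pending_bytes.fetch_add(bytes, Ordering::AcqRel);
    }

    /// On the first sync failure, restore the failing entry and every
    /// entry not yet visited, then subtract only the synchronized bytes.
    pub fn reap_pending<E>(
        &self,
        mut sync: impl FnMut(u32) -> Result<(), E>,
    ) -> Result<u64, E> {
        let drained = {
            let mut map = self.pending_per_stream.lock().unwrap();
            std::mem::take(&mut *map)
        };
        let mut reaped = 0u64;
        let mut failure = None;
        let mut iter = drained.into_iter();
        while let Some((stream, bytes)) = iter.next() {
            match sync(stream) {
                Ok(()) => reaped += bytes,
                Err(e) => {
                    let mut map = self.pending_per_stream.lock().unwrap();
                    *map.entry(stream).or_insert(0) += bytes;
                    for (s, b) in iter.by_ref() {
                        *map.entry(s).or_insert(0) += b;
                    }
                    failure = Some(e);
                    break;
                }
            }
        }
        self.pending_bytes.fetch_sub(reaped, Ordering::AcqRel);
        match failure {
            Some(e) => Err(e),
            None => Ok(reaped),
        }
    }

    pub fn pending_bytes(&self) -> u64 {
        self.pending_bytes.load(Ordering::Acquire)
    }

    pub fn queued_streams(&self) -> Vec<u32> {
        self.pending_per_stream.lock().unwrap().keys().copied().collect()
    }
}
```

Restored entries are merged with `+=` rather than overwritten, so a free recorded on the same stream during the failed reap is not lost.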
Structs§
- AsyncCudaResource: Stream-ordered cudarc-backed allocator.