Module async_resource

AsyncCudaResource — stream-ordered allocation backed by cudarc’s CudaStream::alloc (which forwards to cuMemAllocAsync when the context supports it).

Each DeviceMemoryResource::allocate call resolves the caller-supplied StreamId to a live cudarc::driver::CudaStream via the StreamPool, allocates against that stream, and stores the resulting CudaSlice<u8> in the resource’s live map. On deallocate, the CudaSlice is removed from the live map and dropped, which invokes cuMemFreeAsync (when supported) on the same stream the allocation was bound to.

This backend is the production candidate. It is not the sanitizer/cert backend — pool/async behavior can hide byte-level out-of-bounds patterns from Compute Sanitizer; the cert role belongs to [DirectCudaResource] (subject to manual Compute Sanitizer confirmation on a supported host).

§Stream-ordering contract enforced here

allocate(.., stream, ..) is ordered on the resolved CudaStream. The returned DeviceBlock carries the same alloc_stream.
deallocate(block) releases the underlying memory ordered on the block’s alloc_stream. If the block was used by work on any other stream, callers must synchronize that work before calling deallocate.
Reuse of the underlying byte address by a future allocate is ordered after the previous deallocate by the CUDA driver’s stream-ordered memory allocator semantics. The stream-ordered allocation lifetime regression test encodes this.

§bytes_outstanding and pending-free accounting

The trait contract defines bytes_outstanding() as “live + retired-but-not-yet-freed”. A queued cuMemFreeAsync counts as “retired-but-not-yet-freed” until the host synchronizes the stream the free was queued on. We therefore keep two atomic counters:

live_bytes — bytes for blocks currently in the live map.
pending_bytes — bytes for blocks whose CudaSlice has been dropped (so a cuMemFreeAsync is queued on the alloc stream) but whose stream has not yet been synchronized by us.

bytes_outstanding() returns live_bytes + pending_bytes.
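
The two-counter scheme above can be sketched in isolation. This is a minimal illustration, not the crate's implementation: the Accounting type and its on_allocate, on_deallocate, and on_synchronized methods are hypothetical names, and the memory orderings are an assumption chosen for the sketch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the resource's byte accounting.
#[derive(Default)]
pub struct Accounting {
    /// Bytes for blocks currently in the live map.
    live_bytes: AtomicU64,
    /// Bytes whose free is queued but whose stream is not yet synchronized.
    pending_bytes: AtomicU64,
}

impl Accounting {
    /// A block entered the live map.
    pub fn on_allocate(&self, size: u64) {
        self.live_bytes.fetch_add(size, Ordering::AcqRel);
    }

    /// A block left the live map and its free was queued on its stream.
    /// The bytes move from "live" to "pending"; they are still outstanding.
    pub fn on_deallocate(&self, size: u64) {
        // Sketch-level choice: bump pending before dropping live so that a
        // concurrent reader may briefly over-report but does not under-report.
        self.pending_bytes.fetch_add(size, Ordering::AcqRel);
        self.live_bytes.fetch_sub(size, Ordering::AcqRel);
    }

    /// The stream carrying the queued free has been synchronized.
    pub fn on_synchronized(&self, size: u64) {
        self.pending_bytes.fetch_sub(size, Ordering::AcqRel);
    }

    /// "live + retired-but-not-yet-freed".
    pub fn bytes_outstanding(&self) -> u64 {
        self.live_bytes.load(Ordering::Acquire) + self.pending_bytes.load(Ordering::Acquire)
    }
}
```

In this sketch, deallocating a block leaves bytes_outstanding() unchanged; the value only drops once on_synchronized is called for those bytes.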

reap_pending() drains the per-stream pending map under the per-stream mutex, synchronizes each drained stream, and then subtracts only the synchronized total from pending_bytes via fetch_sub — it does not zero the counter. A deallocate that races between reap’s drain and its fetch_sub re-populates both the per-stream map and the global atomic together (under the same mutex), so its bytes either land entirely before the drain (reaped this round) or entirely after (kept for the next reap), never split.
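
The "subtract only what was drained" rule can be illustrated with a small model. This is a sketch under stated assumptions: PendingFrees, record_free, drain, and settle are hypothetical names, StreamId is modeled as a plain u32, and the stream synchronization that would happen between drain and settle is omitted.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical stream identifier for this sketch.
pub type StreamId = u32;

#[derive(Default)]
pub struct PendingFrees {
    per_stream: Mutex<HashMap<StreamId, u64>>,
    pending_bytes: AtomicU64,
}

impl PendingFrees {
    /// Record a queued free. The map entry and the global counter are
    /// updated while holding the same lock, so a drain sees both or neither.
    pub fn record_free(&self, stream: StreamId, size: u64) {
        let mut map = self.per_stream.lock().unwrap();
        *map.entry(stream).or_insert(0) += size;
        self.pending_bytes.fetch_add(size, Ordering::AcqRel);
    }

    /// Take everything recorded so far, leaving the map empty.
    pub fn drain(&self) -> Vec<(StreamId, u64)> {
        let mut map = self.per_stream.lock().unwrap();
        map.drain().collect()
    }

    /// After the drained streams are synchronized, subtract only their
    /// bytes. The counter is not zeroed, so frees recorded after the drain
    /// stay counted for the next round.
    pub fn settle(&self, drained: &[(StreamId, u64)]) {
        let total: u64 = drained.iter().map(|(_, bytes)| *bytes).sum();
        self.pending_bytes.fetch_sub(total, Ordering::AcqRel);
    }

    pub fn pending_bytes(&self) -> u64 {
        self.pending_bytes.load(Ordering::Acquire)
    }
}
```

If a free of 30 bytes is recorded between drain and settle, it is not part of the drained total, so pending_bytes() still reports 30 afterwards and the entry is waiting in the map for the next drain.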

On the first stream-sync failure, the failing entry and every remaining un-iterated drained entry are restored into pending_per_stream so a subsequent reap can retry them. Only the bytes for streams that successfully synchronized are decremented from pending_bytes. Without this recovery, a transient driver error mid-reap would lose track of pending bytes forever: the drained map would be gone and pending_bytes would still count those bytes, but no stream id would be queued for a future reap. Callers (GlobalDeviceBudget in production, and the final assertions of the stream-ordered allocation lifetime tests) thus see a consistent bytes_outstanding() even on transient sync failures.
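
The restore-on-failure path can be sketched as follows. This is an illustrative model, not the crate's code: the Reaper type, record_free, and queued_bytes are hypothetical, the stream synchronization is injected as a closure so the example runs without a GPU, and a BTreeMap is used only to make the iteration order deterministic.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical stream identifier for this sketch.
pub type StreamId = u32;

#[derive(Default)]
pub struct Reaper {
    pending_per_stream: Mutex<BTreeMap<StreamId, u64>>,
    pending_bytes: AtomicU64,
}

impl Reaper {
    pub fn record_free(&self, stream: StreamId, size: u64) {
        let mut map = self.pending_per_stream.lock().unwrap();
        *map.entry(stream).or_insert(0) += size;
        self.pending_bytes.fetch_add(size, Ordering::AcqRel);
    }

    /// Drain, synchronize each stream via `sync`, and subtract only the
    /// bytes of streams that synchronized. On the first failure, the failing
    /// entry and all not-yet-visited entries go back into the map.
    pub fn reap_pending<E>(
        &self,
        mut sync: impl FnMut(StreamId) -> Result<(), E>,
    ) -> Result<u64, E> {
        let drained = std::mem::take(&mut *self.pending_per_stream.lock().unwrap());
        let mut iter = drained.into_iter();
        let mut synced = 0u64;
        let mut failure = None;
        while let Some((stream, bytes)) = iter.next() {
            match sync(stream) {
                Ok(()) => synced += bytes,
                Err(e) => {
                    let mut map = self.pending_per_stream.lock().unwrap();
                    *map.entry(stream).or_insert(0) += bytes;
                    for (s, b) in iter.by_ref() {
                        *map.entry(s).or_insert(0) += b;
                    }
                    failure = Some(e);
                    break;
                }
            }
        }
        // Restored bytes were never subtracted, so the counter still covers them.
        self.pending_bytes.fetch_sub(synced, Ordering::AcqRel);
        match failure {
            Some(e) => Err(e),
            None => Ok(synced),
        }
    }

    pub fn pending_bytes(&self) -> u64 {
        self.pending_bytes.load(Ordering::Acquire)
    }

    /// Sum of bytes still queued in the per-stream map.
    pub fn queued_bytes(&self) -> u64 {
        self.pending_per_stream.lock().unwrap().values().sum()
    }
}
```

With frees of 10, 20, and 30 bytes queued on streams 1, 2, and 3, a reap whose sync fails on stream 2 subtracts only the 10 bytes of stream 1. The map and the counter then both report 50, and a later reap that succeeds brings the counter to zero.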

Structs§

AsyncCudaResource: Stream-ordered cudarc-backed allocator.