xlog-cuda. It gives
runtime-backed providers a stream-aware allocator, a process-wide device-memory
budget, block-use tracking, and deferred free/reap semantics for non-default
stream execution.
Default providers can still use the simpler memory manager. Use the device
runtime when you need recorded launches, cross-stream dependency tracking, or a
shared device budget across long-lived workloads.
Resource Stack
The core trait is DeviceMemoryResource. Implementations compose as decorators:
| Layer | Role |
|---|---|
| GlobalDeviceBudget | Enforces a configured byte limit and reports remaining budget. |
| LoggingResource | Emits allocation/deallocation telemetry when configured. |
| AsyncCudaResource | Allocates and frees with CUDA async APIs, tracks live and pending bytes, and records stream events. |
| StreamPool | Provides bounded reusable CUDA streams for recorded operations. |
XlogDeviceRuntime, is keyed by CUDA ordinal and prepares
the resource before first use.
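
To illustrate the decorator composition, here is a minimal CPU-only sketch. It assumes the table lists layers from outermost to innermost. The trait signature, `FakeDeviceResource`, `LoggingLayer`, `BudgetLayer`, and `build_stack` are all hypothetical stand-ins; the real `DeviceMemoryResource` trait and its implementations may look quite different.

```rust
/// Hypothetical, simplified stand-in for the real trait.
pub trait DeviceMemoryResource {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String>;
    fn deallocate(&mut self, ptr: u64, bytes: usize);
}

/// Innermost layer for this sketch: hands out fake addresses, no CUDA calls.
pub struct FakeDeviceResource {
    pub next_ptr: u64,
}

impl DeviceMemoryResource for FakeDeviceResource {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        let ptr = self.next_ptr;
        self.next_ptr += bytes.max(1) as u64;
        Ok(ptr)
    }
    fn deallocate(&mut self, _ptr: u64, _bytes: usize) {}
}

/// Decorator: records one line per call that reaches it, then forwards.
pub struct LoggingLayer<R> {
    pub inner: R,
    pub log: Vec<String>,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for LoggingLayer<R> {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        let result = self.inner.allocate(bytes);
        self.log.push(format!("alloc {} ok={}", bytes, result.is_ok()));
        result
    }
    fn deallocate(&mut self, ptr: u64, bytes: usize) {
        self.log.push(format!("free {}", bytes));
        self.inner.deallocate(ptr, bytes);
    }
}

/// Decorator: refuses requests that would push usage past `limit`.
pub struct BudgetLayer<R> {
    pub inner: R,
    pub limit: usize,
    pub used: usize,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for BudgetLayer<R> {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        // `used <= limit` always holds, so the subtraction cannot underflow.
        if bytes > self.limit - self.used {
            return Err(format!("over budget: requested {} bytes", bytes));
        }
        let ptr = self.inner.allocate(bytes)?;
        self.used += bytes;
        Ok(ptr)
    }
    fn deallocate(&mut self, ptr: u64, bytes: usize) {
        self.inner.deallocate(ptr, bytes);
        self.used = self.used.saturating_sub(bytes);
    }
}

/// Budget on the outside, logging in the middle, allocator innermost.
pub fn build_stack(limit: usize) -> BudgetLayer<LoggingLayer<FakeDeviceResource>> {
    BudgetLayer {
        inner: LoggingLayer {
            inner: FakeDeviceResource { next_ptr: 0x1000 },
            log: Vec::new(),
        },
        limit,
        used: 0,
    }
}
```

With this ordering, a request rejected by the budget layer never reaches the logging layer or the allocator. Whether the real stack logs rejected requests depends on how its layers are actually nested.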
Allocation Lifecycle
A block records:
- an allocation tag describing the caller;
- the stream that allocated it;
- a monotonic generation used to reject stale handles;
- the last writer event;
- outstanding read events;
- live and pending byte accounting.
prepare_block_use waits on the
events needed for safe ordering. After the kernel records a read or write,
finish_block_use installs the event that future users must respect. A free can
be deferred until prior stream work is complete, then reap_pending retires it.
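
The ordering rule can be sketched without CUDA by treating events as plain values. This sketch assumes the usual reader/writer hazard rule: a read waits on the last writer, and a write waits on the last writer plus all outstanding reads. `Event`, `Access`, and `BlockState` are hypothetical types, and the method signatures are illustrative only, not the runtime's actual API.

```rust
/// Hypothetical stand-in for a recorded CUDA event.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Event {
    pub stream: u32,
    pub seq: u64,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Access {
    Read,
    Write,
}

/// Per-block ordering state: the last writer and the reads issued since.
#[derive(Debug, Default)]
pub struct BlockState {
    last_writer: Option<Event>,
    readers: Vec<Event>,
}

impl BlockState {
    /// Returns the events `stream` would have to wait on before the access.
    pub fn prepare_block_use(&self, stream: u32, access: Access) -> Vec<Event> {
        // Every access must observe the most recent write.
        let mut deps: Vec<Event> = self.last_writer.iter().copied().collect();
        // A write must additionally wait for reads that are still outstanding.
        if access == Access::Write {
            deps.extend(self.readers.iter().copied());
        }
        // Work on the same stream is already ordered, so no wait is needed.
        deps.retain(|e| e.stream != stream);
        deps
    }

    /// Installs the event that later users of the block must respect.
    pub fn finish_block_use(&mut self, event: Event, access: Access) {
        match access {
            Access::Read => self.readers.push(event),
            Access::Write => {
                // A new write supersedes the earlier writer and readers.
                self.last_writer = Some(event);
                self.readers.clear();
            }
        }
    }
}
```

In a real implementation the returned events would presumably be passed to a stream-wait call rather than returned to the caller.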
The generation check is important: if a pointer address is freed and reused, an
old DeviceBlock handle does not silently mutate the new allocation.
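
A small sketch of why the generation matters. `BlockHandle`, `BlockTable`, and `UseError` are hypothetical names chosen for illustration; the real `DeviceBlock` may store and validate its generation differently.

```rust
use std::collections::HashMap;

/// Hypothetical handle: an address plus the generation it was issued under.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct BlockHandle {
    pub ptr: u64,
    pub generation: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum UseError {
    StaleHandle,
    UnknownPointer,
}

/// Tracks the current generation of every live address.
#[derive(Debug, Default)]
pub struct BlockTable {
    live: HashMap<u64, u64>,
    next_generation: u64,
}

impl BlockTable {
    /// Registers an allocation at `ptr` under a fresh, never-reused generation.
    pub fn allocate_at(&mut self, ptr: u64) -> BlockHandle {
        self.next_generation += 1;
        self.live.insert(ptr, self.next_generation);
        BlockHandle { ptr, generation: self.next_generation }
    }

    /// Accepts a handle only if its generation matches the live allocation.
    pub fn check(&self, handle: BlockHandle) -> Result<(), UseError> {
        match self.live.get(&handle.ptr) {
            None => Err(UseError::UnknownPointer),
            Some(&current) if current == handle.generation => Ok(()),
            Some(_) => Err(UseError::StaleHandle),
        }
    }

    /// Frees the block; a stale handle cannot free the newer allocation.
    pub fn free(&mut self, handle: BlockHandle) -> Result<(), UseError> {
        self.check(handle)?;
        self.live.remove(&handle.ptr);
        Ok(())
    }
}
```

If address `0x1000` is allocated, freed, and allocated again, the first handle carries the older generation, so `check` and `free` reject it even though the address matches.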
Budget Behavior
GlobalDeviceBudget wraps an inner resource and refuses allocations that would
exceed the configured limit. It distinguishes:
- bytes reserved by live allocations;
- bytes pending deferred free;
- bytes still available for new work.
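
The three categories can be modelled as a small ledger. This sketch assumes that bytes pending a deferred free still count against the limit until they are reaped, which the text above does not state outright. `BudgetLedger` and `OverBudget` are hypothetical names.

```rust
/// Hypothetical error returned when a request does not fit.
#[derive(Debug, PartialEq, Eq)]
pub struct OverBudget {
    pub requested: u64,
    pub available: u64,
}

/// Byte accounting only; no device memory is touched.
#[derive(Debug)]
pub struct BudgetLedger {
    limit: u64,
    live: u64,
    pending: u64,
}

impl BudgetLedger {
    pub fn new(limit: u64) -> Self {
        Self { limit, live: 0, pending: 0 }
    }

    pub fn live(&self) -> u64 {
        self.live
    }

    pub fn pending(&self) -> u64 {
        self.pending
    }

    /// Assumption: pending bytes stay charged until they are reaped.
    pub fn available(&self) -> u64 {
        self.limit - self.live - self.pending
    }

    /// Charges `bytes` to the live total, or refuses if they do not fit.
    pub fn reserve(&mut self, bytes: u64) -> Result<(), OverBudget> {
        let available = self.available();
        if bytes > available {
            return Err(OverBudget { requested: bytes, available });
        }
        self.live += bytes;
        Ok(())
    }

    /// Moves bytes from live to pending when a free is deferred.
    pub fn defer_free(&mut self, bytes: u64) {
        let moved = bytes.min(self.live);
        self.live -= moved;
        self.pending += moved;
    }

    /// Returns pending bytes to the budget once the free has completed.
    pub fn reap(&mut self, bytes: u64) {
        self.pending -= bytes.min(self.pending);
    }
}
```

Under this assumption a deferred free does not make room for new work on its own; only reaping does.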
Recorded Launches
Recorded kernel paths use the runtime to preserve ordering across streams:
- Acquire or reuse a stream.
- Prepare all input and output blocks for that stream.
- Launch the kernel through the provider.
- Record the block uses.
- Reap pending frees when their stream work is complete.
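
The final step, reaping, can be sketched with a simulated stream whose progress is a single sequence number. `StreamFreeQueue`, `defer_free`, and `mark_completed` are hypothetical; a real implementation would presumably query CUDA events to learn what has completed. The sketch also assumes frees are deferred in stream order.

```rust
use std::collections::VecDeque;

/// A free that must wait until the stream has passed `after_seq`.
#[derive(Debug)]
struct PendingFree {
    ptr: u64,
    bytes: u64,
    after_seq: u64,
}

/// Simulated per-stream queue of deferred frees.
#[derive(Debug, Default)]
pub struct StreamFreeQueue {
    /// Highest sequence number the simulated stream has finished.
    completed_seq: u64,
    pending: VecDeque<PendingFree>,
}

impl StreamFreeQueue {
    /// Queues a free that may only be retired once `after_seq` has completed.
    pub fn defer_free(&mut self, ptr: u64, bytes: u64, after_seq: u64) {
        self.pending.push_back(PendingFree { ptr, bytes, after_seq });
    }

    /// Stand-in for observing that recorded stream work has finished.
    pub fn mark_completed(&mut self, seq: u64) {
        self.completed_seq = self.completed_seq.max(seq);
    }

    /// Total bytes still waiting to be released.
    pub fn pending_bytes(&self) -> u64 {
        self.pending.iter().map(|p| p.bytes).sum()
    }

    /// Retires every free whose prerequisite work is complete and returns
    /// the released addresses in queue order.
    pub fn reap_pending(&mut self) -> Vec<u64> {
        let mut released = Vec::new();
        while let Some(front) = self.pending.front() {
            if front.after_seq > self.completed_seq {
                break;
            }
            released.push(front.ptr);
            self.pending.pop_front();
        }
        released
    }
}
```

Reaping stops at the first entry whose work is still in flight, which relies on the queue being in stream order.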
CUDA Version
XLOG’s public release process targets NVIDIA CUDA Toolkit 13.x. The workspace currently uses the cudarc cuda-12040 feature as a driver API binding level;
that is not the same thing as the toolkit requirement.
Failure Modes
Device-runtime failures should be surfaced as explicit resource errors:
- allocation over budget;
- stale block use after free;
- invalid stream or generation state;
- CUDA allocation/free failure;
- inability to satisfy a launch dependency.
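
One way to make these failures explicit is a dedicated error enum with one variant per case. The sketch below is an assumption about shape only: `DeviceRuntimeError`, its variants, and their fields are hypothetical and are not taken from the crate.

```rust
use std::fmt;

/// Hypothetical error type with one variant per failure mode listed above.
#[derive(Debug, PartialEq, Eq)]
pub enum DeviceRuntimeError {
    OverBudget { requested: u64, available: u64 },
    StaleBlock { ptr: u64 },
    InvalidState(&'static str),
    CudaFailure { code: i32 },
    UnsatisfiedDependency { stream: u32 },
}

impl fmt::Display for DeviceRuntimeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::OverBudget { requested, available } => write!(
                f,
                "allocation of {} bytes exceeds remaining budget of {} bytes",
                requested, available
            ),
            Self::StaleBlock { ptr } => {
                write!(f, "stale use of freed block at {:#x}", ptr)
            }
            Self::InvalidState(what) => {
                write!(f, "invalid stream or generation state: {}", what)
            }
            Self::CudaFailure { code } => {
                write!(f, "CUDA allocation/free failed with code {}", code)
            }
            Self::UnsatisfiedDependency { stream } => {
                write!(f, "cannot satisfy launch dependency on stream {}", stream)
            }
        }
    }
}

impl std::error::Error for DeviceRuntimeError {}
```

Keeping the variants distinct lets callers match on the failure they can act on, for example retrying after a reap when the budget is exhausted, while treating a stale block as a bug.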