The device runtime is an optional resource stack inside xlog-cuda. It gives runtime-backed providers a stream-aware allocator, a process-wide device-memory budget, block-use tracking, and deferred free/reap semantics for non-default stream execution. Default providers can still use the simpler memory manager. Use the device runtime when you need recorded launches, cross-stream dependency tracking, or a shared device budget across long-lived workloads.

Resource Stack

The core trait is DeviceMemoryResource. Implementations compose as decorators:
Layer               Role
GlobalDeviceBudget  Enforces a configured byte limit and reports remaining budget.
LoggingResource     Emits allocation/deallocation telemetry when configured.
AsyncCudaResource   Allocates and frees with CUDA async APIs, tracks live and pending bytes, and records stream events.
StreamPool          Provides bounded reusable CUDA streams for recorded operations.
The runtime singleton, XlogDeviceRuntime, is keyed by CUDA device ordinal, so each device gets one instance. It prepares the resource stack before first use.
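To make the decorator idea concrete, here is a minimal sketch of how layers might wrap one another. It does not reproduce the real xlog-cuda trait: the method signatures, the FakeDevice backing store, and the LoggingLayer and BudgetLayer names are assumptions for illustration, and no CUDA calls are made.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-in for DeviceMemoryResource (not the real trait).
// Addresses are plain integers and nothing touches a GPU.
trait DeviceMemoryResource {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String>;
    fn deallocate(&mut self, addr: u64) -> Result<usize, String>;
}

// Innermost layer: hands out fake addresses and remembers block sizes.
struct FakeDevice {
    next_addr: u64,
    sizes: HashMap<u64, usize>,
}

impl FakeDevice {
    fn new() -> Self {
        Self { next_addr: 0x1000, sizes: HashMap::new() }
    }
}

impl DeviceMemoryResource for FakeDevice {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        let addr = self.next_addr;
        self.next_addr += bytes.max(1) as u64;
        self.sizes.insert(addr, bytes);
        Ok(addr)
    }
    fn deallocate(&mut self, addr: u64) -> Result<usize, String> {
        self.sizes
            .remove(&addr)
            .ok_or_else(|| format!("unknown address {addr:#x}"))
    }
}

// Decorator: records one telemetry line per call, then delegates inward.
struct LoggingLayer<R: DeviceMemoryResource> {
    inner: R,
    log: Vec<String>,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for LoggingLayer<R> {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        let result = self.inner.allocate(bytes);
        self.log.push(format!("alloc {bytes} ok={}", result.is_ok()));
        result
    }
    fn deallocate(&mut self, addr: u64) -> Result<usize, String> {
        let result = self.inner.deallocate(addr);
        self.log.push(format!("free {addr:#x} ok={}", result.is_ok()));
        result
    }
}

// Decorator: refuses allocations that would exceed a byte limit.
struct BudgetLayer<R: DeviceMemoryResource> {
    inner: R,
    limit: usize,
    used: usize,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for BudgetLayer<R> {
    fn allocate(&mut self, bytes: usize) -> Result<u64, String> {
        let remaining = self.limit.saturating_sub(self.used);
        if bytes > remaining {
            return Err(format!("over budget: need {bytes}, have {remaining}"));
        }
        let addr = self.inner.allocate(bytes)?;
        self.used += bytes;
        Ok(addr)
    }
    fn deallocate(&mut self, addr: u64) -> Result<usize, String> {
        let bytes = self.inner.deallocate(addr)?;
        self.used = self.used.saturating_sub(bytes);
        Ok(bytes)
    }
}

fn main() {
    let logged = LoggingLayer { inner: FakeDevice::new(), log: Vec::new() };
    let mut stack = BudgetLayer { inner: logged, limit: 1024, used: 0 };
    let a = stack.allocate(512).unwrap();
    assert!(stack.allocate(1024).is_err()); // refused before reaching inner layers
    stack.deallocate(a).unwrap();
    println!("{:?}", stack.inner.log);
}
```

Because the budget layer sits outermost in this sketch, a refused request never reaches the logging layer. Whether the real stack orders its layers this way is not stated here.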

Allocation Lifecycle

A block records:
  • an allocation tag describing the caller;
  • the stream that allocated it;
  • a monotonic generation used to reject stale handles;
  • the last writer event;
  • outstanding read events;
  • live and pending byte accounting.
Before a kernel uses a block on another stream, prepare_block_use waits on the events needed for safe ordering. After the kernel records a read or write, finish_block_use installs the event that future users must respect. A free can be deferred until prior stream work is complete; reap_pending then retires it. The generation check guards against address reuse: if an address is freed and later handed out again, an old DeviceBlock handle carries a stale generation and is rejected instead of silently mutating the new allocation.
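The lifecycle above can be sketched with plain data structures. This is an illustration under assumptions, not the runtime's code: the BlockTable type, the Access enum, integer event IDs, and the signatures given to prepare_block_use and finish_block_use are all hypothetical. The sketch assumes readers wait only on the last writer, while writers wait on the last writer and every outstanding reader.

```rust
use std::collections::HashMap;

type EventId = u64; // stand-in for a recorded CUDA event

#[derive(Clone, Copy, Debug, PartialEq)]
struct DeviceBlock {
    addr: u64,
    generation: u64,
}

#[derive(Clone, Copy, PartialEq)]
enum Access {
    Read,
    Write,
}

#[derive(Default)]
struct BlockState {
    generation: u64,
    last_writer: Option<EventId>,
    readers: Vec<EventId>,
}

#[derive(Default)]
struct BlockTable {
    live: HashMap<u64, BlockState>,
    next_generation: u64,
}

impl BlockTable {
    // Every allocation gets a fresh, strictly increasing generation.
    fn allocate(&mut self, addr: u64) -> DeviceBlock {
        self.next_generation += 1;
        let generation = self.next_generation;
        self.live.insert(addr, BlockState { generation, ..Default::default() });
        DeviceBlock { addr, generation }
    }

    // A handle is valid only if its generation matches the live entry.
    fn check(&self, block: DeviceBlock) -> Result<&BlockState, String> {
        match self.live.get(&block.addr) {
            Some(state) if state.generation == block.generation => Ok(state),
            _ => Err(format!("stale handle for {:#x}", block.addr)),
        }
    }

    fn free(&mut self, block: DeviceBlock) -> Result<(), String> {
        self.check(block)?;
        self.live.remove(&block.addr);
        Ok(())
    }

    // Returns the events the next user must wait on before touching the block.
    fn prepare_block_use(&self, block: DeviceBlock, access: Access) -> Result<Vec<EventId>, String> {
        let state = self.check(block)?;
        let mut waits: Vec<EventId> = state.last_writer.into_iter().collect();
        if access == Access::Write {
            waits.extend(&state.readers);
        }
        Ok(waits)
    }

    // Installs the event that future users must respect.
    fn finish_block_use(&mut self, block: DeviceBlock, access: Access, event: EventId) -> Result<(), String> {
        self.check(block)?;
        if let Some(state) = self.live.get_mut(&block.addr) {
            match access {
                Access::Write => {
                    state.last_writer = Some(event);
                    state.readers.clear();
                }
                Access::Read => state.readers.push(event),
            }
        }
        Ok(())
    }
}

fn main() {
    let mut table = BlockTable::default();
    let old = table.allocate(0x1000);
    table.finish_block_use(old, Access::Write, 1).unwrap();
    table.free(old).unwrap();
    let new = table.allocate(0x1000); // same address, new generation
    assert!(table.prepare_block_use(old, Access::Read).is_err());
    assert!(table.prepare_block_use(new, Access::Read).is_ok());
    println!("stale handle rejected: {:?}", old);
}
```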

Budget Behavior

GlobalDeviceBudget wraps an inner resource and refuses allocations that would exceed the configured limit. It distinguishes:
  • bytes reserved by live allocations;
  • bytes pending deferred free;
  • bytes still available for new work.
The runtime uses this budget to make memory pressure explicit. A route that cannot fit should return an allocation or capacity error, or fall back to an alternative path when the caller defines one.
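The three quantities can be pictured as a small ledger. The sketch below is a hypothetical model rather than GlobalDeviceBudget itself: BudgetLedger, BudgetError, and the method names are invented for illustration. It assumes that bytes pending a deferred free still count against the limit until they are reaped.

```rust
#[derive(Debug, PartialEq)]
enum BudgetError {
    OverBudget { requested: usize, available: usize },
}

// Hypothetical ledger; the real GlobalDeviceBudget may account differently.
struct BudgetLedger {
    limit: usize,
    live: usize,    // bytes reserved by live allocations
    pending: usize, // bytes waiting on a deferred free
}

impl BudgetLedger {
    fn new(limit: usize) -> Self {
        Self { limit, live: 0, pending: 0 }
    }

    // Bytes still available for new work (assumes pending bytes are not yet reusable).
    fn available(&self) -> usize {
        self.limit.saturating_sub(self.live + self.pending)
    }

    fn reserve(&mut self, bytes: usize) -> Result<(), BudgetError> {
        let available = self.available();
        if bytes > available {
            return Err(BudgetError::OverBudget { requested: bytes, available });
        }
        self.live += bytes;
        Ok(())
    }

    // Move bytes from live to pending when a free is deferred.
    fn defer_free(&mut self, bytes: usize) {
        let moved = bytes.min(self.live);
        self.live -= moved;
        self.pending += moved;
    }

    // Release pending bytes once their stream work has completed.
    fn reap(&mut self, bytes: usize) {
        self.pending -= bytes.min(self.pending);
    }
}

fn main() {
    let mut ledger = BudgetLedger::new(100);
    ledger.reserve(60).unwrap();
    println!("{:?}", ledger.reserve(50)); // refused: only 40 bytes remain
    ledger.defer_free(60);
    println!("available while pending: {}", ledger.available());
    ledger.reap(60);
    println!("available after reap: {}", ledger.available());
}
```

Under this assumption a deferred free does not relieve memory pressure by itself; only reaping does, which is why a caller under pressure might reap before retrying.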

Recorded Launches

Recorded kernel paths use the runtime to preserve ordering across streams:
  1. Acquire or reuse a stream.
  2. Prepare all input and output blocks for that stream.
  3. Launch the kernel through the provider.
  4. Record the block uses.
  5. Reap pending frees when their stream work is complete.
This is the mechanism behind the runtime-backed WCOJ, groupby, join, and solver paths that need reliable non-default stream behavior.
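The five steps can be walked through with a toy model. Everything in the sketch is an assumption made for illustration: MiniRuntime, string block names, integer event IDs, and the single "completed up to" watermark are stand-ins, every use is conservatively treated as a write, and no kernel actually runs. Real CUDA events on different streams do not complete in one global order.

```rust
use std::collections::HashMap;

type StreamId = usize;
type EventId = u64;

// Hypothetical toy runtime; not the XlogDeviceRuntime API.
struct MiniRuntime {
    idle_streams: Vec<StreamId>,          // bounded pool of reusable streams
    last_event: HashMap<String, EventId>, // block name -> event later users must wait on
    pending_frees: Vec<(EventId, usize)>, // (event guarding the free, bytes)
    next_event: EventId,
}

impl MiniRuntime {
    fn new(streams: usize) -> Self {
        Self {
            idle_streams: (0..streams).collect(),
            last_event: HashMap::new(),
            pending_frees: Vec::new(),
            next_event: 0,
        }
    }

    // Steps 1 to 4. Returns the events waited on and the newly recorded event,
    // or None when the stream pool is exhausted.
    fn recorded_launch(
        &mut self,
        blocks: &[&str],
        kernel: impl FnOnce(StreamId),
    ) -> Option<(Vec<EventId>, EventId)> {
        // 1. Acquire or reuse a stream.
        let stream = self.idle_streams.pop()?;
        // 2. Prepare blocks: collect the events this stream must wait on.
        let mut waits: Vec<EventId> = blocks
            .iter()
            .filter_map(|b| self.last_event.get(*b).copied())
            .collect();
        waits.sort_unstable();
        waits.dedup();
        // 3. Launch the kernel (here just a callback).
        kernel(stream);
        // 4. Record the block uses under a fresh event.
        self.next_event += 1;
        let event = self.next_event;
        for b in blocks {
            self.last_event.insert((*b).to_string(), event);
        }
        self.idle_streams.push(stream);
        Some((waits, event))
    }

    // Defer a free until the block's last recorded event has completed.
    // Event 0 means no prior work, so such a free is reaped on the next pass.
    fn defer_free(&mut self, block: &str, bytes: usize) {
        let event = self.last_event.remove(block).unwrap_or(0);
        self.pending_frees.push((event, bytes));
    }

    // 5. Reap frees whose guarding event is at or below the completed watermark.
    fn reap_pending(&mut self, completed: EventId) -> usize {
        let mut released = 0;
        self.pending_frees.retain(|&(event, bytes)| {
            if event <= completed {
                released += bytes;
                false
            } else {
                true
            }
        });
        released
    }
}

fn main() {
    let mut rt = MiniRuntime::new(2);
    rt.recorded_launch(&["a", "b"], |s| println!("kernel 1 on stream {s}"));
    let (waits, _) = rt.recorded_launch(&["b", "c"], |_| {}).unwrap();
    println!("second launch waits on events {:?}", waits);
    rt.defer_free("b", 256);
    println!("reaped {} bytes", rt.reap_pending(2));
}
```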

CUDA Version

XLOG’s public release process targets NVIDIA CUDA Toolkit 13.x. The workspace currently enables the cudarc cuda-12040 feature, which sets the driver API binding level; it is not the toolkit requirement.

Failure Modes

Device-runtime failures should be surfaced as explicit resource errors:
  • allocation over budget;
  • stale block use after free;
  • invalid stream or generation state;
  • CUDA allocation/free failure;
  • inability to satisfy a launch dependency.
These are runtime diagnostics, not proof that a query result is correct or that an optimized route fired. Pair them with route counters and validation gates.
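One way to keep these failures explicit is a dedicated error type with one variant per case. The enum below is a hypothetical sketch: the DeviceRuntimeError name, its variants, and their fields are assumptions and do not describe the error type xlog-cuda actually exposes.

```rust
use std::fmt;

// Hypothetical error type mirroring the failure list; not the real xlog-cuda type.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DeviceRuntimeError {
    OverBudget { requested: usize, available: usize },
    StaleBlock { addr: u64, generation: u64 },
    InvalidState(String),
    CudaFailure { op: &'static str, code: i32 },
    UnsatisfiedDependency(String),
}

impl fmt::Display for DeviceRuntimeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::OverBudget { requested, available } => write!(
                f,
                "allocation of {requested} bytes exceeds remaining budget of {available} bytes"
            ),
            Self::StaleBlock { addr, generation } => {
                write!(f, "stale block {addr:#x} (generation {generation}) used after free")
            }
            Self::InvalidState(what) => write!(f, "invalid stream or generation state: {what}"),
            Self::CudaFailure { op, code } => write!(f, "CUDA {op} failed with code {code}"),
            Self::UnsatisfiedDependency(what) => {
                write!(f, "cannot satisfy launch dependency: {what}")
            }
        }
    }
}

impl std::error::Error for DeviceRuntimeError {}

fn main() {
    let err = DeviceRuntimeError::OverBudget { requested: 50, available: 40 };
    // A caller can match on the variant to decide between failing and a fallback path.
    match &err {
        DeviceRuntimeError::OverBudget { .. } => println!("capacity error: {err}"),
        other => println!("resource error: {other}"),
    }
}
```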