pub struct XlogDeviceRuntime { /* private fields */ }
Per-CUDA-ordinal device-runtime singleton.
Owns the device handle, stream pool, and resource stack. Allocate
/ deallocate calls forward to the resource. The resource is fixed
at construction (currently always DirectCudaResource); a
future commit will swap in AsyncCudaResource as the default
while keeping the direct backend reachable for sanitizer mode.
Implementations
impl XlogDeviceRuntime
pub fn with_resource(
    device: Arc<CudaDevice>,
    device_ordinal: u32,
    stream_pool: Arc<StreamPool>,
    resource: Box<dyn DeviceMemoryResource + Send + Sync>,
) -> Self
Compose an owned runtime around a caller-supplied resource
stack. Not a singleton: the returned value is not
stored in RUNTIMES and does not interact with try_get.
Intended uses:
- Tests that need to drive a specific backend (e.g., AsyncCudaResource) through the same facade production code uses, instead of constructing the resource directly.
- Future decorator stacks (LoggingResource, GlobalDeviceBudget, DebugGuardResource) that wrap the base resource before installation.
The device and stream_pool arguments must be consistent
with device_ordinal (the pool must be bound to the same
device handle, and the device must be the one the resource
allocates against). The constructor does not verify this —
callers that compose mismatched parts get undefined
runtime-level behavior, but the per-resource device-ordinal
check on deallocate will still surface obvious mistakes as
ResourceError::Driver.
The singleton path remains Self::try_get, which today
always installs the cudarc default (non-pooled) backend
(DirectCudaResource). Swapping the singleton’s default
resource is a separate later change gated on
GlobalDeviceBudget and LoggingResource landing.
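The decorator-stack use case can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in: `MemoryResource`, `CountingResource`, `LoggingDecorator`, and `OwnedRuntime` are simplified models, not the crate's `DeviceMemoryResource`, `LoggingResource`, or `XlogDeviceRuntime`, and no CUDA device is involved. It only shows the shape of wrapping a base resource before handing the boxed stack to a `with_resource`-style constructor.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the crate's resource trait (not its real signature).
pub trait MemoryResource {
    fn allocate(&self, bytes: usize) -> Result<(), String>;
    fn deallocate(&self, bytes: usize) -> Result<(), String>;
    fn bytes_outstanding(&self) -> usize;
}

/// Base resource for the sketch: tracks a byte counter only, no device memory.
#[derive(Default)]
pub struct CountingResource {
    outstanding: AtomicUsize,
}

impl MemoryResource for CountingResource {
    fn allocate(&self, bytes: usize) -> Result<(), String> {
        self.outstanding.fetch_add(bytes, Ordering::SeqCst);
        Ok(())
    }
    fn deallocate(&self, bytes: usize) -> Result<(), String> {
        // Reject frees that would drive the counter below zero.
        self.outstanding
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |cur| cur.checked_sub(bytes))
            .map(|_| ())
            .map_err(|cur| format!("freeing {bytes} bytes but only {cur} outstanding"))
    }
    fn bytes_outstanding(&self) -> usize {
        self.outstanding.load(Ordering::SeqCst)
    }
}

/// Decorator: records each call in a shared log, then forwards to the wrapped resource.
pub struct LoggingDecorator {
    inner: Box<dyn MemoryResource + Send + Sync>,
    log: Arc<Mutex<Vec<String>>>,
}

impl MemoryResource for LoggingDecorator {
    fn allocate(&self, bytes: usize) -> Result<(), String> {
        self.log.lock().unwrap().push(format!("allocate {bytes}"));
        self.inner.allocate(bytes)
    }
    fn deallocate(&self, bytes: usize) -> Result<(), String> {
        self.log.lock().unwrap().push(format!("deallocate {bytes}"));
        self.inner.deallocate(bytes)
    }
    fn bytes_outstanding(&self) -> usize {
        self.inner.bytes_outstanding()
    }
}

/// Owned (non-singleton) facade that forwards every call to its resource stack.
pub struct OwnedRuntime {
    resource: Box<dyn MemoryResource + Send + Sync>,
}

impl OwnedRuntime {
    pub fn with_resource(resource: Box<dyn MemoryResource + Send + Sync>) -> Self {
        Self { resource }
    }
    pub fn allocate(&self, bytes: usize) -> Result<(), String> {
        self.resource.allocate(bytes)
    }
    pub fn deallocate(&self, bytes: usize) -> Result<(), String> {
        self.resource.deallocate(bytes)
    }
    pub fn bytes_outstanding(&self) -> usize {
        self.resource.bytes_outstanding()
    }
}

/// Build the stack (decorator around base) and hand it to the facade.
pub fn build_logged_runtime() -> (OwnedRuntime, Arc<Mutex<Vec<String>>>) {
    let log = Arc::new(Mutex::new(Vec::new()));
    let stack = LoggingDecorator {
        inner: Box::new(CountingResource::default()),
        log: Arc::clone(&log),
    };
    (OwnedRuntime::with_resource(Box::new(stack)), log)
}
```

In this model the facade never learns which layers the stack contains, which is what lets a test swap in a different backend without touching the code under test.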
pub fn try_get(ordinal: u32) -> Result<&'static XlogDeviceRuntime>
Get the singleton for ordinal, initializing it on first
access. Subsequent calls return the same &'static.
Errors:
- XlogError::Kernel if ordinal >= MAX_DEVICE_ORDINALS.
- XlogError::Kernel if the CUDA device cannot be opened.
Concurrency: at most one thread builds the runtime for a
given ordinal. Other concurrent first callers block on the
per-ordinal init mutex until the winner publishes via
OnceLock::set, after which they observe the published
runtime via the inside-mutex double-check or the lock-free
fast path on subsequent calls.
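The initialization protocol described above (lock-free fast path, per-ordinal init mutex, double-check inside the mutex, publish via `OnceLock::set`) can be sketched with the standard library alone. This is an illustrative model under stated assumptions, not the crate's implementation: `Runtime`, `SLOTS`, `INIT_LOCKS`, `BUILDS`, the limit of four ordinals, and the `String` error type are all hypothetical, and the real constructor opens a CUDA device where this sketch builds a plain struct.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

/// Hypothetical ordinal limit; the crate's MAX_DEVICE_ORDINALS may differ.
const MAX_ORDINALS: usize = 4;

/// Hypothetical stand-in for the per-device runtime.
#[derive(Debug)]
pub struct Runtime {
    pub ordinal: u32,
}

// One publication slot and one init mutex per ordinal.
static SLOTS: [OnceLock<Runtime>; MAX_ORDINALS] =
    [OnceLock::new(), OnceLock::new(), OnceLock::new(), OnceLock::new()];
static INIT_LOCKS: [Mutex<()>; MAX_ORDINALS] =
    [Mutex::new(()), Mutex::new(()), Mutex::new(()), Mutex::new(())];

/// Counts how many times a runtime was actually constructed (for illustration).
pub static BUILDS: AtomicUsize = AtomicUsize::new(0);

pub fn try_get(ordinal: u32) -> Result<&'static Runtime, String> {
    let idx = ordinal as usize;
    if idx >= MAX_ORDINALS {
        return Err(format!("ordinal {ordinal} out of range"));
    }
    // Lock-free fast path: already published.
    if let Some(rt) = SLOTS[idx].get() {
        return Ok(rt);
    }
    // Slow path: serialize first-time construction for this ordinal.
    let _guard = INIT_LOCKS[idx]
        .lock()
        .map_err(|_| "init mutex poisoned".to_string())?;
    // Double-check: another thread may have published while we waited.
    if let Some(rt) = SLOTS[idx].get() {
        return Ok(rt);
    }
    // Fallible construction would happen here (e.g., opening the device).
    BUILDS.fetch_add(1, Ordering::SeqCst);
    let rt = Runtime { ordinal };
    SLOTS[idx]
        .set(rt)
        .map_err(|_| "slot already initialized".to_string())?;
    Ok(SLOTS[idx].get().expect("slot was just set"))
}
```

A plausible reason for pairing a mutex with `OnceLock::set` instead of calling `OnceLock::get_or_init` is that construction can fail: a failed attempt leaves the slot empty, so a later call can retry.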
pub fn device_ordinal(&self) -> u32
CUDA ordinal this runtime serves.
pub fn device(&self) -> &Arc<CudaDevice>
Borrow the device handle.
pub fn stream_pool(&self) -> &Arc<StreamPool>
Borrow the stream pool.
pub fn allocate(
    &self,
    bytes: usize,
    stream: StreamId,
    tag: AllocTag,
) -> ResourceResult<DeviceBlock>
Allocate via the underlying resource. Stream-ordered: the
returned DeviceBlock is bound to stream.
pub fn deallocate(&self, block: DeviceBlock) -> ResourceResult<()>
Deallocate via the underlying resource.
pub fn bytes_outstanding(&self) -> usize
Sum of bytes currently outstanding on this device, as reported by the underlying resource. Used by the global-budget adaptor (later commit) and the parallel-stress acceptance test.
pub fn reap_pending(&self) -> ResourceResult<()>
Drain pending async frees on the underlying resource. No-op
for synchronous backends. Callers that need an accurate
bytes_outstanding reading after a burst of asynchronous
deallocations should call this first.
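Why a reap is needed before reading `bytes_outstanding` can be seen in a toy model of deferred frees. This is a sketch under assumptions, not the crate's backend: `DeferredFreeModel` is hypothetical, it tracks byte counts only, and it treats every queued free as already complete on the device, whereas a real asynchronous backend would reap only frees whose stream work has finished.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Bookkeeping shared by all calls; private to the model.
#[derive(Default)]
struct State {
    outstanding: usize,
    pending_frees: VecDeque<usize>,
}

/// Hypothetical model of a backend whose frees are queued rather than applied at once.
#[derive(Default)]
pub struct DeferredFreeModel {
    state: Mutex<State>,
}

impl DeferredFreeModel {
    /// Allocation is accounted immediately.
    pub fn allocate(&self, bytes: usize) {
        self.state.lock().unwrap().outstanding += bytes;
    }

    /// Deallocation only queues the free; the counter is untouched until a reap.
    pub fn deallocate(&self, bytes: usize) {
        self.state.lock().unwrap().pending_frees.push_back(bytes);
    }

    /// Drain the queue and apply each free to the counter.
    pub fn reap_pending(&self) {
        let mut state = self.state.lock().unwrap();
        while let Some(bytes) = state.pending_frees.pop_front() {
            state.outstanding = state.outstanding.saturating_sub(bytes);
        }
    }

    /// Reports the counter as of the last reap.
    pub fn bytes_outstanding(&self) -> usize {
        self.state.lock().unwrap().outstanding
    }
}
```

In this model, a read taken between `deallocate` and `reap_pending` still includes the queued bytes, which is the stale reading the guidance above is meant to avoid.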
pub fn record_block_use(
    &self,
    block: &DeviceBlock,
    use_stream: StreamId,
) -> ResourceResult<()>
Record that work has been (or is being) submitted on
use_stream that touches block. Forwards to the
underlying resource stack
(GlobalDeviceBudget → LoggingResource → AsyncCudaResource),
where the stream-ordered backend attaches a CUDA event so
block.alloc_stream waits on it before the queued
cuMemFreeAsync runs. This is the production-reachable
hook the future xlog launch builder will call for
read / write / read_write buffer args; until that
lands, callers that submit raw CUDA work on a stream
other than block.alloc_stream should call this directly.
See DeviceMemoryResource::record_block_use for the
underlying contract.
pub fn supports_block_use_tracking(&self) -> bool
Whether the active resource stack tracks cross-stream
uses (i.e., supports record_block_use). The launch
recorder’s preflight checks this BEFORE queuing CUDA
work, so a misconfigured runtime fails loudly at the
boundary rather than after the launch is in flight.
pub fn prepare_block_use(
    &self,
    block: BlockId,
    use_stream: StreamId,
    access: Access,
) -> ResourceResult<()>
Pre-launch hook: queue cross-stream waits required for
use_stream to safely access block with access
semantics. MUST be called BEFORE the GPU work is enqueued
on use_stream. Forwards to the resource stack; see
DeviceMemoryResource::prepare_block_use for the
underlying contract.
pub fn finish_block_use(
    &self,
    block: BlockId,
    use_stream: StreamId,
    access: Access,
) -> ResourceResult<()>
Post-launch hook: record an event on use_stream
capturing the work just enqueued and update block’s
dependency state. MUST be called AFTER the launch /
copy is queued. Forwards to the resource stack; see
DeviceMemoryResource::finish_block_use for the
underlying contract.
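The ordering contract of the pre-launch and post-launch hooks (preflight check, then prepare, then enqueue, then finish) can be captured in a small bracket helper. The sketch below is illustrative only: the `BlockUseHooks` trait, the integer block and stream ids, the `Access` variants, the `String` error type, and `RecordingRuntime` are hypothetical stand-ins for the crate's types, and the choice to skip the finish hook when enqueueing fails is an assumption of this sketch rather than documented behavior.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for the crate's access-mode type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Access {
    Read,
    Write,
    ReadWrite,
}

/// Hypothetical view of the three runtime methods this helper relies on.
pub trait BlockUseHooks {
    fn supports_block_use_tracking(&self) -> bool;
    fn prepare_block_use(&self, block: u64, use_stream: u32, access: Access) -> Result<(), String>;
    fn finish_block_use(&self, block: u64, use_stream: u32, access: Access) -> Result<(), String>;
}

/// Run `enqueue` bracketed by the prepare and finish hooks, in that order.
pub fn with_block_use<R, F>(
    rt: &R,
    block: u64,
    use_stream: u32,
    access: Access,
    enqueue: F,
) -> Result<(), String>
where
    R: BlockUseHooks,
    F: FnOnce() -> Result<(), String>,
{
    // Preflight: fail before any work is queued if tracking is unavailable.
    if !rt.supports_block_use_tracking() {
        return Err("resource stack does not track cross-stream uses".to_string());
    }
    // Must precede the GPU work so required waits are queued first.
    rt.prepare_block_use(block, use_stream, access)?;
    // The caller's launch or copy; on failure nothing was queued, so skip finish.
    enqueue()?;
    // Must follow the GPU work so the recorded event covers it.
    rt.finish_block_use(block, use_stream, access)
}

/// Test double that records the order in which hooks are invoked.
pub struct RecordingRuntime {
    pub tracking: bool,
    pub calls: RefCell<Vec<&'static str>>,
}

impl BlockUseHooks for RecordingRuntime {
    fn supports_block_use_tracking(&self) -> bool {
        self.tracking
    }
    fn prepare_block_use(&self, _block: u64, _use_stream: u32, _access: Access) -> Result<(), String> {
        self.calls.borrow_mut().push("prepare");
        Ok(())
    }
    fn finish_block_use(&self, _block: u64, _use_stream: u32, _access: Access) -> Result<(), String> {
        self.calls.borrow_mut().push("finish");
        Ok(())
    }
}
```

Keeping the bracket in one helper makes it harder for a call site to enqueue work without the surrounding hooks or to call them in the wrong order.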
pub fn prepare_first_use<T: DeviceRepr>(
    &self,
    slice: &TrackedCudaSlice<T>,
    use_stream: StreamId,
    access: Access,
) -> ResourceResult<()>
Convenience for helper-internal scratch allocations that
will be immediately written / read on use_stream.
Looks up the BlockId from the slice’s runtime block
and calls Self::prepare_block_use with access. Use
this directly after GpuMemoryManager::alloc when the
buffer’s first cross-stream consumer is the same operator
(e.g., a hash-table bucket array memset on launch_stream
against a buffer freshly allocated on the manager’s
default stream).
Returns Err(ResourceError::StreamMisuse) if slice is
not runtime-backed — strict callers should ensure their
memory manager carries a runtime.
pub fn finish_first_use<T: DeviceRepr>(
    &self,
    slice: &TrackedCudaSlice<T>,
    use_stream: StreamId,
    access: Access,
) -> ResourceResult<()>
Convenience for helper-internal scratch finish: looks up
the BlockId from the slice and forwards to
Self::finish_block_use.