Launch / use recorder for runtime-backed buffers.
Closes the production-side of the cross-stream lifetime gap
identified by A4 and the use-after-prior-write hazard
discovered by the multi-threaded sort+hash-join regression.
Code that enqueues kernels or copies on a launch_stream
other than the buffer’s alloc_stream MUST tell the runtime
about the use BEFORE the launch (so prior cross-stream waits
can be queued ahead of the work) AND AFTER the launch (so a
use-event is recorded for future readers / writers and for
the eventual deallocate).
Without preflight, the CUDA mempool is free to reuse the address while the cross-stream work is still in flight, AND prior writes / reads on a different stream remain invisible to the new work — kernels read torn state.
§Modes
Two construction modes:
- LaunchRecorder::new_permissive: silently skips buffers that have no runtime-side identity (legacy cudarc-backed TrackedCudaSlice, external Dlpack / ArrowDevice columns). Intended for low-level helpers during the migration window where mixed legacy/runtime calls are unavoidable. Not safe for production migrated paths: silent skips are silent gaps.
- LaunchRecorder::new_strict: rejects any buffer that cannot be tracked. Intended for production migrated launch paths: any buffer the recorder cannot attach an event to is a structural problem the caller must fix (route the allocation through the runtime, or refuse external memory in this code path).
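The difference between the two modes can be sketched with a toy model. Everything below (ToyMode, ToyBuffer, tracked_blocks) is a hypothetical stand-in written for illustration, not the crate's API; it only mirrors the policy described above, where permissive mode skips untrackable buffers and strict mode rejects them.

```rust
/// Hypothetical stand-in for the recorder construction mode.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyMode {
    Permissive,
    Strict,
}

/// Hypothetical buffer classification used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyBuffer {
    /// Allocated through the runtime, so it has a block id.
    Runtime { block_id: u64 },
    /// Legacy buffer with no runtime-side identity.
    Legacy,
    /// External memory such as a DLPack or Arrow device column.
    External,
}

/// Returns the block ids that would be tracked for a launch.
/// Permissive mode silently skips untrackable buffers; strict mode
/// fails on the first one it meets.
fn tracked_blocks(mode: ToyMode, buffers: &[ToyBuffer]) -> Result<Vec<u64>, String> {
    let mut ids = Vec::new();
    for (index, buffer) in buffers.iter().enumerate() {
        match (*buffer, mode) {
            (ToyBuffer::Runtime { block_id }, _) => ids.push(block_id),
            // Silent skip: no event will ever be attached to this buffer.
            (_, ToyMode::Permissive) => {}
            (other, ToyMode::Strict) => {
                return Err(format!("buffer {index} ({other:?}) has no runtime block"));
            }
        }
    }
    Ok(ids)
}
```

In this model a mixed buffer list yields a shorter tracked list under permissive mode and an error under strict mode, which is why strict mode suits migrated production paths.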
§Preflight + commit
Production callers split the recorder into TWO phases around the actual CUDA call:
- Build the recorder, register every buffer the launch will touch via read / write / read_write / read_column / write_column before enqueueing any CUDA work. Fresh output buffers go through the same write / write_column API; there is no separate post-launch path. The recorder snapshots the block id at record time and immediately drops the slice borrow, so callers can take &mut afterwards.
- Call LaunchRecorder::preflight BEFORE enqueueing any CUDA work. Preflight verifies the active resource supports cross-stream tracking and (in strict mode) that every recorded buffer has a runtime block, then queues the cross-stream waits required by each recorded access kind via crate::device_runtime::XlogDeviceRuntime::prepare_block_use. On failure no CUDA work has been queued yet.
- Enqueue the CUDA call on launch_stream.
- Call LaunchRecorder::commit AFTER the launch is enqueued. Commit calls finish_block_use on each tracked block: the runtime records its event on launch_stream at this point, and that event becomes part of the block’s dependency state for future readers / writers and the eventual deallocate.
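The ordering of the steps above can be sketched with a mock that only logs calls. ToyRecorder, MockRuntime, ToyAccess and launch_with_recorder are hypothetical names invented for this sketch; the real recorder's signatures are not shown in this page and may differ. The sketch records block ids rather than slices, which loosely mirrors the "snapshot the block id, drop the borrow" behaviour.

```rust
/// Hypothetical access kinds, mirroring read / write / read_write.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ToyAccess {
    Read,
    Write,
    ReadWrite,
}

/// Mock runtime that records the order of calls; it performs no CUDA work.
#[derive(Default)]
struct MockRuntime {
    log: Vec<String>,
}

/// Simplified recorder that stores block ids instead of slice borrows.
struct ToyRecorder {
    launch_stream: u32,
    uses: Vec<(u64, ToyAccess)>,
}

impl ToyRecorder {
    fn new(launch_stream: u32) -> Self {
        ToyRecorder { launch_stream, uses: Vec::new() }
    }

    /// Register a buffer use before any work is enqueued.
    fn record(&mut self, block_id: u64, access: ToyAccess) {
        self.uses.push((block_id, access));
    }

    /// Phase one: stands in for queueing cross-stream waits per block.
    fn preflight(&self, rt: &mut MockRuntime) -> Result<(), String> {
        let stream = self.launch_stream;
        for &(block, access) in &self.uses {
            rt.log.push(format!("prepare block {block} ({access:?}) on stream {stream}"));
        }
        Ok(())
    }

    /// Phase two: stands in for recording a use-event per block.
    fn commit(self, rt: &mut MockRuntime) {
        let stream = self.launch_stream;
        for &(block, _) in &self.uses {
            rt.log.push(format!("finish block {block} on stream {stream}"));
        }
    }
}

/// Record, preflight, launch, commit, in that order.
fn launch_with_recorder(rt: &mut MockRuntime) -> Result<(), String> {
    let mut recorder = ToyRecorder::new(1);
    recorder.record(10, ToyAccess::Read);
    recorder.record(11, ToyAccess::Write);
    recorder.preflight(rt)?; // on failure, nothing has been launched yet
    rt.log.push("launch kernel on stream 1".to_string());
    recorder.commit(rt);
    Ok(())
}
```

Because preflight returns a Result and runs before the launch line, an early return via ? leaves the mock log without any launch entry, matching the "on failure no CUDA work has been queued yet" guarantee.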
§Why preflight queues waits, not just validates
Earlier revisions only validated the resource stack at
preflight and queued waits implicitly via deallocate.
That protected free-after-use but NOT use-after-prior-write
across streams: if sort writes column A on stream X and
join reads column A on stream Y, the join’s read kernel
could observe sort’s pre-write contents because no event
fenced X→Y. This recorder closes that gap by queuing
cuStreamWaitEvent calls in preflight, before the join
kernel is enqueued on Y, against sort’s recorded write
event on X.
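The sort/join scenario can be modelled without a GPU. The sketch below is a toy dependency tracker, not the runtime's implementation: DepState, prepare_read, finish_write and sort_then_join are hypothetical names, and a pushed (waiter, producer) pair stands in for one cuStreamWaitEvent call.

```rust
use std::collections::HashMap;

type StreamId = u32;
type BlockId = u64;

/// Toy dependency state: which stream last wrote each block, plus the
/// waits that preflight would have queued.
#[derive(Default)]
struct DepState {
    last_write: HashMap<BlockId, StreamId>,
    /// (waiting stream, stream that recorded the write event).
    waits: Vec<(StreamId, StreamId)>,
}

impl DepState {
    /// Preflight for a read: fence against a prior write on another stream.
    fn prepare_read(&mut self, block: BlockId, launch_stream: StreamId) {
        if let Some(&writer) = self.last_write.get(&block) {
            // Same-stream work is already ordered; only cross-stream needs a wait.
            if writer != launch_stream {
                self.waits.push((launch_stream, writer));
            }
        }
    }

    /// Commit for a write: remember the stream that recorded the event.
    fn finish_write(&mut self, block: BlockId, launch_stream: StreamId) {
        self.last_write.insert(block, launch_stream);
    }
}

/// Sort writes column A on stream X, then join reads it on stream Y.
fn sort_then_join() -> Vec<(StreamId, StreamId)> {
    const X: StreamId = 0;
    const Y: StreamId = 1;
    const COLUMN_A: BlockId = 42;

    let mut deps = DepState::default();
    deps.finish_write(COLUMN_A, X); // sort's commit
    deps.prepare_read(COLUMN_A, Y); // join's preflight queues the Y-waits-on-X fence
    deps.waits
}
```

In this model the join's preflight yields exactly one wait, from Y on X. If waits were only queued at deallocate time, as in the earlier revisions described above, the list would still be empty when the join kernel is enqueued.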
§External memory (DLPack, Arrow device)
Strict mode rejects crate::memory::CudaColumn::Dlpack
and crate::memory::CudaColumn::ArrowDevice columns
outright. External memory has no xlog-side runtime identity
— the prepare/finish APIs cannot attach events to a buffer
the runtime did not allocate. Callers that need to consume
external columns must either:
- use a permissive recorder (and accept that no cross-stream safety applies to those buffers), or
- synchronize externally (e.g., wait on the producing framework’s stream / event before queueing xlog work).
Permissive mode skips external columns silently, matching the legacy-buffer policy.
Structs§
- LaunchRecorder: Records buffer uses for a single launch / copy on launch_stream. Drop without commit is a programmer error; the recorder logs (debug builds only) and never panics.
Enums§
- RecorderMode: Recorder construction mode.