Module launch

Launch / use recorder for runtime-backed buffers.

Closes the production side of the cross-stream lifetime gap identified by A4 and the use-after-prior-write hazard discovered by the multi-threaded sort+hash-join regression. Code that enqueues kernels or copies on a launch_stream other than the buffer’s alloc_stream MUST tell the runtime about the use twice: BEFORE the launch, so that waits on prior cross-stream work can be queued ahead of the new work, and AFTER the launch, so that a use-event is recorded for future readers / writers and for the eventual deallocate.

Without the preflight / commit pair, the CUDA mempool is free to reuse the address while the cross-stream work is still in flight, AND prior writes / reads on a different stream remain unordered with respect to the new work, so kernels can read torn state.

§Modes

Two construction modes:

  • LaunchRecorder::new_permissive — silently skips buffers that have no runtime-side identity (legacy cudarc-backed TrackedCudaSlice, external Dlpack / ArrowDevice columns). Intended for low-level helpers during the migration window where mixed legacy/runtime calls are unavoidable. Not safe for production migrated paths — silent skips are silent gaps.

  • LaunchRecorder::new_strict — rejects any buffer that cannot be tracked. Intended for production migrated launch paths: any buffer the recorder cannot attach an event to is a structural problem the caller must fix (route the allocation through the runtime, or refuse external memory in this code path).
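The difference between the two modes can be illustrated with a small stand-alone model. This is only a sketch: SketchMode, SketchBuffer and tracked_blocks are hypothetical names invented for the illustration and are not the crate's RecorderMode or LaunchRecorder API. A buffer with no runtime-side identity is modelled as a missing block id.

```rust
/// Hypothetical stand-in for the recorder mode (not the crate's `RecorderMode`).
#[derive(Clone, Copy, Debug, PartialEq)]
enum SketchMode {
    Permissive,
    Strict,
}

/// Hypothetical buffer: `None` models legacy or external memory that has
/// no runtime-side identity.
#[derive(Clone, Copy, Debug)]
struct SketchBuffer {
    block_id: Option<u64>,
}

/// Returns the block ids that would be tracked for a launch.
/// Permissive mode silently skips untracked buffers; strict mode rejects them.
fn tracked_blocks(mode: SketchMode, buffers: &[SketchBuffer]) -> Result<Vec<u64>, String> {
    let mut tracked = Vec::new();
    for (index, buffer) in buffers.iter().enumerate() {
        match (buffer.block_id, mode) {
            (Some(id), _) => tracked.push(id),
            // Silent skip: this buffer gets no cross-stream protection.
            (None, SketchMode::Permissive) => {}
            (None, SketchMode::Strict) => {
                return Err(format!("buffer {} has no runtime block", index));
            }
        }
    }
    Ok(tracked)
}
```

In this model a permissive call with one tracked and one untracked buffer returns only the tracked id, which is exactly the "silent gap" the strict mode is meant to surface as an error.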

§Preflight + commit

Production callers split the recorder's work into TWO phases, preflight and commit, around the actual CUDA call. The full sequence has four steps:

  1. Build the recorder, register every buffer the launch will touch via read / write / read_write / read_column / write_column before enqueueing any CUDA work. Fresh output buffers go through the same write / write_column API — there is no separate post-launch path. The recorder snapshots the block id at record time and immediately drops the slice borrow, so callers can take &mut afterwards.
  2. Call LaunchRecorder::preflight BEFORE enqueueing any CUDA work. Preflight verifies the active resource supports cross-stream tracking and (in strict mode) that every recorded buffer has a runtime block, then queues the cross-stream waits required by each recorded access kind via crate::device_runtime::XlogDeviceRuntime::prepare_block_use. On failure no CUDA work has been queued yet.
  3. Enqueue the CUDA call on launch_stream.
  4. Call LaunchRecorder::commit AFTER the launch is enqueued. Commit calls finish_block_use on each tracked block — the runtime records its event on launch_stream at this point, and that event becomes part of the block’s dependency state for future readers / writers and the eventual deallocate.
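The ordering of the four steps above can be sketched with mock types. Everything here is an assumption made for illustration: MockRuntime, MockRecorder, Access and run_launch are hypothetical, and the method signatures only borrow the names prepare_block_use, finish_block_use, preflight and commit from the description; the real signatures may differ. The mock runtime simply logs calls so the ordering can be inspected.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Access {
    Read,
    Write,
}

/// Hypothetical runtime that records the order of calls instead of touching CUDA.
#[derive(Default)]
struct MockRuntime {
    log: Vec<String>,
}

impl MockRuntime {
    fn prepare_block_use(&mut self, block: u64, access: Access) {
        self.log.push(format!("prepare {} {:?}", block, access));
    }
    fn finish_block_use(&mut self, block: u64) {
        self.log.push(format!("finish {}", block));
    }
    fn enqueue_kernel(&mut self, name: &str) {
        self.log.push(format!("launch {}", name));
    }
}

/// Hypothetical strict recorder: stores only block ids, never the slices.
struct MockRecorder {
    uses: Vec<(Option<u64>, Access)>,
}

impl MockRecorder {
    fn new_strict() -> Self {
        MockRecorder { uses: Vec::new() }
    }
    fn read(&mut self, block: Option<u64>) {
        self.uses.push((block, Access::Read));
    }
    fn write(&mut self, block: Option<u64>) {
        self.uses.push((block, Access::Write));
    }
    /// Validates every recorded use first, so a failure queues nothing.
    fn preflight(&self, rt: &mut MockRuntime) -> Result<(), String> {
        if self.uses.iter().any(|(block, _)| block.is_none()) {
            return Err("strict recorder: buffer without a runtime block".to_string());
        }
        for (block, access) in &self.uses {
            if let Some(id) = block {
                rt.prepare_block_use(*id, *access);
            }
        }
        Ok(())
    }
    /// Consumes the recorder and records a use on every tracked block.
    fn commit(self, rt: &mut MockRuntime) {
        for (block, _) in &self.uses {
            if let Some(id) = block {
                rt.finish_block_use(*id);
            }
        }
    }
}

/// Steps 1 to 4: record, preflight, enqueue, commit.
fn run_launch(rt: &mut MockRuntime, input: Option<u64>, output: Option<u64>) -> Result<(), String> {
    let mut recorder = MockRecorder::new_strict();
    recorder.read(input); // step 1: register every buffer
    recorder.write(output); // fresh outputs use the same write path
    recorder.preflight(rt)?; // step 2: fails before any work is queued
    rt.enqueue_kernel("join"); // step 3: the actual launch
    recorder.commit(rt); // step 4: record the use
    Ok(())
}
```

Calling run_launch with two tracked blocks leaves the log in the order prepare, prepare, launch, finish, finish; calling it with an untracked output returns an error and leaves the log empty, mirroring the guarantee that a preflight failure happens before any CUDA work is queued.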

§Why preflight queues waits, not just validates

Earlier revisions only validated the resource stack at preflight and queued waits implicitly via deallocate. That protected free-after-use but NOT use-after-prior-write across streams: if sort writes column A on stream X and join reads column A on stream Y, the join’s read kernel could observe sort’s pre-write contents because no event fenced X→Y. This recorder closes that gap by queuing cuStreamWaitEvent calls in preflight, before the join kernel is enqueued on Y, against sort’s recorded write event on X.
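The sort / join hazard can be modelled without a GPU. The sketch below is a toy ordering model under stated assumptions: Event, Stream, BlockState, finish_write, prepare_read and read_is_fenced are hypothetical names, and wait_event merely stands in for a cuStreamWaitEvent call. A read counts as fenced only if the reading stream is the writing stream or has waited on the write's event.

```rust
use std::collections::HashSet;

/// Hypothetical event: identifies a point in one stream's work queue.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct Event {
    stream: u32,
    seq: u64,
}

/// Hypothetical per-block dependency state: only the last write is tracked here.
#[derive(Default)]
struct BlockState {
    last_write: Option<Event>,
}

/// Hypothetical stream: remembers which foreign events it has waited on.
struct Stream {
    id: u32,
    next_seq: u64,
    waited: HashSet<Event>,
}

impl Stream {
    fn new(id: u32) -> Self {
        Stream { id, next_seq: 0, waited: HashSet::new() }
    }
    fn record_event(&mut self) -> Event {
        let event = Event { stream: self.id, seq: self.next_seq };
        self.next_seq += 1;
        event
    }
    /// Stands in for queuing a stream-wait on `event`.
    fn wait_event(&mut self, event: Event) {
        self.waited.insert(event);
    }
}

/// Commit side of a write: record an event on the writer's stream.
fn finish_write(block: &mut BlockState, writer: &mut Stream) {
    block.last_write = Some(writer.record_event());
}

/// Preflight side of a read: wait on the prior write if it is on another stream.
fn prepare_read(block: &BlockState, reader: &mut Stream) {
    if let Some(event) = block.last_write {
        if event.stream != reader.id {
            reader.wait_event(event);
        }
    }
}

/// True if the reader is ordered after the block's last write.
fn read_is_fenced(block: &BlockState, reader: &Stream) -> bool {
    block
        .last_write
        .map_or(true, |event| event.stream == reader.id || reader.waited.contains(&event))
}
```

In this model, after sort finishes a write on stream X, a read on stream Y is unfenced until prepare_read runs for Y. That corresponds to the gap left by validation-only preflight and to the wait that the current preflight queues.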

§External memory (DLPack, Arrow device)

Strict mode rejects crate::memory::CudaColumn::Dlpack and crate::memory::CudaColumn::ArrowDevice columns outright. External memory has no xlog-side runtime identity — the prepare/finish APIs cannot attach events to a buffer the runtime did not allocate. Callers that need to consume external columns must either:

  • use a permissive recorder (and accept that no cross-stream safety applies to those buffers), or
  • synchronize externally (e.g., wait on the producing framework’s stream / event before queueing xlog work).

Permissive mode skips external columns silently, matching the legacy-buffer policy.

Structs§

LaunchRecorder
Records buffer uses for a single launch / copy on launch_stream. Dropping the recorder without calling commit is a programmer error; the recorder logs it (debug builds only) and never panics.

Enums§

RecorderMode
Recorder construction mode.