pub struct CudaKernelProvider { /* private fields */ }Expand description
CUDA kernel provider for xlog GPU operations
Manages pre-compiled PTX modules for relational operations:
- Join: Hash join with build/probe phases
- Dedup: Sort-based deduplication with prefix-sum compaction
- GroupBy: Sorted-input group aggregation (count, sum, min, max)
PTX modules are loaded at construction time and stored in the CUDA device.
Kernel functions can be retrieved using device.get_func().
§Example
use std::sync::Arc;
use xlog_cuda::{CudaDevice, GpuMemoryManager, CudaKernelProvider};
use xlog_core::MemoryBudget;
let device = Arc::new(CudaDevice::new(0)?);
let memory = Arc::new(GpuMemoryManager::new(device.clone(), MemoryBudget::default()));
let provider = CudaKernelProvider::new(device, memory)?;Implementations§
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn to_dlpack_table(&self, buffer: CudaBuffer) -> DlpackTable
pub fn to_dlpack_table(&self, buffer: CudaBuffer) -> DlpackTable
Convert a CudaBuffer into a DLPack-exportable table without device↔host copies.
Export each column with DlpackTable::column(...).
Sourcepub fn from_dlpack_tensors(
&self,
tensors: Vec<DlpackManagedTensor>,
) -> Result<CudaBuffer>
pub fn from_dlpack_tensors( &self, tensors: Vec<DlpackManagedTensor>, ) -> Result<CudaBuffer>
Import one DLPack tensor per column as a zero-copy CudaBuffer.
The returned buffer owns the DLPack tensors and will call their deleters on drop.
Sourcepub fn from_dlpack_tensors_with_schema(
&self,
schema: Schema,
tensors: Vec<DlpackManagedTensor>,
) -> Result<CudaBuffer>
pub fn from_dlpack_tensors_with_schema( &self, schema: Schema, tensors: Vec<DlpackManagedTensor>, ) -> Result<CudaBuffer>
Import DLPack column tensors with an explicit schema (type-checked).
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn create_constant_column(
&self,
value_bytes: &[u8],
col_type: ScalarType,
num_rows: u64,
) -> Result<CudaBuffer>
pub fn create_constant_column( &self, value_bytes: &[u8], col_type: ScalarType, num_rows: u64, ) -> Result<CudaBuffer>
Sourcepub fn create_constant_column_with_device_count(
&self,
value_bytes: &[u8],
col_type: ScalarType,
row_cap: u64,
d_num_rows_src: &TrackedCudaSlice<u32>,
) -> Result<CudaBuffer>
pub fn create_constant_column_with_device_count( &self, value_bytes: &[u8], col_type: ScalarType, row_cap: u64, d_num_rows_src: &TrackedCudaSlice<u32>, ) -> Result<CudaBuffer>
Create a constant column sized to row_cap while preserving device row count from d_num_rows_src.
Sourcepub fn add_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn add_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise addition of two single-column buffers
Performs element-wise addition using GPU kernels. Uses wrapping arithmetic for integer overflow.
§Arguments
a- First operand buffer (single column)b- Second operand buffer (single column)
§Returns
A new CudaBuffer containing the element-wise sum
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn sub_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn sub_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise subtraction of two single-column buffers
Performs element-wise subtraction using GPU kernels. Uses wrapping arithmetic for integer overflow.
§Arguments
a- First operand buffer (single column)b- Second operand buffer (single column)
§Returns
A new CudaBuffer containing the element-wise difference
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn mul_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn mul_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise multiplication of two single-column buffers
Performs element-wise multiplication using GPU kernels. Uses wrapping arithmetic for integer overflow.
§Arguments
a- First operand buffer (single column)b- Second operand buffer (single column)
§Returns
A new CudaBuffer containing the element-wise product
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn div_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn div_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise division of two single-column buffers
Performs element-wise division using GPU kernels. For signed integers, division by zero returns i64::MAX/i32::MAX. For unsigned integers, division by zero returns u64::MAX/u32::MAX. For floats, division by zero produces Inf/NaN as per IEEE 754.
§Arguments
a- Dividend buffer (single column)b- Divisor buffer (single column)
§Returns
A new CudaBuffer containing the element-wise quotient
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn mod_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn mod_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise modulo of two single-column buffers
Performs element-wise modulo using GPU kernels. For integers, modulo by zero returns 0. For floats, modulo by zero produces NaN as per IEEE 754.
§Arguments
a- Dividend buffer (single column)b- Divisor buffer (single column)
§Returns
A new CudaBuffer containing the element-wise remainder
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn abs_column(&self, a: &CudaBuffer) -> Result<CudaBuffer>
pub fn abs_column(&self, a: &CudaBuffer) -> Result<CudaBuffer>
Element-wise absolute value of a single-column buffer
Performs element-wise absolute value using GPU kernels.
§Arguments
a- Input buffer (single column)
§Returns
A new CudaBuffer containing the absolute values
§Errors
Returns XlogError::Kernel if:
- Buffer is not single-column
- Type is not supported for arithmetic
Sourcepub fn min_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn min_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise minimum of two single-column buffers
Performs element-wise minimum using GPU kernels.
§Arguments
a- First operand buffer (single column)b- Second operand buffer (single column)
§Returns
A new CudaBuffer containing the element-wise minimums
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn max_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn max_columns(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Element-wise maximum of two single-column buffers
Performs element-wise maximum using GPU kernels.
§Arguments
a- First operand buffer (single column)b- Second operand buffer (single column)
§Returns
A new CudaBuffer containing the element-wise maximums
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn pow_columns(
&self,
base: &CudaBuffer,
exp: &CudaBuffer,
) -> Result<CudaBuffer>
pub fn pow_columns( &self, base: &CudaBuffer, exp: &CudaBuffer, ) -> Result<CudaBuffer>
Element-wise power of two single-column buffers
Converts both operands to f64, computes x^y on the GPU, and returns f64 result. This matches the behavior of most database systems where pow() returns a float.
§Arguments
base- Base values buffer (single column)exp- Exponent values buffer (single column)
§Returns
A new CudaBuffer containing the element-wise powers as f64
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Type is not supported for arithmetic
Sourcepub fn select_columns(
&self,
mask: &CudaBuffer,
then_vals: &CudaBuffer,
else_vals: &CudaBuffer,
) -> Result<CudaBuffer>
pub fn select_columns( &self, mask: &CudaBuffer, then_vals: &CudaBuffer, else_vals: &CudaBuffer, ) -> Result<CudaBuffer>
Conditional select between two single-column buffers based on a boolean mask.
For each row: out[i] = mask[i] ? then_vals[i] : else_vals[i]
§Arguments
mask- Boolean mask buffer (single column, type Bool/u8)then_vals- Values to select when mask is trueelse_vals- Values to select when mask is false
§Returns
A new CudaBuffer with values selected based on the mask
§Errors
Returns XlogError::Kernel if:
- Row counts don’t match
- Buffers are not single-column
- Types of then/else values don’t match
Sourcepub fn cast_column(
&self,
a: &CudaBuffer,
target: ScalarType,
) -> Result<CudaBuffer>
pub fn cast_column( &self, a: &CudaBuffer, target: ScalarType, ) -> Result<CudaBuffer>
Cast a single-column buffer to a different type
Casts data on the GPU using the arithmetic cast kernel.
§Arguments
a- Input buffer (single column)target- Target scalar type
§Returns
A new CudaBuffer with the cast values
§Errors
Returns XlogError::Kernel if:
- Buffer is not single-column
- Source or target type is not supported for casting
Sourcepub fn combine_columns(
&self,
columns: Vec<CudaBuffer>,
types: Vec<ScalarType>,
) -> Result<CudaBuffer>
pub fn combine_columns( &self, columns: Vec<CudaBuffer>, types: Vec<ScalarType>, ) -> Result<CudaBuffer>
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn compare_columns<T: GpuScalar>(
&self,
input: &CudaBuffer,
left: usize,
right: usize,
op: CompareOp,
) -> Result<TrackedCudaSlice<u8>>
pub fn compare_columns<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, ) -> Result<TrackedCudaSlice<u8>>
Generic compare-columns: produce a device mask for left <op> right.
Replaces: compare_columns_u32, compare_columns_i32, compare_columns_i64,
compare_columns_u64, compare_columns_f32, compare_columns_f64,
compare_columns_u8.
Sourcepub fn filter<T: GpuScalar>(
&self,
input: &CudaBuffer,
col: usize,
value: T,
op: CompareOp,
) -> Result<CudaBuffer>
pub fn filter<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, ) -> Result<CudaBuffer>
Generic filter: keep rows where column[col] <op> value.
Dispatches between fused compare+scan+compact (u32, f64) and mask+compact (all other types).
Replaces: filter_u32, filter_f64, filter_i32, filter_u64,
filter_f32, filter_bool.
Sourcepub fn filter_recorded<T: GpuScalar>(
&self,
input: &CudaBuffer,
col: usize,
value: T,
op: CompareOp,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn filter_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder, end-to-end variant of Self::filter —
the first composed migrated DATA path.
Composes Self::compare_const_mask_recorded and
Self::compact_buffer_by_device_mask_counted_recorded
on a single launch_stream. Each primitive builds its
own crate::launch::LaunchRecorder, records its uses,
preflights, runs its kernels, and commits independently.
Composition correctness rests on the runtime’s
“record-all, wait-all” semantics: every
record_block_use call APPENDS a fresh event to the
live entry’s last_use_events: Vec<CudaEvent>, and
deallocate waits on EVERY event in that vector before
queueing cuMemFreeAsync. So the compare’s commit and
the compact’s later commit each push their own event
for input.column[i] (and other shared buffers), and
the deallocate gates the free behind both — closing
the cross-stream lifetime gap end-to-end. (Latest-event
coalescing per (block, launch_stream) is a possible
future optimization; today every recorded use is
retained and waited on.)
§Dispatch
For types with a fused filter_compare_*_scan_phase1
kernel (u32, f64), routes to
Self::filter_fused_scan_recorded — single-pass
compare+scan+compact mirror of the legacy fast path.
For all other types, composes
Self::compare_const_mask_recorded +
Self::compact_buffer_by_device_mask_counted_recorded.
§Errors
Propagates the structured XlogError::Kernel errors
produced by either underlying recorded primitive
(legacy manager, unresolved launch_stream, external
column, preflight / commit failures, kernel launch
failures, cu_stream.synchronize() before host scalar
read).
Sourcepub fn filter_fused_scan_recorded<T: GpuScalar>(
&self,
input: &CudaBuffer,
col: usize,
value: T,
op: CompareOp,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn filter_fused_scan_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder variant of Self::filter_fused_scan —
the migrated fused compare+scan+compact fast path for
u32 and f64.
Mirrors the legacy chain on a single explicit
launch_stream:
filter_compare_T_scan_phase1— fused compare + block-local scan that producesd_mask,d_prefix_sum,d_block_sumsin one launch.- When
num_blocks > 1,multiblock_scan_u32_inplace_on_streamond_block_sumsfollowed bymultiblock_scan_phase3to propagate block offsets intod_prefix_sum. capture_compact_count— writesd_out_countfor the masked total.cu_stream.synchronize()— explicitly orders the host scalar read ofd_out_countagainst the pending capture kernel.dtoh_scalar_untracked(&d_out_count, 0)→output_rows.- Per-input-column
compact_bytes_by_maskon the samelaunch_stream.
§Strict-mode contract
Identical to
Self::compact_buffer_by_device_mask_counted_recorded:
input.num_rows_device() and every input.column(i)
recorded as reads BEFORE preflight; every fresh
runtime-backed allocation (d_mask, d_prefix_sum,
d_block_sums, d_out_count, each dst_col) recorded
via write BEFORE preflight; the recorder snapshots
block identity at record time and drops the source
borrow, so kernel &mut borrows after preflight remain
valid before the kernels enqueue.
§Panics
T::filter_scan_phase1_kernel() must be Some —
callers should only reach this method for u32 / f64.
Sourcepub fn filter_columns_recorded<T: GpuScalar>(
&self,
input: &CudaBuffer,
left: usize,
right: usize,
op: CompareOp,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn filter_columns_recorded<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder, end-to-end variant of column-column
filter: keep rows where column[left] <op> column[right].
Composes Self::compare_columns_mask_recorded and
Self::compact_buffer_by_device_mask_counted_recorded
on a single launch_stream. Same composition contract
as Self::filter_recorded: each primitive builds its
own recorder and commits independently; the runtime
appends every recorded event to last_use_events, and
deallocate waits on every event, so input columns
referenced by BOTH the compare AND the per-column
compacts are correctly gated end-to-end.
§Errors
Propagates the structured XlogError::Kernel errors
produced by either underlying recorded primitive
(legacy manager, unresolved launch_stream, external
column on either input side, preflight / commit
failures, kernel launch failures,
cu_stream.synchronize() before host scalar read).
Sourcepub fn compare_const_mask_recorded<T: GpuScalar>(
&self,
input: &CudaBuffer,
col: usize,
value: T,
op: CompareOp,
launch_stream: StreamId,
) -> Result<TrackedCudaSlice<u8>>
pub fn compare_const_mask_recorded<T: GpuScalar>( &self, input: &CudaBuffer, col: usize, value: T, op: CompareOp, launch_stream: StreamId, ) -> Result<TrackedCudaSlice<u8>>
Strict-recorder variant of Self::compare_const_mask.
Runs the filter compare kernel on the caller-supplied
launch_stream and threads the column read through the
runtime via LaunchRecorder. This is the second
migrated launch path (after memset_recorded) and the
first kernel-driven one — it is intentionally a sibling
of the legacy Self::compare_const_mask rather than a
replacement. Existing callers stay on the legacy path
until the broader filter migration lands.
§Strict-mode contract
- Requires the provider’s manager to be built via
crate::GpuMemoryManager::with_runtime; otherwise returnsXlogError::Kernelbefore any allocation. input.column(col)is recorded as a read; external (CudaColumn::Dlpack/CudaColumn::ArrowDevice) columns are rejected at preflight, before the kernel is enqueued.d_maskis freshly allocated through the same runtime-backed manager. By construction itsruntime_block()isSome, so its write recording cannot strict-reject. The write is therefore noted AFTER the kernel is enqueued — this sidesteps the borrow conflict between&mut d_mask(cudarc kernel param) and&d_mask(recorder). A future migration that may write to a buffer of unknown provenance must instead capture identity pre-launch (e.g. via a raw view) so strict rejection happens at preflight.
§Errors
XlogError::Kernelif the manager has no runtime, or iflaunch_streamdoes not resolve.XlogError::Kernelfrom preflight (external column, unsupported active resource).XlogError::Kernelfrom the underlying CUDA launch.XlogError::Kernelfrom commit on transientrecord_block_usefailure.
Sourcepub fn compare_columns_mask_recorded<T: GpuScalar>(
&self,
input: &CudaBuffer,
left: usize,
right: usize,
op: CompareOp,
launch_stream: StreamId,
) -> Result<TrackedCudaSlice<u8>>
pub fn compare_columns_mask_recorded<T: GpuScalar>( &self, input: &CudaBuffer, left: usize, right: usize, op: CompareOp, launch_stream: StreamId, ) -> Result<TrackedCudaSlice<u8>>
Strict-recorder variant of Self::compare_columns_mask.
Runs the column-column compare kernel on the
caller-supplied launch_stream and threads BOTH column
reads through the runtime via LaunchRecorder. Sibling
of the legacy Self::compare_columns_mask; existing
callers stay on the legacy path.
§Strict-mode contract
- Requires the provider’s manager to be built via
crate::GpuMemoryManager::with_runtime; otherwise returnsXlogError::Kernelbefore any allocation. input.column(left)andinput.column(right)are both recorded as reads BEFORE preflight. External (DLPack / Arrow) columns on either side are rejected at preflight, before the kernel is enqueued.d_maskis freshly allocated by the same runtime-backed manager; its write is recorded via the standardwriteAPI BEFORE preflight (the recorder snapshots block identity, so the kernel&mut d_maskborrow after preflight is unaffected).
§Errors
XlogError::Kernelif the manager has no runtime, or iflaunch_streamdoes not resolve.XlogError::Kernelfrom preflight (external column on either side, unsupported active resource).XlogError::Kernelfrom the underlying CUDA launch.XlogError::Kernelfrom commit on transientrecord_block_usefailure.
pub fn prefix_sum_mask(&self, mask: &[u8]) -> Result<(Vec<u32>, u32)>
Sourcepub fn filter_by_mask(
&self,
input: &CudaBuffer,
mask: &[u8],
) -> Result<CudaBuffer>
pub fn filter_by_mask( &self, input: &CudaBuffer, mask: &[u8], ) -> Result<CudaBuffer>
Sourcepub fn compact_buffer_by_device_mask_counted_recorded(
&self,
input: &CudaBuffer,
d_mask: &TrackedCudaSlice<u8>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn compact_buffer_by_device_mask_counted_recorded( &self, input: &CudaBuffer, d_mask: &TrackedCudaSlice<u8>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Compact a buffer using a device-resident mask.
Computes prefix sum and output count fully on-device.
Strict-recorder variant of
Self::compact_buffer_by_device_mask_counted — the
first migrated COMPACT path.
The compact pipeline is a multi-kernel chain:
mask_clamp_rows → multiblock_scan_phase1 →
multiblock_scan_u32_inplace_on_stream (recursive,
only when num_blocks > 1) → multiblock_scan_phase3 →
capture_compact_count → host scalar read of
d_out_count → per-column compact_bytes_by_mask.
Every kernel runs on the same explicit launch_stream
via launch_on_stream, and the host scalar read at
the chain’s middle is explicitly ordered by
cu_stream.synchronize() — non-blocking streams do
NOT get default-stream implicit ordering.
§Strict-mode contract
- Requires the provider’s manager to be built via
crate::GpuMemoryManager::with_runtime; otherwise returnsXlogError::Kernelbefore any allocation. d_maskis recorded as a read.input.num_rows_device()is recorded as a read.- Each
input.column(i)is recorded as a read; external columns on any side are rejected at preflight, before any CUDA work is enqueued. - Every fresh runtime-backed allocation that this
function makes (
d_mask_clamped,d_prefix_sum,d_block_sums,d_out_count, eachdst_col) is recorded viawriteBEFORE the kernel chain enqueues. Locals that drop at end-of-scope (d_mask_clamped,d_prefix_sum,d_block_sums) stay safe because the runtime’s deallocate queuescuStreamWaitEvent(alloc_stream, recorded_event)BEFOREcuMemFreeAsync, gating the free on the launch_stream chain. - Intermediate
block_sumsallocations created by the recursive scan helper are recorded directly inside the helper (they don’t outlive the helper call).
§Errors
XlogError::Kernelif the manager has no runtime, or iflaunch_streamdoes not resolve.XlogError::Kernelfrom preflight (external column on any side, unsupported active resource).XlogError::Kernelfrom any underlying CUDA launch or from the launch_stream synchronize before the host scalar read.XlogError::Kernelfrom commit on transientrecord_block_usefailure.
pub fn compact_buffer_by_device_mask_counted( &self, input: &CudaBuffer, d_mask: &TrackedCudaSlice<u8>, ) -> Result<CudaBuffer>
pub fn filter_by_device_mask( &self, input: &CudaBuffer, d_mask: &CudaSlice<u8>, ) -> Result<CudaBuffer>
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn free_join_execute_u32_recorded(
&self,
inputs: &[&CudaBuffer],
plan: &FjPlan,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn free_join_execute_u32_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>
Execute a hand-built Free Join plan over u32/Symbol relations
via the level-synchronous frontier engine. See the module docs
for the algorithm and invariants; the plan contract is
documented on FjPlan.
Inputs are layout-normalized per dispatch (sorted + deduped via
the existing WCOJ layout entries — already-normalized inputs
take the recorded fast-path check). The output contains one
column per output_vars entry (all U32) holding the join’s
projected row set under set semantics.
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input violates the u32 width-class layout contract, the plan is invalid (unbound probe variables, over/under-consumed atom columns, rebound variables, unknown output variables), the frontier exceeds the u32 work-index space, or any kernel launch fails.
Sourcepub fn free_join_execute_u64_recorded(
&self,
inputs: &[&CudaBuffer],
plan: &FjPlan,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn free_join_execute_u64_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>
u64 width-class twin of Self::free_join_execute_u32_recorded:
identical pipeline, contract, and invariants; every
input column must be U64 and the output columns are U64.
Sourcepub fn free_join_count_by_root_u32_recorded(
&self,
inputs: &[&CudaBuffer],
plan: &FjPlan,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn free_join_count_by_root_u32_recorded( &self, inputs: &[&CudaBuffer], plan: &FjPlan, launch_stream: StreamId, ) -> Result<CudaBuffer>
Design §2.4 factorized count-by-root over the Free Join
frontier: runs the same pipeline but reduces to
(group, count) instead of materializing rows. The plan’s
output_vars must be exactly [group_var]; atoms may be
PARTIALLY consumed — each surviving frontier row contributes
the product of its remaining live trie-range lengths (the
d-representation count), so trailing private variables never
expand the frontier. Output schema: (group: U32, count: U64).
u32/Symbol width-class only: the reduction reuses the recorded groupby, whose KEY columns are bounded engine-wide to U32/Symbol (multi-type recorded sort is deferred there) — u64 bodies stay on the materialize path.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn fj_delta_novel_u32_recorded(
&self,
delta: &CudaBuffer,
edge: &CudaBuffer,
full_r: &CudaBuffer,
cols: FjDeltaCols,
domain: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn fj_delta_novel_u32_recorded( &self, delta: &CudaBuffer, edge: &CudaBuffer, full_r: &CudaBuffer, cols: FjDeltaCols, domain: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
One factorized semi-naive delta step: returns the
full-row-deduped novel set
{head(carry, value) : delta(carry, key), edge(key, value), head ∉ full_r}
with column roles given by cols (orientation- and
head-order-agnostic; the buffer is built in full_r’s schema).
edge must be layout-normalized key-first (lex-sorted,
deduped); delta and full_r are order-insensitive. All ids
must be < domain (fail-closed in-kernel check).
Sourcepub fn fj_delta_columns_max_u32(
&self,
inputs: &[(&CudaBuffer, &[usize])],
launch_stream: StreamId,
) -> Result<u32>
pub fn fj_delta_columns_max_u32( &self, inputs: &[(&CudaBuffer, &[usize])], launch_stream: StreamId, ) -> Result<u32>
Max value over the given u32/Symbol columns of the given buffers (one atomicMax kernel launch per column into a single zeroed cell). Used once per SCC fixpoint to derive the factorized-delta domain bound. Returns 0 for all-empty inputs.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn fj_delta_sparse_novel_u32_recorded(
&self,
delta: &CudaBuffer,
edge: &CudaBuffer,
full_r: &CudaBuffer,
cols: FjDeltaCols,
max_table_bytes: u64,
launch_stream: StreamId,
) -> Result<Option<CudaBuffer>>
pub fn fj_delta_sparse_novel_u32_recorded( &self, delta: &CudaBuffer, edge: &CudaBuffer, full_r: &CudaBuffer, cols: FjDeltaCols, max_table_bytes: u64, launch_stream: StreamId, ) -> Result<Option<CudaBuffer>>
Sparse-domain twin of Self::fj_delta_novel_u32_recorded: one
factorized semi-naive delta step over a hash set, with no domain
cap. Forbids the single key (u32::MAX, u32::MAX) (its packed
key+1 overflows the empty sentinel) — fails closed if present.
Returns Ok(None) when the distinct-sized hash table
(2×(|R| + distinct-candidate estimate), power of two) would
exceed max_table_bytes, or when an insert overflows an
under-sized table — both are clean route-declines so the caller
falls back to the legacy path. max_table_bytes == 0 disables
the budget guard (standalone spike/parity tests).
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn groupby_agg(
&self,
input: &CudaBuffer,
key_cols: &[usize],
agg: AggOp,
value_col: usize,
) -> Result<CudaBuffer>
pub fn groupby_agg( &self, input: &CudaBuffer, key_cols: &[usize], agg: AggOp, value_col: usize, ) -> Result<CudaBuffer>
Perform groupby aggregation
Assumes input is already sorted by key columns.
§Arguments
input- The input bufferkey_cols- Column indices for groupingagg- Aggregation operation to performvalue_col- Column index for the value to aggregate
§Returns
A buffer with one row per group, containing key columns and aggregated value
§Errors
Returns XlogError::Kernel if kernel execution fails
Sourcepub fn groupby_multi_agg(
&self,
buffer: &CudaBuffer,
key_cols: &[usize],
aggs: &[(usize, AggOp)],
) -> Result<CudaBuffer>
pub fn groupby_multi_agg( &self, buffer: &CudaBuffer, key_cols: &[usize], aggs: &[(usize, AggOp)], ) -> Result<CudaBuffer>
Multi-aggregation groupby
Performs groupby with multiple aggregation operations at once. This is more efficient than running separate groupby operations because it only sorts and computes group boundaries once.
§Arguments
buffer- The input bufferkey_cols- Column indices for grouping (currently only single-column supported)aggs- A slice of (value_col, AggOp) pairs specifying which aggregations to perform
§Returns
A buffer with one row per group, containing key columns followed by aggregated values
in the same order as the aggs parameter
§Errors
Returns XlogError::Kernel if kernel execution fails
§Example
let result = provider.groupby_multi_agg(
&buffer,
&[0], // group by column 0
&[(1, AggOp::Sum), (1, AggOp::Count), (1, AggOp::Min)],
)?;
// result has columns: key, sum, count, minSourcepub fn groupby_multi_agg_recorded(
&self,
buffer: &CudaBuffer,
key_cols: &[usize],
aggs: &[(usize, AggOp)],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn groupby_multi_agg_recorded( &self, buffer: &CudaBuffer, key_cols: &[usize], aggs: &[(usize, AggOp)], launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder variant of Self::groupby_multi_agg.
Sort + pack + boundary detect + scan + capture-num-groups
- group-id derivation + per-aggregation kernels + key
gather/unpack — every kernel runs on the caller-supplied
launch_streamvialaunch_on_stream. Composition with existing recorded primitives:sort_recordeddoes the typed multi-column sort and commits its own LaunchRecorder.pack_keys_gpu_on_streamruns the fused pack+hash kernel on launch_stream and records its buffers directly viarecord_block_use.multiblock_scan_u32_inplace_on_streamdrives the boundary-position scan tail.- The groupby-specific chain has its own LaunchRecorder for the boundary mask, group ids, group_first indices, num_groups scalar, per-aggregation outputs, and key gather/unpack outputs.
Composition correctness: each recorder commits
independently; the runtime’s record-all + wait-all
last_use_events: Vec<CudaEvent> semantics chain the
deallocate safety end-to-end across the four primitive
commits.
§Scope (narrow)
- U32 / Symbol key columns only (sort_recorded constraint).
- Aggs: Count, Sum, Min, Max. LogSumExp is rejected with a structured error — its multi-kernel chain is outside this recorded provider surface.
- Manager must be runtime-backed.
Sourcepub fn groupby_agg_recorded(
&self,
input: &CudaBuffer,
key_cols: &[usize],
agg: AggOp,
value_col: usize,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn groupby_agg_recorded( &self, input: &CudaBuffer, key_cols: &[usize], agg: AggOp, value_col: usize, launch_stream: StreamId, ) -> Result<CudaBuffer>
Convenience single-aggregation entry, mirrors
Self::groupby_agg. Forwards to
Self::groupby_multi_agg_recorded.
Source§impl CudaKernelProvider
impl CudaKernelProvider
pub fn build_selected_id_mask( &self, ids_buf: &CudaBuffer, candidate_count: usize, ) -> Result<CudaBuffer>
pub fn validate_selected_ids( &self, ids_buf: &CudaBuffer, candidate_count: usize, ) -> Result<()>
pub fn filter_buffer_by_candidate_flag( &self, input: &CudaBuffer, candidate_flags: &CudaBuffer, candidate_idx: usize, ) -> Result<CudaBuffer>
Sourcepub fn ilp_coo_fill_launch(
&self,
compacted_fact_indices: &TrackedCudaSlice<u32>,
cidx: u32,
count: u32,
offset: u32,
coo_fact: &mut TrackedCudaSlice<u32>,
coo_cand: &mut TrackedCudaSlice<u32>,
) -> Result<()>
pub fn ilp_coo_fill_launch( &self, compacted_fact_indices: &TrackedCudaSlice<u32>, cidx: u32, count: u32, offset: u32, coo_fact: &mut TrackedCudaSlice<u32>, coo_cand: &mut TrackedCudaSlice<u32>, ) -> Result<()>
Launch ilp_coo_fill kernel: writes (compacted_fact_indices[i], cidx)
pairs at coo_fact[offset..] and coo_cand[offset..].
Sourcepub fn ilp_credit_forward_f32_launch(
&self,
row_offsets: &TrackedCudaSlice<u32>,
col_indices: &TrackedCudaSlice<u32>,
cand_probs: &CudaColumn,
is_positive: &TrackedCudaSlice<u8>,
num_facts: u32,
eps: f32,
) -> Result<(TrackedCudaSlice<f32>, TrackedCudaSlice<f32>)>
pub fn ilp_credit_forward_f32_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, cand_probs: &CudaColumn, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, eps: f32, ) -> Result<(TrackedCudaSlice<f32>, TrackedCudaSlice<f32>)>
Launch ilp_credit_forward_f32: CSR credit gather + clamp + NLL loss.
Returns (credit_out, loss_contrib) device slices of length num_facts.
Sourcepub fn ilp_credit_forward_f64_launch(
&self,
row_offsets: &TrackedCudaSlice<u32>,
col_indices: &TrackedCudaSlice<u32>,
cand_probs: &CudaColumn,
is_positive: &TrackedCudaSlice<u8>,
num_facts: u32,
eps: f64,
) -> Result<(TrackedCudaSlice<f64>, TrackedCudaSlice<f64>)>
pub fn ilp_credit_forward_f64_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, cand_probs: &CudaColumn, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, eps: f64, ) -> Result<(TrackedCudaSlice<f64>, TrackedCudaSlice<f64>)>
Launch ilp_credit_forward_f64: CSR credit gather + clamp + NLL loss.
Returns (credit_out, loss_contrib) device slices of length num_facts.
Sourcepub fn ilp_credit_backward_f32_launch(
&self,
row_offsets: &TrackedCudaSlice<u32>,
col_indices: &TrackedCudaSlice<u32>,
credit_out: &TrackedCudaSlice<f32>,
is_positive: &TrackedCudaSlice<u8>,
num_facts: u32,
num_cands: u32,
) -> Result<TrackedCudaSlice<f32>>
pub fn ilp_credit_backward_f32_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, credit_out: &TrackedCudaSlice<f32>, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, num_cands: u32, ) -> Result<TrackedCudaSlice<f32>>
Launch ilp_credit_backward_f32: gradient scatter via CSR + atomicAdd.
Returns d_cand_probs gradient of length num_cands (zeroed, then accumulated).
Sourcepub fn ilp_credit_backward_f64_launch(
&self,
row_offsets: &TrackedCudaSlice<u32>,
col_indices: &TrackedCudaSlice<u32>,
credit_out: &TrackedCudaSlice<f64>,
is_positive: &TrackedCudaSlice<u8>,
num_facts: u32,
num_cands: u32,
) -> Result<TrackedCudaSlice<f64>>
pub fn ilp_credit_backward_f64_launch( &self, row_offsets: &TrackedCudaSlice<u32>, col_indices: &TrackedCudaSlice<u32>, credit_out: &TrackedCudaSlice<f64>, is_positive: &TrackedCudaSlice<u8>, num_facts: u32, num_cands: u32, ) -> Result<TrackedCudaSlice<f64>>
Launch ilp_credit_backward_f64: gradient scatter via CSR + atomicAdd.
Returns d_cand_probs gradient of length num_cands (zeroed, then accumulated).
Sourcepub fn ilp_reduce_sum_f32_launch(
&self,
input: &TrackedCudaSlice<f32>,
n: u32,
) -> Result<TrackedCudaSlice<f32>>
pub fn ilp_reduce_sum_f32_launch( &self, input: &TrackedCudaSlice<f32>, n: u32, ) -> Result<TrackedCudaSlice<f32>>
GPU-side sum reduction (f32).
Sums n elements of input on device and returns a single-element
device buffer containing the result. The caller must zero the output
buffer before launching the kernel — this function handles that.
Sourcepub fn ilp_reduce_sum_f64_launch(
&self,
input: &TrackedCudaSlice<f64>,
n: u32,
) -> Result<TrackedCudaSlice<f64>>
pub fn ilp_reduce_sum_f64_launch( &self, input: &TrackedCudaSlice<f64>, n: u32, ) -> Result<TrackedCudaSlice<f64>>
GPU-side sum reduction (f64).
Sums n elements of input on device and returns a single-element
device buffer containing the result. Requires sm_60+ for double
atomicAdd (this project targets sm_75 baseline).
Sourcepub fn ilp_coo_fill_from_mask_launch(
&self,
mask: &TrackedCudaSlice<u8>,
prefix_sum: &TrackedCudaSlice<u32>,
fact_indices: &TrackedCudaSlice<u32>,
offset_idx: u32,
cand_value: u32,
num_query: u32,
d_offsets: &TrackedCudaSlice<u32>,
coo_fact: &mut TrackedCudaSlice<u32>,
coo_cand: &mut TrackedCudaSlice<u32>,
) -> Result<()>
pub fn ilp_coo_fill_from_mask_launch( &self, mask: &TrackedCudaSlice<u8>, prefix_sum: &TrackedCudaSlice<u32>, fact_indices: &TrackedCudaSlice<u32>, offset_idx: u32, cand_value: u32, num_query: u32, d_offsets: &TrackedCudaSlice<u32>, coo_fact: &mut TrackedCudaSlice<u32>, coo_cand: &mut TrackedCudaSlice<u32>, ) -> Result<()>
Fill COO arrays from a device-side mask and prefix-sum.
For each set bit in mask, writes the corresponding fact_indices entry
into coo_fact and cand_value into coo_cand at the position
determined by d_offsets[offset_idx] + prefix_sum[tid].
Parameters:
offset_idx: index intod_offsetsfor the write base positioncand_value: actual candidate index to write intocoo_cand
This keeps COO assembly fully on device, eliminating the mask D2H transfer.
Sourcepub fn ilp_csr_histogram_launch(
&self,
sorted_facts: &TrackedCudaSlice<u32>,
nnz: u32,
num_facts: u32,
) -> Result<TrackedCudaSlice<u32>>
pub fn ilp_csr_histogram_launch( &self, sorted_facts: &TrackedCudaSlice<u32>, nnz: u32, num_facts: u32, ) -> Result<TrackedCudaSlice<u32>>
Build a histogram of fact indices from sorted COO data.
For each entry in sorted_facts[0..nnz], atomically increments
the corresponding bin in the output histogram. The result is a
device-side count array of length num_facts, suitable for
prefix-sum to produce CSR row_offsets.
The caller provides sorted fact indices; the histogram is zero-initialized internally.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn ilp_exact_score_topk(
&self,
candidate_buffers: &[&CudaBuffer],
positives: &CudaBuffer,
negatives: &CudaBuffer,
k_per_topology: u32,
) -> Result<Vec<IlpExactTopkCandidate>>
pub fn ilp_exact_score_topk( &self, candidate_buffers: &[&CudaBuffer], positives: &CudaBuffer, negatives: &CudaBuffer, k_per_topology: u32, ) -> Result<Vec<IlpExactTopkCandidate>>
Score on GPU, reduce per-topology top-K on GPU, and transfer only the compact selected rows back to host.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn create_buffer_from_u32_columns(
&self,
columns: &[&[u32]],
schema: Schema,
) -> Result<CudaBuffer>
pub fn create_buffer_from_u32_columns( &self, columns: &[&[u32]], schema: Schema, ) -> Result<CudaBuffer>
Sourcepub fn create_buffer_from_slices(
&self,
slices: &[&[u8]],
schema: Schema,
) -> Result<CudaBuffer>
pub fn create_buffer_from_slices( &self, slices: &[&[u8]], schema: Schema, ) -> Result<CudaBuffer>
Create a buffer from multiple column slices (raw bytes)
This is a generic version that works with any column type by accepting raw byte slices. Each slice should contain the column data in little-endian format with the correct size for the column’s type.
§Arguments
slices- Slice of raw byte slices, one per columnschema- The schema for the buffer
§Returns
A new CudaBuffer containing all columns
§Errors
Returns XlogError::Kernel if:
- Number of slices doesn’t match schema arity
- Upload fails
Sourcepub fn to_arrow_device_record_batch(
&self,
buffer: CudaBuffer,
) -> Result<ArrowDeviceArrayOwned>
pub fn to_arrow_device_record_batch( &self, buffer: CudaBuffer, ) -> Result<ArrowDeviceArrayOwned>
Export CudaBuffer to Arrow C Data Interface (device-resident).
This is a zero-copy export: column buffers remain on device, and the returned ArrowDeviceArray describes CUDA-resident memory.
The export requires that the device row count matches the host row_cap
Sourcepub fn to_arrow_record_batch(&self, buffer: &CudaBuffer) -> Result<RecordBatch>
pub fn to_arrow_record_batch(&self, buffer: &CudaBuffer) -> Result<RecordBatch>
Export CudaBuffer to Arrow RecordBatch
Downloads data from GPU and converts it to an Arrow RecordBatch for interoperability with Arrow-based tools like cuDF, Polars, or DuckDB.
§Arguments
buffer- The CudaBuffer to export
§Returns
An Arrow RecordBatch containing all columns from the buffer
§Errors
Returns XlogError::Kernel if:
- Column download fails
- RecordBatch creation fails
Sourcepub fn from_arrow_record_batch(
&self,
record_batch: &RecordBatch,
) -> Result<CudaBuffer>
pub fn from_arrow_record_batch( &self, record_batch: &RecordBatch, ) -> Result<CudaBuffer>
Sourcepub fn to_arrow_ipc_stream(&self, buffer: &CudaBuffer) -> Result<Vec<u8>>
pub fn to_arrow_ipc_stream(&self, buffer: &CudaBuffer) -> Result<Vec<u8>>
Export a CudaBuffer to an Arrow IPC stream (RecordBatchStream) as bytes.
This is a convenience wrapper around to_arrow_record_batch that enables
interoperability with tools like cuDF via standard Arrow IPC readers.
Note: This is not zero-copy; data is downloaded from GPU to host memory.
Sourcepub fn from_arrow_ipc_stream(&self, ipc: &[u8]) -> Result<CudaBuffer>
pub fn from_arrow_ipc_stream(&self, ipc: &[u8]) -> Result<CudaBuffer>
Import a single-batch Arrow IPC stream (RecordBatchStream) into a CudaBuffer.
Note: This uploads Arrow data from host to GPU memory.
Sourcepub fn write_arrow_ipc_stream_file<P: AsRef<Path>>(
&self,
buffer: &CudaBuffer,
path: P,
) -> Result<()>
pub fn write_arrow_ipc_stream_file<P: AsRef<Path>>( &self, buffer: &CudaBuffer, path: P, ) -> Result<()>
Write a CudaBuffer to a file as an Arrow IPC stream (RecordBatchStream).
Sourcepub fn read_arrow_ipc_stream_file<P: AsRef<Path>>(
&self,
path: P,
) -> Result<CudaBuffer>
pub fn read_arrow_ipc_stream_file<P: AsRef<Path>>( &self, path: P, ) -> Result<CudaBuffer>
Read a CudaBuffer from a file containing an Arrow IPC stream (RecordBatchStream).
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn memset_recorded(
&self,
dst: &mut TrackedCudaSlice<u8>,
value: u8,
launch_stream: StreamId,
) -> Result<()>
pub fn memset_recorded( &self, dst: &mut TrackedCudaSlice<u8>, value: u8, launch_stream: StreamId, ) -> Result<()>
Async memset of value into every byte of dst on
launch_stream, then record the use against the
runtime.
Requires the provider’s GpuMemoryManager to be built
via crate::GpuMemoryManager::with_runtime (so
dst.runtime_block() is Some and the runtime is
reachable). On a legacy/no-runtime manager, returns
[XlogError::Kernel].
§Errors
XlogError::Kernel("memset_recorded requires runtime-backed manager")if the manager has no runtime attached.XlogError::KernelfromcuMemsetD8Async/stream-resolution failure.XlogError::Kernelwrapping anyResourceError::StreamMisusefrom the recorder’s commit (notably when the active resource isDirectCudaResource— the trait default that intentionally rejectsrecord_block_use).
Sourcepub fn memset_column_recorded(
&self,
dst: &mut CudaColumn,
value: u8,
launch_stream: StreamId,
) -> Result<()>
pub fn memset_column_recorded( &self, dst: &mut CudaColumn, value: u8, launch_stream: StreamId, ) -> Result<()>
Column-level variant of Self::memset_recorded —
exercises the LaunchRecorder::write_column path. Used
by tests that prove CudaColumn::Owned records its
runtime block automatically; strict mode rejects
CudaColumn::Dlpack / CudaColumn::ArrowDevice at
preflight (no CUDA work queued).
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn sample_bernoulli_matrix(
&self,
probs: &[f32],
num_samples: usize,
seed: u64,
force_mask: &CudaView<'_, u8>,
forced_value: &CudaView<'_, u8>,
) -> Result<Vec<u8>>
pub fn sample_bernoulli_matrix( &self, probs: &[f32], num_samples: usize, seed: u64, force_mask: &CudaView<'_, u8>, forced_value: &CudaView<'_, u8>, ) -> Result<Vec<u8>>
Sample independent Bernoulli variables on the GPU.
Returns a row-major (sample, var) matrix as a flat Vec<u8> of length
num_samples * probs.len(), where each entry is 0/1.
Sourcepub fn sample_bernoulli_matrix_device(
&self,
probs: &[f32],
num_samples: usize,
seed: u64,
force_mask: &CudaView<'_, u8>,
forced_value: &CudaView<'_, u8>,
) -> Result<TrackedCudaSlice<u8>>
pub fn sample_bernoulli_matrix_device( &self, probs: &[f32], num_samples: usize, seed: u64, force_mask: &CudaView<'_, u8>, forced_value: &CudaView<'_, u8>, ) -> Result<TrackedCudaSlice<u8>>
Sample Bernoulli matrix on GPU and return device-resident output.
Returns a row-major [num_samples][num_vars] matrix of 0/1 bytes on device.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn hash_join(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
) -> Result<CudaBuffer>
pub fn hash_join( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], ) -> Result<CudaBuffer>
Perform a hash join between two buffers
Uses a two-phase hash join:
- Build phase: Insert keys from
rightinto a hash table - Probe phase: Match keys from
leftagainst the hash table
§Arguments
left- The left (probe) bufferright- The right (build) bufferleft_keys- Column indices for join keys in left bufferright_keys- Column indices for join keys in right buffer
§Returns
A buffer containing the joined rows with columns from both inputs
§Errors
Returns XlogError::Kernel if kernel execution fails
Sourcepub fn hash_join_with_limit(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
max_output: Option<usize>,
) -> Result<CudaBuffer>
pub fn hash_join_with_limit( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, ) -> Result<CudaBuffer>
Hash join with configurable maximum output size
Uses a two-phase hash join:
- Build phase: Insert keys from
rightinto a hash table - Probe phase: Match keys from
leftagainst the hash table
§Arguments
left- The left (probe) bufferright- The right (build) bufferleft_keys- Column indices for join keys in left bufferright_keys- Column indices for join keys in right buffermax_output- Maximum number of output rows (defaults to DEFAULT_JOIN_MAX_OUTPUT)
§Returns
A buffer containing the joined rows with columns from both inputs
§Errors
Returns XlogError::Kernel if kernel execution fails
Sourcepub fn dedup(
&self,
input: &CudaBuffer,
key_cols: &[usize],
) -> Result<CudaBuffer>
pub fn dedup( &self, input: &CudaBuffer, key_cols: &[usize], ) -> Result<CudaBuffer>
Remove duplicate rows based on key columns
Sorts the input by the provided key columns, then removes adjacent duplicates.
§Arguments
input- The input bufferkey_cols- Column indices to use for duplicate detection
§Returns
A buffer containing one row per duplicate-equivalence class
§Errors
Returns XlogError::Kernel if kernel execution fails
Sourcepub fn dedup_sorted(
&self,
input: &CudaBuffer,
key_cols: &[usize],
) -> Result<CudaBuffer>
pub fn dedup_sorted( &self, input: &CudaBuffer, key_cols: &[usize], ) -> Result<CudaBuffer>
Remove duplicate rows from a buffer that is already sorted by key columns
This is an optimized version of dedup that skips the sorting step.
The caller must ensure the input is already sorted by the key columns.
§Arguments
input- The input buffer (must be sorted by key columns)key_cols- Column indices to use for duplicate detection
§Returns
A buffer containing one row per duplicate-equivalence class
Sourcepub fn union(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn union(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Sourcepub fn diff(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn diff(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Compute set difference (a - b)
Returns rows from a that don’t exist in b.
Uses hash-based approach: build hash table from b, probe with a.
§Arguments
a- Source bufferb- Buffer to subtract
§Returns
A buffer containing rows in a but not in b
§Errors
Returns XlogError::Kernel if schemas don’t match or operation fails
Sourcepub fn union_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn union_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
GPU-native union (no host roundtrip)
Computes the union of two buffers entirely on the GPU using:
- Concatenate arrays using concat_u32 kernel
- Sort the concatenated result
- Deduplicate using existing dedup()
§Arguments
a- First bufferb- Second buffer
§Returns
A buffer containing deduplicated union of both inputs, sorted
§Errors
Returns XlogError::Kernel if schemas don’t match or operation fails
Sourcepub fn diff_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
pub fn diff_gpu(&self, a: &CudaBuffer, b: &CudaBuffer) -> Result<CudaBuffer>
Set difference (a - b) with deterministic set semantics.
Single-column u32 buffers use a GPU sorted-diff fast path. General
multi-column buffers use a byte-exact host set fallback after GPU dedup;
the hash anti-join implementation is intentionally not used for Datalog
delta subtraction because its unordered parallel probe path can leak
nondeterminism into recursive fixed-point convergence.
§Arguments
a- Source bufferb- Buffer to subtract
§Returns
A buffer containing elements in a but not in b, sorted and deduped
§Errors
Returns XlogError::Kernel if schemas don’t match or operation fails
Sourcepub fn dedup_full_row(&self, input: &CudaBuffer) -> Result<CudaBuffer>
pub fn dedup_full_row(&self, input: &CudaBuffer) -> Result<CudaBuffer>
Public deterministic full-row dedup with totalOrder-bytewise equality semantics for all arities (including single-column float buffers).
Differs from dedup(input, &[0]) for single-column float
columns: the legacy single-column GPU kernel collapses +0/-0
(IEEE == says they’re equal) and treats two NaNs with
different payloads as distinct. dedup_full_row instead uses
totalOrder-bijective bytewise equality, so:
+0.0and-0.0are distinct.- Two NaNs collapse iff bit-identical.
Routing today:
dedup(input, &all_cols)witharity > 1routes to the full-row pipeline (same semantics as this method).dedup(input, &[0])witharity == 1keeps the legacy single-column GPU kernel — IEEE==for floats, so +0/-0 collapse and NaNs collapse iff bit-identical-or-IEEE-eq.dedup_full_row(input)always uses bytewise totalOrder equality for all arities, so single-column float callers must use this method explicitly to get the totalOrder semantics.
Multi-column callers that pass the all-columns key vector to
dedup already route through the same deterministic full-row
pipeline; single-column callers that want totalOrder semantics
must call dedup_full_row directly.
Sourcepub fn diff_full_row(
&self,
a: &CudaBuffer,
b: &CudaBuffer,
) -> Result<CudaBuffer>
pub fn diff_full_row( &self, a: &CudaBuffer, b: &CudaBuffer, ) -> Result<CudaBuffer>
Public deterministic full-row set difference. Equivalent to
diff_gpu(a, b) for the multi-column path but named explicitly so
callers cannot mistake it for the older first-column-key diff.
a and b must have type-compatible schemas.
Sourcepub fn sort(&self, input: &CudaBuffer, key_cols: &[usize]) -> Result<CudaBuffer>
pub fn sort(&self, input: &CudaBuffer, key_cols: &[usize]) -> Result<CudaBuffer>
Sort buffer by key columns.
Computes a stable row permutation on the GPU (supports multi-column and all scalar types), then applies the permutation on the GPU to reorder all columns.
§Arguments
input- The input buffer to sortkey_cols- Column indices to use for sorting (lexicographic, first key is most significant)
§Returns
A new buffer with rows sorted by the key columns
§Errors
Returns XlogError::Kernel if:
key_colsis empty or out of bounds- Input has more than
u32::MAXrows - Download/upload or kernel execution fails
Sourcepub fn init_indices(
&self,
indices: &mut TrackedCudaSlice<u32>,
n: u32,
) -> Result<()>
pub fn init_indices( &self, indices: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>
Initialize indices array with 0..n-1 on device.
Sourcepub fn gather_u32_by_indices(
&self,
input: &TrackedCudaSlice<u32>,
indices: &TrackedCudaSlice<u32>,
output: &mut TrackedCudaSlice<u32>,
n: u32,
) -> Result<()>
pub fn gather_u32_by_indices( &self, input: &TrackedCudaSlice<u32>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>
Gather u32 keys by permutation: out[i] = input[indices[i]].
Sourcepub fn gather_u8_by_indices(
&self,
input: &TrackedCudaSlice<u8>,
indices: &TrackedCudaSlice<u32>,
output: &mut TrackedCudaSlice<u8>,
n: u32,
) -> Result<()>
pub fn gather_u8_by_indices( &self, input: &TrackedCudaSlice<u8>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u8>, n: u32, ) -> Result<()>
Gather u8 values by permutation: out[i] = input[indices[i]].
Sourcepub fn gather_u64_lo_by_indices(
&self,
input: &TrackedCudaSlice<u64>,
indices: &TrackedCudaSlice<u32>,
output: &mut TrackedCudaSlice<u32>,
n: u32,
) -> Result<()>
pub fn gather_u64_lo_by_indices( &self, input: &TrackedCudaSlice<u64>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>
Gather low 32 bits of u64 values by permutation.
Sourcepub fn gather_u64_hi_by_indices(
&self,
input: &TrackedCudaSlice<u64>,
indices: &TrackedCudaSlice<u32>,
output: &mut TrackedCudaSlice<u32>,
n: u32,
) -> Result<()>
pub fn gather_u64_hi_by_indices( &self, input: &TrackedCudaSlice<u64>, indices: &TrackedCudaSlice<u32>, output: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>
Gather high 32 bits of u64 values by permutation.
Sourcepub fn radix_sort_u32_pairs(
&self,
keys: &mut TrackedCudaSlice<u32>,
values: &mut TrackedCudaSlice<u32>,
n: u32,
scratch: &mut RadixSortScratch,
) -> Result<()>
pub fn radix_sort_u32_pairs( &self, keys: &mut TrackedCudaSlice<u32>, values: &mut TrackedCudaSlice<u32>, n: u32, scratch: &mut RadixSortScratch, ) -> Result<()>
Stable radix sort of (key, value) u32 pairs using reusable scratch.
Sourcepub fn scan_u8_mask_device(
&self,
mask: &TrackedCudaSlice<u8>,
n: u32,
) -> Result<TrackedCudaSlice<u32>>
pub fn scan_u8_mask_device( &self, mask: &TrackedCudaSlice<u8>, n: u32, ) -> Result<TrackedCudaSlice<u32>>
Compute exclusive prefix sum of u8 mask on device (no host reads).
Sourcepub fn count_mask_device(
&self,
mask: &TrackedCudaSlice<u8>,
n: u32,
) -> Result<TrackedCudaSlice<u32>>
pub fn count_mask_device( &self, mask: &TrackedCudaSlice<u8>, n: u32, ) -> Result<TrackedCudaSlice<u32>>
Count non-zero entries in a u8 mask on device (no host reads).
Returns a 1-element device buffer containing the count.
Sourcepub fn count_mask_into_slot(
&self,
mask: &TrackedCudaSlice<u8>,
n: u32,
task_counts: &mut TrackedCudaSlice<u32>,
slot_idx: usize,
) -> Result<()>
pub fn count_mask_into_slot( &self, mask: &TrackedCudaSlice<u8>, n: u32, task_counts: &mut TrackedCudaSlice<u32>, slot_idx: usize, ) -> Result<()>
Count 1-bits in mask[0..n] and write the result into
task_counts[slot_idx] via the existing count_mask kernel.
The caller MUST ensure task_counts[slot_idx] is zero before
calling (e.g. by zeroing the whole array once).
This avoids allocating a fresh 1-element device buffer per call, which matters when iterating over hundreds of tasks.
Sourcepub fn hash_join_v2(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
join_type: JoinType,
) -> Result<CudaBuffer>
pub fn hash_join_v2( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, ) -> Result<CudaBuffer>
Multi-column hash join with support for different join types.
§Arguments
left- The left (probe) bufferright- The right (build) bufferleft_keys- Column indices for join keys in left bufferright_keys- Column indices for join keys in right bufferjoin_type- Type of join to perform (Inner, Semi, Anti, LeftOuter)
§Errors
Returns XlogError::Kernel if kernel execution fails or parameters are invalid
Sourcepub fn hash_join_v2_with_limit(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
join_type: JoinType,
max_output: Option<usize>,
) -> Result<CudaBuffer>
pub fn hash_join_v2_with_limit( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, max_output: Option<usize>, ) -> Result<CudaBuffer>
V2 hash join with configurable maximum output size
Multi-column join with typed key comparison, supporting different join types. Uses composite hashing (FNV-1a) for multi-column keys with full key verification.
§Arguments
left- The left (probe) bufferright- The right (build) bufferleft_keys- Column indices for join keys in left bufferright_keys- Column indices for join keys in right bufferjoin_type- Type of join to perform (Inner, Semi, Anti, LeftOuter)max_output- Optional maximum number of output rows (None = unlimited, subject to memory budget)
§Errors
Returns XlogError::Kernel if kernel execution fails or parameters are invalid
Sourcepub fn nested_loop_join_v2_inner_u32_1key(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_key: usize,
right_key: usize,
) -> Result<CudaBuffer>
pub fn nested_loop_join_v2_inner_u32_1key( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, ) -> Result<CudaBuffer>
Nested-loop inner join (emit-pairs design).
Drop-in compatible with hash_join_v2(_, _, &[left_key], &[right_key], JoinType::Inner): same input types, same
output schema (combine_schemas(left, right)), same row
set. Caller (the executor’s dispatch site) is
responsible for choosing between hash_join_v2 and this
fn based on the eligibility predicate + threshold check;
this fn validates the same contract fail-closed and
returns Err if a caller violates it.
§Eligibility (validated inside; Err on violation)
left.arity() > left_key && right.arity() > right_key.- Left and right key columns share the same
ScalarType, and that shared type isU32orSymbol(Symbol isu32at the byte level — same kernel applies). - Each key column’s allocation is at least
num_rows * 4bytes (preflight lower-bound validation; mirrors thecrates/xlog-cuda/src/provider/ilp.rs:18codebase idiomcol.num_bytes() < required_bytes).CudaColumn::num_bytes()reports the allocation size, which can exceednum_rows * 4when the buffer has spare capacity (row_cap > num_rows); strict-equality validation would false-positive-reject normal over-allocated buffers reaching this path throughExecutor::execute_node. num_left * num_right <= NESTED_LOOP_TOTAL_THRESHOLD(computed viachecked_mul; release-mode wrapping multiply is forbidden).
§Implementation outline
- Read logical row counts via
device_row_count(NOTrow_cap). - Empty-input fast path: if either side is empty, return
create_empty_buffer(combine_schemas(...))— mirrorshash_join_inner_v2’s pattern atcrates/xlog-cuda/src/provider/relational.rs:3165-3170. - Validate eligibility (above).
- Allocate two
u32index arrays of lengthnum_left * num_right(bounded at 32 MB total under the threshold). - Launch
nested_loop_join_inner_u32_1key_pairswith&CudaColumnkey pointers (variant-agnostic). - D2H the output count.
- Materialize via
gather_buffer_by_indicesfor both sides + concatenate columns.
Sourcepub fn is_sorted_ascending_u32(
&self,
buf: &CudaBuffer,
key_col: usize,
) -> Result<bool>
pub fn is_sorted_ascending_u32( &self, buf: &CudaBuffer, key_col: usize, ) -> Result<bool>
Sort-merge sortedness-detection wrapper. Returns Ok(true) iff
the column at key_col of buf is sorted ascending
(keys[i] <= keys[i+1] for all i in [0, num_rows-1)),
Ok(false) if a violation is detected, Err(_) on
kernel-launch / D2H failure.
Empty / single-row fast path: n < 2 returns Ok(true) BEFORE allocation
or kernel launch. The detection kernel’s grid (n + 255) / 256 is undefined for n == 0; single-row sequences
are trivially sorted. This is the load-bearing
invariant the empty-input sortedness checks verify.
Validation:
- Key column index within arity bounds.
- Key column type is
U32orSymbol(byte-identical at the kernel level). - Key column allocation
>= num_rows * 4bytes (mirrors the nested-loop byte-length lower-bound idiom).
Caller surface: this fn has no executor-dispatch caller after
benchmark-backed unwiring. Its only callers are operator-level tests
and the production sort-merge benchmark
(sort-merge-with-detection timing). The provider returns the honest Result<bool>
— the kernel can fail (allocation, launch, D2H), and
Err(_) is preserved so callers can log or surface it
at their abstraction level. There is no fail-closed
dispatch contract anymore. Earlier fail-closed callers used
matches!(_, Ok(true)); after the dispatch site was unwired,
any later caller must decide its own Err-handling policy.
Sourcepub fn sort_merge_join_v2_inner_u32_1key(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_key: usize,
right_key: usize,
) -> Result<CudaBuffer>
pub fn sort_merge_join_v2_inner_u32_1key( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, ) -> Result<CudaBuffer>
Sort-merge inner join (caller-asserted pre-sorted
inputs). Drop-in compatible with hash_join_v2(_, _, &[left_key], &[right_key], JoinType::Inner): same
input types, same output schema
(combine_schemas(left, right)), same row set.
Caller surface: this fn has no executor-dispatch caller after
benchmark-backed unwiring. Production benchmark evidence rejected
default executor precedence for sort-merge at execute_join; this fn
remains graduated operator work for direct provider callers and tests.
Current callers: operator-level provider parity tests in
crates/xlog-integration/tests/test_w43_sort_merge_dispatch.rs
and the production sort-merge benchmark at
crates/xlog-integration/benches/sort_merge_production_bench.rs
(sort-merge-with-detection Path 1 timing).
Caller contract: both inputs are pre-sorted ascending
by their respective key column. The kernel does NOT
detect or enforce sortedness; callers may pre-check via
is_sorted_ascending_u32. On unsorted inputs the row-set
output is undefined; the dispatch-site fallback path no longer exists.
§Eligibility (validated inside; Err on violation)
left.arity() > left_key && right.arity() > right_key.- Left and right key columns share the same
ScalarType, and that shared type isU32orSymbol. - Each key column’s allocation is at least
num_rows * 4bytes (lower-bound check, mirrors the nested-loop byte-length guard). num_left * num_right <= NESTED_LOOP_TOTAL_THRESHOLD(shared with the nested-loop operator; computed viachecked_mul; release-mode wrapping multiply is forbidden).
§Implementation outline
Mirrors nested_loop_join_v2_inner_u32_1key implementation idioms:
empty fast path with no ?, byte-length lower-bound < check,
checked_mul for threshold, as u64 for row_cap, and
variant-agnostic &CudaColumn launch.
- Read logical row counts via
device_row_count(NOTrow_cap). - Empty-input fast path: if either side is empty,
return
create_empty_buffer(combine_schemas(...))— mirrorshash_join_inner_v2atrelational.rs:3165-3170ANDnested_loop_join_v2_inner_u32_1key’s identical pattern. - Validate eligibility (above).
- Allocate two
u32index arrays of lengthnum_left * num_right(bounded at 32 MB total under the shared threshold). - Launch
sort_merge_join_inner_u32_1key_pairswith&CudaColumnkey pointers (variant-agnostic). - D2H the output count.
- Materialize via
gather_buffer_by_indicesfor both sides + concatenate columns.
Sourcepub fn sort_merge_join_v2_inner_u32_1key_bounded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_key: usize,
right_key: usize,
output_capacity: usize,
) -> Result<CudaBuffer>
pub fn sort_merge_join_v2_inner_u32_1key_bounded( &self, left: &CudaBuffer, right: &CudaBuffer, left_key: usize, right_key: usize, output_capacity: usize, ) -> Result<CudaBuffer>
Sorted-chain variant of Self::sort_merge_join_v2_inner_u32_1key.
The sort-merge operator is product-thresholded because it allocates
|left| * |right| candidate pairs. Chain routing uses this bounded
variant only for sorted large inputs where the expected
fanout is one-to-one; capacity is caller supplied and the kernel’s
logical output counter is checked after launch. If duplicates make
the true output exceed output_capacity, this returns an error so
the caller can fail closed to the hash fallback.
Sourcepub fn build_join_index_v2(
&self,
right: &CudaBuffer,
right_keys: &[usize],
) -> Result<JoinIndexV2>
pub fn build_join_index_v2( &self, right: &CudaBuffer, right_keys: &[usize], ) -> Result<JoinIndexV2>
Build a cached join index for the right/build side of v2 hash join.
Sourcepub fn build_join_index_v2_background(
&self,
right: &CudaBuffer,
right_keys: &[usize],
) -> Result<JoinIndexV2>
pub fn build_join_index_v2_background( &self, right: &CudaBuffer, right_keys: &[usize], ) -> Result<JoinIndexV2>
Build a cached join index for background persistent-index mode.
When recorded hash joins are enabled and the provider has a runtime-backed manager, the build is enqueued on the provider’s recorded operation stream and dependency-recorded like the indexed join consumer path. Otherwise this falls back to the legacy synchronous builder.
Sourcepub fn build_join_index_v2_recorded(
&self,
right: &CudaBuffer,
right_keys: &[usize],
launch_stream: StreamId,
) -> Result<JoinIndexV2>
pub fn build_join_index_v2_recorded( &self, right: &CudaBuffer, right_keys: &[usize], launch_stream: StreamId, ) -> Result<JoinIndexV2>
Recorded-stream join-index builder used by persistent background builds.
The build side is packed and bucketized on launch_stream; the returned
JoinIndexV2 carries runtime-tracked buffers whose writes were committed
through the launch recorder / stream dependency machinery.
Sourcepub fn hash_join_v2_with_index(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
join_type: JoinType,
index: &JoinIndexV2,
max_output: Option<usize>,
) -> Result<CudaBuffer>
pub fn hash_join_v2_with_index( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, index: &JoinIndexV2, max_output: Option<usize>, ) -> Result<CudaBuffer>
Hash join using a cached build-side join index.
The index must have been built for the same right buffer and right_keys.
Sourcepub fn build_hash_table_u64(
&self,
hashes: &TrackedCudaSlice<u64>,
num_rows: u32,
) -> Result<HashTableU64>
pub fn build_hash_table_u64( &self, hashes: &TrackedCudaSlice<u64>, num_rows: u32, ) -> Result<HashTableU64>
Build a bucketed hash table from a u64 hash array.
Sourcepub fn membership_mask_device(
&self,
probe: &CudaBuffer,
build: &CudaBuffer,
probe_keys: &[usize],
build_keys: &[usize],
) -> Result<TrackedCudaSlice<u8>>
pub fn membership_mask_device( &self, probe: &CudaBuffer, build: &CudaBuffer, probe_keys: &[usize], build_keys: &[usize], ) -> Result<TrackedCudaSlice<u8>>
Compute a per-row membership mask on device: for each row in probe,
check whether a matching row exists in build (by the specified key
columns). Returns a TrackedCudaSlice<u8> of length = probe row count
that stays GPU-resident (no D2H transfer).
Sourcepub fn membership_mask(
&self,
probe: &CudaBuffer,
build: &CudaBuffer,
probe_keys: &[usize],
build_keys: &[usize],
) -> Result<Vec<bool>>
pub fn membership_mask( &self, probe: &CudaBuffer, build: &CudaBuffer, probe_keys: &[usize], build_keys: &[usize], ) -> Result<Vec<bool>>
Compute a per-row membership mask: for each row in probe, check whether
a matching row exists in build (by the specified key columns).
Returns a Vec<bool> of length = probe row count.
This downloads only num_probe bytes (the mask), NOT column data.
Sourcepub fn clone_buffer(&self, buffer: &CudaBuffer) -> Result<CudaBuffer>
pub fn clone_buffer(&self, buffer: &CudaBuffer) -> Result<CudaBuffer>
Clone a buffer (deep copy) on-device.
This is primarily used when a caller needs owned buffer state for a separate runtime object while preserving the original relation store.
Sourcepub fn extract_column(
&self,
buffer: &CudaBuffer,
col_idx: usize,
) -> Result<CudaBuffer>
pub fn extract_column( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<CudaBuffer>
Sourcepub fn extract_active_rule_indices(
&self,
mask_hard: &CudaBuffer,
mask_soft: &CudaBuffer,
n: usize,
max_active: usize,
) -> Result<Vec<(u32, u32, u32)>>
pub fn extract_active_rule_indices( &self, mask_hard: &CudaBuffer, mask_soft: &CudaBuffer, n: usize, max_active: usize, ) -> Result<Vec<(u32, u32, u32)>>
Extract active (i,j,k) rule indices from a flattened N×N×N mask.
Returns up to max_active entries sorted by soft-mask priority.
Sourcepub fn sort_recorded(
&self,
input: &CudaBuffer,
key_cols: &[usize],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn sort_recorded( &self, input: &CudaBuffer, key_cols: &[usize], launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder variant of Self::sort — narrow to
u32 / Symbol key columns. The whole sort chain
(init → LSD radix passes → multi-column gather) runs on
the caller-supplied launch_stream; every input column
and the input row-count buffer are recorded as reads
before preflight; every fresh runtime-backed allocation
(scratch + output columns + output d_num_rows) is
recorded via write BEFORE preflight (snapshot drops the borrow so kernel &mut borrows after preflight remain valid)
enqueue.
§Errors
- Manager not runtime-backed.
launch_streamdoes not resolve.- Empty
key_colsor out-of-bounds index. - Any key column type other than
U32/Symbol(multi-type recorded sort is outside this API surface). - Preflight / kernel / commit failures.
Sourcepub fn dedup_full_row_recorded(
&self,
input: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn dedup_full_row_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder variant of Self::dedup_full_row —
narrow to U32 / Symbol / U64 columns.
Composes Self::sort_recorded (typed multi-column
sort) → on-stream mark_unique_full_row_bytewise →
Self::compact_buffer_by_device_mask_counted_recorded
(gather kept rows). All three primitives commit
independently; the runtime’s record-all + wait-all
last_use_events: Vec<CudaEvent> semantics chain the
deallocate safety end-to-end.
Sourcepub fn hash_join_inner_v2_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_inner_v2_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder variant of hash_join_inner_v2.
JoinType::Inner only. Same count-then-materialize
algorithm as the legacy variant, but every kernel
runs on the caller-supplied launch_stream and host
scalar reads of the join output count are explicitly
ordered against the stream.
Sourcepub fn hash_join_inner_v2_count_scan_materialize_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_inner_v2_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder, deterministic-ordering Inner hash join using the deterministic binary-join path.
Algorithm: count → exclusive scan → device-resident
total → host scalar read → materialize with
per-probe-row offsets. Each probe row writes its
local-th match to
output[per_probe_offsets[tid] + local] directly —
no global atomicAdd(output_count) on the
materialize pass, so the output ordering is a
deterministic function of (probe-row index,
per-row match discovery order). Compare to
Self::hash_join_inner_v2_recorded which uses the
legacy count-then-atomic-materialize chain (correct
but with atomic-induced order non-determinism across
threads/blocks).
Sourced from the archived archive/gpu-resident-binary-join-prototype-*
branches — three new kernels migrated:
hash_join_probe_v2_count_per_row,
hash_join_probe_v2_materialize,
hash_join_total_from_scan. LeftOuter / Semi / Anti
/ indexed variants from the prototype are
intentionally not migrated here.
Reuses the recorded helpers pack_keys_gpu_on_stream,
build_hash_table_v2_on_stream,
multiblock_scan_u32_inplace_on_stream, and
gather_buffer_by_indices_on_stream. Inherits the compact / pack
fixes via composition.
Sourcepub fn hash_join_left_outer_v2_count_scan_materialize_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_left_outer_v2_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Non-indexed LeftOuter CSM using the deterministic binary-join path.
Deterministic count → scan → materialize chain producing
MATCHED (left_idx, right_idx) pairs first (Inner CSM
machinery), then a per-probe-row unmatched mask
(hash_join_csm_unmatched_mask) compacted via the
recorded compact tail to produce unmatched_left. The
final result is inner_left | unmatched_left per left
column and inner_right | zeros per right column —
matching the legacy hash_join_left_outer_v2_recorded
row-ordering invariant downstream consumers depend on.
This path does not adopt the archived prototype’s
hash_join_left_outer_count_per_row /
hash_join_left_outer_materialize design — those
kernels interleave matched and null-sentinel rows by
probe-row index, which would change the legacy
LeftOuter ordering downstream consumers depend on.
§Errors
- Manager not runtime-backed.
launch_streamdoes not resolve.left_keys/right_keysempty, mismatched length, or > 4 (pack_keys constraint).- Key column type mismatch.
- Preflight / kernel / commit failures.
Sourcepub fn hash_join_inner_v2_with_index_count_scan_materialize_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
index: &JoinIndexV2,
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_inner_v2_with_index_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Indexed-Inner CSM using the deterministic binary-join path.
Same deterministic count→scan→materialize algorithm as
Self::hash_join_inner_v2_count_scan_materialize_recorded
but skips pack-right + table-build — the cached
crate::provider::JoinIndexV2 supplies
index.packed_keys and &index.table. Only the probe
(left) side is packed on launch_stream.
Reuses the three CSM kernels from the non-indexed inner path
(hash_join_probe_v2_count_per_row,
hash_join_probe_v2_materialize,
hash_join_total_from_scan) — no new kernel additions.
Composes pack_keys_gpu_on_stream,
multiblock_scan_u32_inplace_on_stream, and
gather_buffer_by_indices_on_stream unchanged from
recorded helper paths.
Index buffers (packed_keys + 4 table buckets) are
owned by the caller and recorded as reads on
launch_stream for the count and materialize
recorders — dropping the index after the call returns
is correctly serialized through the runtime’s
record-all + wait-all event chain.
Sourcepub fn hash_join_left_outer_v2_with_index_count_scan_materialize_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
index: &JoinIndexV2,
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_left_outer_v2_with_index_count_scan_materialize_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Indexed LeftOuter CSM using the indexed deterministic binary-join path.
Combines the indexed-Inner CSM Phases A+B (probe-only
pack on launch_stream; cached
crate::provider::JoinIndexV2 supplies the build
side’s packed_keys and &index.table) with the
non-indexed LeftOuter CSM Phases C–E (per-probe
unmatched-mask via hash_join_csm_unmatched_mask →
recorded compact tail → gather matched left + right →
per-column inner | unmatched / inner | zeros
concat). Same row-ordering invariant as
Self::hash_join_left_outer_v2_count_scan_materialize_recorded:
matched rows first, unmatched-with-zero-right second.
No new kernels — reuses the four already-migrated CSM
kernels plus hash_join_csm_unmatched_mask from
the non-indexed LeftOuter CSM path.
§Errors
- Manager not runtime-backed.
launch_streamdoes not resolve.left_keys/right_keysempty, mismatched length, or > 4 (pack_keys constraint).- Key column type mismatch.
index.right_num_rows()mismatches the right buffer’s logical row count.index.right_keys()mismatches the requestedright_keys.left_packed.key_bytesmismatchesindex.key_bytes.- Preflight / kernel / commit failures.
Sourcepub fn hash_join_v2_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
join_type: JoinType,
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_v2_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder, launch_stream-routed variant of
hash_join_v2. Covers all four join types
(Inner / Semi / Anti / LeftOuter) via dedicated
per-type recorded methods.
When Self::use_recorded_csm_env is on, Inner and
LeftOuter route through the CSM (count-scan-materialize)
methods; otherwise they route through the legacy recorded
methods. Semi / Anti always route through their
existing recorded methods — no CSM implementation exists
for them. All eligibility checks (runtime-backed manager,
≤4 keys, key-type match, row-count caps) are validated
upstream by the public hash_join_v2_with_limit and inside
each per-type method.
Sourcepub fn hash_join_v2_with_index_recorded(
&self,
left: &CudaBuffer,
right: &CudaBuffer,
left_keys: &[usize],
right_keys: &[usize],
join_type: JoinType,
index: &JoinIndexV2,
max_output: Option<usize>,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn hash_join_v2_with_index_recorded( &self, left: &CudaBuffer, right: &CudaBuffer, left_keys: &[usize], right_keys: &[usize], join_type: JoinType, index: &JoinIndexV2, max_output: Option<usize>, launch_stream: StreamId, ) -> Result<CudaBuffer>
Strict-recorder, launch_stream-routed variant of
hash_join_v2_with_index. Supports all four join
types — the indexed variants share the same
(packed_keys, table) shape, so a single recorded
surface covers them.
When Self::use_recorded_csm_env is on, Inner and
LeftOuter route through the indexed CSM
(count-scan-materialize) methods; otherwise they route
through the legacy indexed recorded methods. Semi /
Anti always route through their existing indexed
recorded methods — no CSM implementation exists for them.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn download_column<T: GpuScalar>(
&self,
buffer: &CudaBuffer,
col_idx: usize,
) -> Result<Vec<T>>
pub fn download_column<T: GpuScalar>( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<Vec<T>>
Download a single column from GPU to host as Vec<T>.
Replaces: download_column_u32, download_column_u64,
download_column_i32, download_column_i64, download_column_f32,
download_column_f64, download_column_u8, download_column_bool.
Increments d2h_transfer_count (the per-call ILP-style counter)
and is checked by the strict deterministic-Datalog D2H guard when
it is enabled (see enable_strict_deterministic_d2h).
Sourcepub fn download_column_untracked<T: GpuScalar>(
&self,
buffer: &CudaBuffer,
col_idx: usize,
) -> Result<Vec<T>>
pub fn download_column_untracked<T: GpuScalar>( &self, buffer: &CudaBuffer, col_idx: usize, ) -> Result<Vec<T>>
Download a column WITHOUT incrementing the per-call
d2h_transfer_count (the ILP-style counter). Records in
transfer_tracker for byte/call profiling stats.
IS still checked by the strict deterministic-Datalog D2H guard when
it is enabled — “untracked” only refers to d2h_transfer_count,
not to the deterministic gate. Use dtoh_scalar_untracked for
metadata reads that must remain allowed under the gate.
Replaces: download_f64_untracked (now generic over T).
Sourcepub fn create_buffer_from_slice<T: GpuScalar>(
&self,
data: &[T],
schema: Schema,
) -> Result<CudaBuffer>
pub fn create_buffer_from_slice<T: GpuScalar>( &self, data: &[T], schema: Schema, ) -> Result<CudaBuffer>
Upload a typed slice as a single-column GPU buffer.
Replaces: create_buffer_from_u32_slice, create_buffer_from_u64_slice,
create_buffer_from_i32_slice, create_buffer_from_i64_slice,
create_buffer_from_f32_slice, create_buffer_from_f64_slice,
create_buffer_from_u8_slice.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn wcoj_layout_u32_recorded(
&self,
input: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_layout_u32_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Build the sorted+deduped WCOJ physical layout for a 2-column u32 relation.
Output: a 2-column u32 CudaBuffer sorted lexicographically
by (col0, col1) and deduplicated. The output is suitable for
direct consumption by Self::wcoj_triangle_u32_recorded in
any of the three slot positions (e_xy, e_yz, e_xz); the
caller chooses which logical relation each input represents
by the slot it passes the layout into.
Fast-path: if the input is already strictly lex-sorted and
full-row unique, a recorded checker proves that property and
the method returns a recorded device-side clone. Otherwise it
falls back to Self::dedup_full_row_recorded, which invokes
Self::sort_recorded (typed multi-column radix sort on
(col0, col1)) followed by an on-stream
mark_unique_full_row_bytewise mask + counted compaction.
Both paths are launch-recorder disciplined and preserve the
sorted+deduped output contract.
This entry exists for two reasons:
- Narrowing the input contract to 2-column u32 lets the WCOJ-specific call site fail fast with a clear error rather than the more generic dedup error if the caller passes the wrong arity / type.
- Naming the WCOJ pipeline boundary makes downstream callers (planner / executor wiring, cert harness) target the WCOJ-specific layout API rather than the general-purpose dedup primitive — separating concerns that may diverge as the WCOJ stack grows.
§Errors
XlogError::Kernelif the manager has no runtime (with_runtimeis required), the input is not 2-column, any column is not [ScalarType::U32] or [ScalarType::Symbol] (both share the same 4-byte physical layout, so the underlying sort/dedup primitives handle either with no kernel changes), or any inner sort/dedup primitive fails.
Sourcepub fn wcoj_layout_sort_u32_recorded(
&self,
input: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_layout_sort_u32_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Generic full-row WCOJ layout sort+dedup for relations of any
arity ≥ 2 in the 4-byte width-class (U32, Symbol,
mixable within the class).
Design: this entry point leaves the existing arity-2
Self::wcoj_layout_u32_recorded is unchanged for
the triangle / 4-cycle / project-then-layout callers — it
retains its arity-2-specific fast-path branch. This generic
surface delegates straight to
Self::dedup_full_row_recorded for any arity ≥ 2.
Validation order (runtime → arity ≥ 2 → per-column width-class → delegate):
- Manager runtime-backed.
input.arity() >= 2.- Every column type ∈
{U32, Symbol}(4-byte width-class). MixedU32+Symbolwithin one relation is permitted;U64is rejected — useSelf::wcoj_layout_sort_u64_recordedinstead. - Delegate to
dedup_full_row_recorded(input, launch_stream).
Stream resolution is owned by dedup_full_row_recorded
and is NOT in this entry point’s validation list. The
n == 0 short-circuit (returns
create_empty_buffer(input.schema().clone())) is also
owned downstream — single source of truth, no duplicated
empty-buffer semantics.
Composition: dedup_full_row_recorded only — there
is no fast-path branch for arity ≥ 3 in this generic
full-row layout-sort accessor (the
existing arity-2 fast-path stays untouched and reachable
only via wcoj_layout_u32_recorded).
§Errors
XlogError::Kernelif the manager has no runtime (with_runtimeis required).XlogError::Kernelifinput.arity() < 2.XlogError::Kernelif any column is notU32/Symbol.- Whatever
dedup_full_row_recordedreturns for stream-resolution / kernel-launch failures.
Sourcepub fn wcoj_triangle_u32_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Evaluate tri(X, Y, Z) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z)
on already-sorted, already-deduped binary 4-byte-key
relations. See module-level docs for the full contract.
Each column may be [ScalarType::U32] or
[ScalarType::Symbol] — both share the same 4-byte
physical layout, so the kernel reads the bits unchanged.
Cross-relation type compatibility (e.g., that Y is the
same type in e_xy.col1 and e_yz.col0) is the
planner’s responsibility upstream; this entry only
enforces width.
The output schema preserves per-head-position scalar types from the inputs:
out.col0=e_xy.col0type (X)out.col1=e_xy.col1type (Y)out.col2=e_yz.col1type (Z)
§Errors
XlogError::Kernelif the manager has no runtime (with_runtimeis required), the launch stream does not resolve, an input is not 2-column with U32/Symbol columns, or any kernel launch fails.
Sourcepub fn wcoj_4cycle_u32_recorded(
&self,
e1: &CudaBuffer,
e2: &CudaBuffer,
e3: &CudaBuffer,
e4: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_4cycle_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Evaluate cyc4(W, X, Y, Z) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)
on already-sorted, already-deduped binary 4-byte-key
relations. Structural mirror of Self::wcoj_triangle_u32_recorded
for the 4-cycle case; see that entry’s contract and the
module-level docs for the shared two-phase recorder
discipline.
Each column may be [ScalarType::U32] or
[ScalarType::Symbol] — both share the same 4-byte
physical layout, so the kernel reads the bits unchanged.
Cross-relation type compatibility (e.g., that X is the
same type in e1.col1 and e2.col0) is the planner’s
responsibility upstream; this entry only enforces width.
The 4-cycle slot order is [e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)].
The output schema preserves per-head-position scalar types
from the inputs:
out.col0=e1.col0type (W)out.col1=e1.col1type (X)out.col2=e2.col1type (Y)out.col3=e3.col1type (Z)
§Errors
XlogError::Kernelif the manager has no runtime (with_runtimeis required), the launch stream does not resolve, an input is not 2-column with U32/Symbol columns, or any kernel launch fails.
Sourcepub fn wcoj_layout_u64_recorded(
&self,
input: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_layout_u64_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Build the sorted+deduped WCOJ physical layout for a 2-column
U64 relation. Output: a 2-column U64 CudaBuffer sorted
lexicographically by (col0, col1) and deduplicated. Suitable
for direct consumption by Self::wcoj_triangle_u64_recorded.
Composition mirrors Self::wcoj_layout_u32_recorded:
already sorted+unique inputs take the recorded fast-path clone;
other inputs fall back to Self::dedup_full_row_recorded,
whose U64 sort_recorded path ports the legacy sort() hi/lo
radix-pass strategy into recorded launch discipline.
§Errors
XlogError::Kernelif the manager has no runtime, the input is not 2-column, any column is not [ScalarType::U64], or any inner sort/dedup primitive fails.
Sourcepub fn wcoj_layout_sort_u64_recorded(
&self,
input: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_layout_sort_u64_recorded( &self, input: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Generic full-row WCOJ layout sort+dedup for relations of any
arity ≥ 2 in the 8-byte width-class (U64 only).
Design: this entry point leaves the existing arity-2
Self::wcoj_layout_u64_recorded is unchanged for
the existing 2-column callers — it retains its
arity-2-specific fast-path branch. This generic surface
delegates straight to Self::dedup_full_row_recorded
for any arity ≥ 2.
Mirrors Self::wcoj_layout_sort_u32_recorded’s contract
at the 8-byte width-class — see that entry’s doc for the
validation order, stream-resolution ownership, n==0
semantics, and “no fast-path for arity ≥ 3” lock.
Validation order (runtime → arity ≥ 2 → per-column width-class → delegate):
- Manager runtime-backed.
input.arity() >= 2.- Every column type =
U64.U32/Symbolare rejected — useSelf::wcoj_layout_sort_u32_recordedinstead. - Delegate to
dedup_full_row_recorded(input, launch_stream).
§Errors
XlogError::Kernelif the manager has no runtime (with_runtimeis required).XlogError::Kernelifinput.arity() < 2.XlogError::Kernelif any column is notU64.- Whatever
dedup_full_row_recordedreturns for stream-resolution / kernel-launch failures.
Sourcepub fn wcoj_triangle_u64_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Evaluate tri(X, Y, Z) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z)
on already-sorted, already-deduped binary U64 relations.
Mirrors Self::wcoj_triangle_u32_recorded’s contract;
the only differences are the 8-byte join-key reads/writes
and the U64-specific count/materialize kernels. Counters
and the total reducer remain u32.
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column with U64 columns, or any kernel launch fails.
Sourcepub fn wcoj_4cycle_u64_recorded(
&self,
e1: &CudaBuffer,
e2: &CudaBuffer,
e3: &CudaBuffer,
e4: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_4cycle_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Evaluate
cycle4(W, X, Y, Z) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)
on already-sorted, already-deduped binary U64 relations.
Mirrors Self::wcoj_4cycle_u32_recorded’s contract; the
only differences are the 8-byte join-key reads/writes and
the U64 HG planner/count/materialize kernels. Counters and
the total reducer remain u32 (bounded by the upstream host-
side row-count guard).
4-cycle slot order:
[e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W)].
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column with U64 columns, or any kernel launch fails.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn wcoj_clique5_metadata_recorded_u32(
&self,
edges: &[&CudaBuffer; 10],
leader_edge_idx: u32,
launch_stream: StreamId,
) -> Result<WcojRelationMetadata<u32>>
pub fn wcoj_clique5_metadata_recorded_u32( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>
Build leader-edge runtime metadata for a 5-clique 4-byte-width dispatch.
Sourcepub fn wcoj_clique5_metadata_recorded_u64(
&self,
edges: &[&CudaBuffer; 10],
leader_edge_idx: u32,
launch_stream: StreamId,
) -> Result<WcojRelationMetadata<u64>>
pub fn wcoj_clique5_metadata_recorded_u64( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>
Build leader-edge runtime metadata for a 5-clique 8-byte-width dispatch.
Sourcepub fn wcoj_clique6_metadata_recorded_u32(
&self,
edges: &[&CudaBuffer; 15],
leader_edge_idx: u32,
launch_stream: StreamId,
) -> Result<WcojRelationMetadata<u32>>
pub fn wcoj_clique6_metadata_recorded_u32( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>
Build leader-edge runtime metadata for a 6-clique 4-byte-width dispatch.
Sourcepub fn wcoj_clique6_metadata_recorded_u64(
&self,
edges: &[&CudaBuffer; 15],
leader_edge_idx: u32,
launch_stream: StreamId,
) -> Result<WcojRelationMetadata<u64>>
pub fn wcoj_clique6_metadata_recorded_u64( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>
Build leader-edge runtime metadata for a 6-clique 8-byte-width dispatch.
Sourcepub fn wcoj_clique5_u32_recorded(
&self,
edges: &[&CudaBuffer; 10],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique5_u32_recorded( &self, edges: &[&CudaBuffer; 10], launch_stream: StreamId, ) -> Result<CudaBuffer>
5-clique WCOJ at 4-byte width-class.
edges must contain exactly 10 2-column buffers in
canonical lex (i, j) order (i < j): (0,1), (0,2), (0,3), (0,4), (1,2), (1,3), (1,4), (2,3), (2,4), (3,4). Each
column may be U32 or Symbol (mixable within the
4-byte width-class). All edges must be lex-sorted+deduped
on (col0, col1) — the runtime dispatcher routes through
wcoj_layout_sort_u32_recorded before calling here.
Sourcepub fn wcoj_clique5_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 10],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique5_u32_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
5-clique WCOJ at 4-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique5_u64_recorded(
&self,
edges: &[&CudaBuffer; 10],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique5_u64_recorded( &self, edges: &[&CudaBuffer; 10], launch_stream: StreamId, ) -> Result<CudaBuffer>
5-clique WCOJ at 8-byte width-class (U64 only).
Sourcepub fn wcoj_clique5_u64_recorded_planned(
&self,
edges: &[&CudaBuffer; 10],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique5_u64_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
5-clique WCOJ at 8-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique6_u32_recorded(
&self,
edges: &[&CudaBuffer; 15],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique6_u32_recorded( &self, edges: &[&CudaBuffer; 15], launch_stream: StreamId, ) -> Result<CudaBuffer>
6-clique WCOJ at 4-byte width-class.
edges must contain exactly 15 2-column buffers in
canonical lex (i, j) order. Width-class + sort+dedup
pre-condition match wcoj_clique5_u32_recorded.
Sourcepub fn wcoj_clique6_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 15],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique6_u32_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
6-clique WCOJ at 4-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique6_u64_recorded(
&self,
edges: &[&CudaBuffer; 15],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique6_u64_recorded( &self, edges: &[&CudaBuffer; 15], launch_stream: StreamId, ) -> Result<CudaBuffer>
6-clique WCOJ at 8-byte width-class (U64 only).
Sourcepub fn wcoj_clique6_u64_recorded_planned(
&self,
edges: &[&CudaBuffer; 15],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique6_u64_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
6-clique WCOJ at 8-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique7_u32_recorded(
&self,
edges: &[&CudaBuffer; 21],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique7_u32_recorded( &self, edges: &[&CudaBuffer; 21], launch_stream: StreamId, ) -> Result<CudaBuffer>
7-clique WCOJ at 4-byte width-class.
Sourcepub fn wcoj_clique7_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 21],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique7_u32_recorded_planned( &self, edges: &[&CudaBuffer; 21], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
7-clique WCOJ at 4-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique7_u64_recorded(
&self,
edges: &[&CudaBuffer; 21],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique7_u64_recorded( &self, edges: &[&CudaBuffer; 21], launch_stream: StreamId, ) -> Result<CudaBuffer>
7-clique WCOJ at 8-byte width-class (U64 only).
Sourcepub fn wcoj_clique7_u64_recorded_planned(
&self,
edges: &[&CudaBuffer; 21],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique7_u64_recorded_planned( &self, edges: &[&CudaBuffer; 21], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
7-clique WCOJ at 8-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique8_u32_recorded(
&self,
edges: &[&CudaBuffer; 28],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique8_u32_recorded( &self, edges: &[&CudaBuffer; 28], launch_stream: StreamId, ) -> Result<CudaBuffer>
8-clique WCOJ at 4-byte width-class.
Sourcepub fn wcoj_clique8_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 28],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique8_u32_recorded_planned( &self, edges: &[&CudaBuffer; 28], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
8-clique WCOJ at 4-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique8_u64_recorded(
&self,
edges: &[&CudaBuffer; 28],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique8_u64_recorded( &self, edges: &[&CudaBuffer; 28], launch_stream: StreamId, ) -> Result<CudaBuffer>
8-clique WCOJ at 8-byte width-class (U64 only).
Sourcepub fn wcoj_clique8_u64_recorded_planned(
&self,
edges: &[&CudaBuffer; 28],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique8_u64_recorded_planned( &self, edges: &[&CudaBuffer; 28], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
8-clique WCOJ at 8-byte width-class using plan-derived launch params.
Sourcepub fn wcoj_clique5_groupby_root_count_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 10],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique5_groupby_root_count_u32_recorded_planned( &self, edges: &[&CudaBuffer; 10], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
Fused 5-clique count-by-root at the 4-byte width-class,
using plan-derived launch params. See
Self::wcoj_clique_groupby_root_count_recorded_inner.
Sourcepub fn wcoj_clique6_groupby_root_count_u32_recorded_planned(
&self,
edges: &[&CudaBuffer; 15],
leader_edge_idx: u32,
edge_order: &[u8],
iteration_order: &[u8],
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_clique6_groupby_root_count_u32_recorded_planned( &self, edges: &[&CudaBuffer; 15], leader_edge_idx: u32, edge_order: &[u8], iteration_order: &[u8], launch_stream: StreamId, ) -> Result<CudaBuffer>
Fused 6-clique count-by-root at the 4-byte width-class,
using plan-derived launch params. See
Self::wcoj_clique_groupby_root_count_recorded_inner.
Source§impl CudaKernelProvider
impl CudaKernelProvider
pub fn wcoj_build_metadata_u32_recorded( &self, input: &CudaBuffer, key_col_idx: usize, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u32>>
pub fn wcoj_build_metadata_u64_recorded( &self, input: &CudaBuffer, key_col_idx: usize, launch_stream: StreamId, ) -> Result<WcojRelationMetadata<u64>>
pub fn wcoj_triangle_hg_work_plan_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojTriangleHgWorkPlanU32>
pub fn wcoj_triangle_count_hg_u32_recorded( &self, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<CudaBuffer>
pub fn wcoj_triangle_hg_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Sourcepub fn wcoj_triangle_groupby_root_count_u32_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_groupby_root_count_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Aggregate-fused triangle group-by-root count: evaluate
q(X, count) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) grouped by the
variable-order root X, WITHOUT materializing the triangle rows.
Pipeline (all recorded; the triangle result never exists as rows):
- the standard histogram-guided work plan;
wcoj_triangle_groupby_root_count_hg_u32accumulates per-e_xy-row match counts (integer atomicAdd — order-insensitive, deterministic values) into a zero-initializedn_xy-long array;- a 2-column (X, count) staging buffer over the input rows is compacted to count>0 rows (group-by over the join result must not emit roots with no completion) and reduced per X via the recorded groupby Sum (rows are already X-sorted because e_xy is lex-sorted).
All reduction work is O(n_xy) — input-sized, never join-output-sized.
Output schema matches the unfused materialize+groupby-count baseline:
col0 = X (e_xy.col0 type, U32/Symbol), col1 = count (U64).
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.
Sourcepub fn wcoj_triangle_groupby_root_agg_u32_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
agg_op: AggOp,
value: WcojRootAggValue,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_groupby_root_agg_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, agg_op: AggOp, value: WcojRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Aggregate-fused triangle group-by-root sum/min/max: evaluate
q(X, agg(V)) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) with
agg ∈ {Sum, Min, Max} and V ∈ {Y, Z} grouped by the
variable-order root X, WITHOUT materializing the triangle rows.
Pipeline (all recorded; the triangle result never exists as rows):
- the standard histogram-guided work plan;
- the per-op fused kernel accumulates, per e_xy row, a match count
(compaction mask) and the per-row partial aggregate (integer
atomics — order-insensitive, deterministic values). Sum partials
are u64 (a per-row partial can exceed
u32::MAX); min partials start atu32::MAX, max partials at 0; - a 3-column (X, count, agg) staging buffer over the input rows
is compacted to count>0 rows (groups with no completion must be
absent) and reduced per X via the recorded groupby with the same
AggOp(Sum over the u64 partials; Min/Max over u32).
All reduction work is O(n_xy) — input-sized, never join-output-sized.
Output schema matches the unfused materialize+groupby baseline:
col0 = X (e_xy.col0 type, U32/Symbol), col1 = U64 for Sum,
U32 for Min/Max.
Bag semantics: every (Y, Z) completion contributes its value, exactly like aggregating the materialized projection.
§Errors
XlogError::Kernelifagg_opis not Sum/Min/Max, the value columns are not plain U32, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.
Sourcepub fn wcoj_triangle_groupby_root_count_u64_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_groupby_root_count_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
U64-key aggregate-fused triangle count sibling of
Self::wcoj_triangle_groupby_root_count_u32_recorded: evaluate
q(X, count) over the triangle shape grouped by the root X for U64
relations, WITHOUT materializing the triangle rows.
The recorded groupby is U32/Symbol-key only, so the per-X reduction
reuses the WCOJ relation metadata instead: e_xy is lex-sorted, so
wcoj_build_metadata_u64_recorded yields one (unique X, group start)
pair per root, and wcoj_groupby_root_segment_sum_counts_u32
accumulates the per-row match counts into per-unique-root u64
totals (integer atomicAdd — deterministic). Roots with zero
completions are compacted away. All reduction work is O(n_xy).
Output schema matches the unfused materialize+groupby baseline:
col0 = X (U64), col1 = count (U64).
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.
Sourcepub fn wcoj_triangle_groupby_root_agg_u64_recorded(
&self,
e_xy: &CudaBuffer,
e_yz: &CudaBuffer,
e_xz: &CudaBuffer,
agg_op: AggOp,
value: WcojRootAggValue,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_triangle_groupby_root_agg_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, agg_op: AggOp, value: WcojRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
U64-key aggregate-fused triangle sum/min/max sibling of
Self::wcoj_triangle_groupby_root_agg_u32_recorded: evaluate
q(X, agg(V)) :- e_xy(X,Y), e_yz(Y,Z), e_xz(X,Z) with
agg ∈ {Sum, Min, Max} and V ∈ {Y, Z} over U64 relations,
grouped by the variable-order root X, WITHOUT materializing the
triangle rows.
The recorded groupby is U32/Symbol-key only, so the per-X reduction reuses the WCOJ relation metadata (one unique root per group, e_xy lex-sorted) like the u64 count path:
- the per-op fused kernel accumulates, per e_xy row, a match count
and a u64 aggregate partial (integer atomics — deterministic;
sum wraps on overflow exactly like
groupby_sum_u64; min partials start atu64::MAX, max partials at 0); wcoj_groupby_root_segment_sum_counts_u32reduces per-row match counts to per-unique-root totals (the presence mask), and the per-opwcoj_groupby_root_segment_{sum,min,max}_values_u64kernel folds the per-row partials into per-unique-root u64 aggregates, skipping zero-match rows;- a (X, agg) staging buffer over the unique roots is compacted to count>0 groups.
All reduction work is O(n_xy) — input-sized, never join-output-sized.
Output schema matches the unfused materialize+groupby baseline
(legacy groupby widened to u64 values): col0 = X (U64),
col1 = U64 for sum, min and max alike.
§Errors
XlogError::Kernelifagg_opis not Sum/Min/Max, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.
pub fn wcoj_triangle_hg_count_phase_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<WcojTriangleHgCountPhaseU32>
pub fn wcoj_triangle_hg_materialize_phase_u32_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, count: WcojTriangleHgCountPhaseU32, launch_stream: StreamId, ) -> Result<CudaBuffer>
pub fn wcoj_triangle_hg_u32_with_plan_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, plan: &WcojTriangleHgWorkPlanU32, launch_stream: StreamId, ) -> Result<CudaBuffer>
pub fn wcoj_triangle_hg_work_plan_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojTriangleHgWorkPlanU64>
pub fn wcoj_triangle_hg_u64_recorded( &self, e_xy: &CudaBuffer, e_yz: &CudaBuffer, e_xz: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
pub fn wcoj_4cycle_hg_work_plan_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojCycle4HgWorkPlanU32>
pub fn wcoj_4cycle_hg_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Sourcepub fn wcoj_4cycle_groupby_root_count_u32_recorded(
&self,
e1: &CudaBuffer,
e2: &CudaBuffer,
e3: &CudaBuffer,
e4: &CudaBuffer,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_4cycle_groupby_root_count_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Aggregate-fused 4-cycle group-by-root count: evaluate
q(W, count) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) grouped by the
variable-order root W, WITHOUT materializing the 4-cycle rows.
Pipeline (all recorded; the 4-cycle result never exists as rows):
- the standard 4-cycle histogram-guided work plan;
wcoj_4cycle_groupby_root_count_hg_u32accumulates, per e1 row, a match count (integer atomicAdd — order-insensitive, deterministic values);- a (W, count) staging buffer over the input rows is compacted to count>0 rows (roots with no completion must be absent) and reduced per W via the recorded groupby Sum.
All reduction work is O(n_e1) — input-sized, never join-output-sized.
Output schema matches the unfused materialize+groupby baseline:
col0 = W (e1.col0 type, U32/Symbol), col1 = count (U64).
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.
Sourcepub fn wcoj_4cycle_groupby_root_agg_u32_recorded(
&self,
e1: &CudaBuffer,
e2: &CudaBuffer,
e3: &CudaBuffer,
e4: &CudaBuffer,
agg_op: AggOp,
value: Wcoj4CycleRootAggValue,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_4cycle_groupby_root_agg_u32_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, agg_op: AggOp, value: Wcoj4CycleRootAggValue, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Aggregate-fused 4-cycle group-by-root sum/min/max: evaluate
q(W, agg(V)) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) with
agg ∈ {Sum, Min, Max} and V ∈ {X, Y, Z} grouped by the
variable-order root W, WITHOUT materializing the 4-cycle rows.
Pipeline (all recorded; the 4-cycle result never exists as rows):
- the standard 4-cycle histogram-guided work plan;
- the per-op fused kernel accumulates, per e1 row, a match count
(compaction mask) and the per-row partial aggregate (integer
atomics — order-insensitive, deterministic values). Sum partials
are u64 (a per-row partial can exceed
u32::MAX); min partials start atu32::MAX, max partials at 0; - a 3-column (W, count, agg) staging buffer over the input rows
is compacted to count>0 rows (roots with no completion must be
absent) and reduced per W via the recorded groupby with the same
AggOp(Sum over the u64 partials; Min/Max over u32).
All reduction work is O(n_e1) — input-sized, never join-output-sized.
Output schema matches the unfused materialize+groupby baseline:
col0 = W (e1.col0 type, U32/Symbol), col1 = U64 for Sum,
U32 for Min/Max.
Bag semantics: every (X, Y, Z) completion contributes its value, exactly like aggregating the materialized projection.
§Errors
XlogError::Kernelifagg_opis not Sum/Min/Max, the value column is not plain U32, the manager has no runtime, the launch stream does not resolve, an input is not 2-column U32/Symbol, or any kernel launch fails.
Sourcepub fn wcoj_4cycle_groupby_root_count_u64_recorded(
&self,
e1: &CudaBuffer,
e2: &CudaBuffer,
e3: &CudaBuffer,
e4: &CudaBuffer,
block_work_unit: u32,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_4cycle_groupby_root_count_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
U64-key aggregate-fused 4-cycle count sibling of
Self::wcoj_4cycle_groupby_root_count_u32_recorded: evaluate
q(W, count) :- e1(W,X), e2(X,Y), e3(Y,Z), e4(Z,W) grouped by the
variable-order root W for U64 relations, WITHOUT materializing the
4-cycle rows.
The recorded groupby is U32/Symbol-key only, so the per-W reduction
reuses the WCOJ relation metadata instead (mirroring
Self::wcoj_triangle_groupby_root_count_u64_recorded): e1 is
lex-sorted, so wcoj_build_metadata_u64_recorded yields one
(unique W, group start) pair per root, and
wcoj_groupby_root_segment_sum_counts_u32 accumulates the per-row
match counts into per-unique-root u64 totals (integer atomicAdd —
deterministic). Roots with zero completions are compacted away.
All reduction work is O(n_e1).
Output schema matches the unfused materialize+groupby baseline:
col0 = W (U64), col1 = count (U64).
§Errors
XlogError::Kernelif the manager has no runtime, the launch stream does not resolve, an input is not 2-column U64, or any kernel launch fails.
pub fn wcoj_4cycle_hg_work_plan_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<WcojCycle4HgWorkPlanU64>
pub fn wcoj_4cycle_hg_u64_recorded( &self, e1: &CudaBuffer, e2: &CudaBuffer, e3: &CudaBuffer, e4: &CudaBuffer, block_work_unit: u32, launch_stream: StreamId, ) -> Result<CudaBuffer>
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub fn wcoj_project_2col_swap_recorded(
&self,
src: &CudaBuffer,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_project_2col_swap_recorded( &self, src: &CudaBuffer, launch_stream: StreamId, ) -> Result<CudaBuffer>
Produce an owned 2-col CudaBuffer whose columns are
[src.col(1), src.col(0)]. See module docs for the full
recorded / failure-drain contract.
Used by the variable-ordering dispatcher when a triangle non-default
leader requires a col-swap before
Self::wcoj_layout_u32_recorded sorts the result. The
4-cycle path is rotation-only and never invokes this
helper.
§Errors
XlogError::Kernelif the manager has no runtime, the stream doesn’t resolve, the input isn’t 2-col, or any queued DtoD copy fails. On any failure after the first queued copy, the launch stream is synchronized before the function returns.
Sourcepub fn wcoj_project_output_columns_recorded(
&self,
src: &CudaBuffer,
perm: &[usize],
head_schema: Schema,
launch_stream: StreamId,
) -> Result<CudaBuffer>
pub fn wcoj_project_output_columns_recorded( &self, src: &CudaBuffer, perm: &[usize], head_schema: Schema, launch_stream: StreamId, ) -> Result<CudaBuffer>
Produce an owned N-col CudaBuffer with columns
reordered per perm and the schema replaced with
head_schema.
perm[i] is the source column index that becomes output
column i. The dispatcher uses this post-kernel to remap
the kernel-direct output (in leader’s (a, b, c[, d])
order) into the rule’s head order.
See module docs for the full recorded / failure-drain contract.
§Errors
XlogError::Kernelif the manager has no runtime, the stream doesn’t resolve,perm.len() != head_schema.arity(), anypermindex is ≥src.arity(), or any queued DtoD copy fails. Failure-drain on Err.
Source§impl CudaKernelProvider
impl CudaKernelProvider
Sourcepub const DTOH_SMALL_METADATA_MAX_BYTES: usize = 4096
pub const DTOH_SMALL_METADATA_MAX_BYTES: usize = 4096
Hard cap (in bytes) for Self::dtoh_small_metadata_untracked.
Set deliberately small (4 KB) so the helper cannot become a
general-purpose vector D2H escape hatch — it’s strictly for
classifier histograms and similar small metadata round-trips.
Sourcepub fn new(
device: Arc<CudaDevice>,
memory: Arc<GpuMemoryManager>,
) -> Result<Self>
pub fn new( device: Arc<CudaDevice>, memory: Arc<GpuMemoryManager>, ) -> Result<Self>
Create a new CUDA kernel provider
Loads all kernel modules into the CUDA device. Prefers cubin for the detected SM arch, falls back to portable PTX (sm_75+).
§Arguments
device- The CUDA device to load modules intomemory- The GPU memory manager for kernel allocations
§Errors
Returns XlogError::Kernel if PTX loading fails
§Example
let device = Arc::new(CudaDevice::new(0)?);
let memory = Arc::new(GpuMemoryManager::new(device.clone(), MemoryBudget::default()));
let provider = CudaKernelProvider::new(device, memory)?;Sourcepub fn with_runtime(
device: Arc<CudaDevice>,
memory: Arc<GpuMemoryManager>,
) -> Result<Self>
pub fn with_runtime( device: Arc<CudaDevice>, memory: Arc<GpuMemoryManager>, ) -> Result<Self>
Construct a provider whose GpuMemoryManager must already
have a v0.6 crate::device_runtime::XlogDeviceRuntime
attached via GpuMemoryManager::with_runtime.
Equivalent to Self::new in every respect — same kernel
loading, same field initialization — but rejects managers
that lack a runtime. This guards against the misconfiguration
in which a caller asks for runtime-routed provider semantics
(by calling with_runtime) but supplies a legacy manager
built via GpuMemoryManager::new; without the check, the
resulting provider would silently keep using the cudarc
default allocator and the runtime budget/logging stack would
never observe the allocations the caller expected to be
routed through it.
Note: a runtime-routed manager passed to Self::new still
routes correctly — alloc::<T> and alloc_raw consult
memory.runtime() regardless of which provider constructor
was used. with_runtime exists for callers that want the
requirement enforced at construction time, not for
correctness of the routing itself.
This is the opt-in runtime entry point for providers.
Self::new continues to accept managers without a runtime
(the legacy default) and remains the production constructor
until the runtime stack is certified end-to-end.
§Errors
Returns XlogError::Kernel if memory.runtime() is None,
or anything Self::new would return.
§Example
let device = Arc::new(CudaDevice::new(0)?);
let runtime = Arc::new(XlogDeviceRuntime::with_resource(
Arc::clone(&device),
0,
Arc::new(StreamPool::with_defaults(Arc::clone(&device))),
Box::new(AsyncCudaResource::new(/* ... */)),
));
let memory = Arc::new(GpuMemoryManager::with_runtime(
Arc::clone(&device),
MemoryBudget::default(),
runtime,
));
let provider = CudaKernelProvider::with_runtime(device, memory)?;Sourcepub fn wcoj_layout_fast_path_hit_count(&self) -> u64
pub fn wcoj_layout_fast_path_hit_count(&self) -> u64
Number of times wcoj_layout_*_recorded short-circuited
to the fast-path (recorded clone) instead of running
dedup_full_row_recorded. Increments by 1 per
fast-path hit (3 hits per dispatch when all inputs are
already sorted+unique). Used by tests + the phase
report to confirm the fast-path fired.
Sourcepub fn wcoj_triangle_hg_dispatch_count(&self) -> u64
pub fn wcoj_triangle_hg_dispatch_count(&self) -> u64
Histogram-guided block-slice triangle WCOJ test/diagnostic counter: successful dispatches that routed through the provider entry.
Sourcepub fn reset_wcoj_layout_fast_path_hit_count(&self)
pub fn reset_wcoj_layout_fast_path_hit_count(&self)
Reset the fast-path hit counter to 0. Tests use this to scope counter assertions to a single dispatch.
Sourcepub fn wcoj_layout_sort_invocation_count(&self) -> u64
pub fn wcoj_layout_sort_invocation_count(&self) -> u64
Number of calls to wcoj_layout_sort_*_recorded since the
last reset. Diagnostic-only; used by dispatch-plan certification.
Sourcepub fn reset_wcoj_layout_sort_invocation_count(&self)
pub fn reset_wcoj_layout_sort_invocation_count(&self)
Reset the WCOJ layout-sort invocation counter to 0.
Sourcepub fn kclique_metadata_build_count(&self) -> u64
pub fn kclique_metadata_build_count(&self) -> u64
Number of K-clique leader-edge metadata builds since the last reset.
Sourcepub fn kclique_metadata_build_nanos(&self) -> u64
pub fn kclique_metadata_build_nanos(&self) -> u64
Cumulative nanoseconds spent building K-clique leader-edge metadata since the last reset.
Sourcepub fn reset_kclique_metadata_build_metrics(&self)
pub fn reset_kclique_metadata_build_metrics(&self)
Reset K-clique metadata build diagnostics.
Sourcepub fn device(&self) -> &Arc<CudaDevice>
pub fn device(&self) -> &Arc<CudaDevice>
Get the CUDA device
Sourcepub fn memory(&self) -> &Arc<GpuMemoryManager>
pub fn memory(&self) -> &Arc<GpuMemoryManager>
Get the GPU memory manager
Sourcepub fn ptx_load_profile(&self) -> Option<&PtxLoadProfile>
pub fn ptx_load_profile(&self) -> Option<&PtxLoadProfile>
Get PTX load profiling data (only populated when XLOG_WARMUP_PROFILE=1).
Sourcepub fn reset_host_transfer_stats(&self)
pub fn reset_host_transfer_stats(&self)
Reset tracked host transfer statistics.
Sourcepub fn host_transfer_stats(&self) -> HostTransferStats
pub fn host_transfer_stats(&self) -> HostTransferStats
Snapshot tracked host transfer statistics.
Sourcepub fn host_launch_metadata_transfer_stats(
&self,
) -> HostLaunchMetadataTransferStats
pub fn host_launch_metadata_transfer_stats( &self, ) -> HostLaunchMetadataTransferStats
Snapshot launch-parameter H2D uploads tracked separately from
host_transfer_stats.
Sourcepub fn d2h_transfer_count(&self) -> u64
pub fn d2h_transfer_count(&self) -> u64
Read the column-level D2H transfer counter.
This counter increments once per download_column_* call, enabling
callers (e.g. the ILP trainer) to assert that no column downloads
occurred during a performance-critical section.
Sourcepub fn reset_d2h_transfer_count(&self)
pub fn reset_d2h_transfer_count(&self)
Reset the column-level D2H transfer counter to zero.
Sourcepub fn untracked_metadata_dtoh_count(&self) -> u64
pub fn untracked_metadata_dtoh_count(&self) -> u64
Count of untracked control-plane metadata D2H reads
(dtoh_scalar_untracked + dtoh_small_metadata_untracked).
Sourcepub fn reset_untracked_metadata_dtoh_count(&self)
pub fn reset_untracked_metadata_dtoh_count(&self)
Reset the untracked metadata D2H read counter to zero.
Sourcepub fn enable_strict_deterministic_d2h(&self)
pub fn enable_strict_deterministic_d2h(&self)
Enable the strict deterministic-Datalog D2H gate.
While enabled, any data-plane device-to-host transfer (column downloads
via download_column / download_column_untracked, and any internal
transfer routed through dtoh_sync_copy_into_tracked) increments
CudaKernelProvider::deterministic_d2h_violation_count and returns
XlogError::Execution from the originating call.
Metadata reads via CudaKernelProvider::dtoh_scalar_untracked are
allowed and never trip the gate.
Default is false; the runtime opts in via
RuntimeConfig::strict_deterministic_d2h. v0.5.5 ships the gate
opt-in only — known-violating relational paths (set difference,
join count/materialize) are scheduled for replacement before the
default flips.
Sourcepub fn disable_strict_deterministic_d2h(&self)
pub fn disable_strict_deterministic_d2h(&self)
Disable the strict deterministic-Datalog D2H gate.
Sourcepub fn strict_deterministic_d2h_enabled(&self) -> bool
pub fn strict_deterministic_d2h_enabled(&self) -> bool
Returns whether the strict deterministic-Datalog D2H gate is enabled.
Sourcepub fn deterministic_d2h_violation_count(&self) -> u64
pub fn deterministic_d2h_violation_count(&self) -> u64
Cumulative deterministic-D2H gate violations since the last reset.
Sourcepub fn reset_deterministic_d2h_violations(&self)
pub fn reset_deterministic_d2h_violations(&self)
Reset the deterministic-D2H violation counter to zero.
Sourcepub fn dtoh_small_metadata_untracked<T: DeviceRepr + Default + Copy>(
&self,
src: &TrackedCudaSlice<T>,
count: usize,
) -> Result<Vec<T>>
pub fn dtoh_small_metadata_untracked<T: DeviceRepr + Default + Copy>( &self, src: &TrackedCudaSlice<T>, count: usize, ) -> Result<Vec<T>>
Read a small metadata vector (≤ Self::DTOH_SMALL_METADATA_MAX_BYTES)
from device to host WITHOUT updating the D2H transfer tracker.
Sibling of Self::dtoh_scalar_untracked for callers that need
a few bucket counts (the WCOJ skew classifier reads a 3 × 64 ×
u32 = 768-byte histogram in one go) instead of count separate
scalar reads. Like dtoh_scalar_untracked, this method is
whitelisted by the strict deterministic-D2H gate
(Self::enable_strict_deterministic_d2h) — it does NOT trip
the gate, on purpose, because metadata reads are part of the
determinism contract (just like a scalar total after a scan).
§Hard contract — DO NOT WIDEN THE CAP
The 4 KB cap is the contract. If a caller wants a larger D2H,
it’s a data-plane transfer and must go through the tracked
download_column* path. Widening this cap turns the helper
into a backdoor for tracked-bypass column reads, which would
silently invalidate the strict deterministic-D2H gate.
§Errors
XlogError::Kernelifcount * size_of::<T>()exceedsDTOH_SMALL_METADATA_MAX_BYTES.XlogError::Kernelifcountexceeds the device slice’s length, or if the inner sync copy fails.
Sourcepub fn dtoh_scalar_untracked<T: DeviceRepr + Default + Copy>(
&self,
src: &TrackedCudaSlice<T>,
index: usize,
) -> Result<T>
pub fn dtoh_scalar_untracked<T: DeviceRepr + Default + Copy>( &self, src: &TrackedCudaSlice<T>, index: usize, ) -> Result<T>
Read a single scalar from device to host WITHOUT updating the D2H transfer tracker. Use ONLY for metadata reads (e.g. total_nnz after an exclusive scan), never for data-plane transfers.
This makes the “metadata != data-plane” contract explicit and auditable: callers that bypass tracking must call this method (which is grep-able) rather than reaching for device().inner().
Sourcepub fn htod_sync_copy_into_tracked<T: DeviceRepr, Dst: DevicePtrMut<T>>(
&self,
src: &[T],
dst: &mut Dst,
) -> Result<()>
pub fn htod_sync_copy_into_tracked<T: DeviceRepr, Dst: DevicePtrMut<T>>( &self, src: &[T], dst: &mut Dst, ) -> Result<()>
Upload host data to device while recording data-plane H2D transfer stats.
Sourcepub fn htod_sync_copy_tracked<T: DeviceRepr>(
&self,
src: &[T],
) -> Result<CudaSlice<T>>
pub fn htod_sync_copy_tracked<T: DeviceRepr>( &self, src: &[T], ) -> Result<CudaSlice<T>>
Allocate a CUDA slice from host data while recording data-plane H2D transfer stats.
Sourcepub fn htod_launch_metadata_sync_copy_into<T: DeviceRepr, Dst: DevicePtrMut<T>>(
&self,
src: &[T],
dst: &mut Dst,
) -> Result<()>
pub fn htod_launch_metadata_sync_copy_into<T: DeviceRepr, Dst: DevicePtrMut<T>>( &self, src: &[T], dst: &mut Dst, ) -> Result<()>
Upload bounded launch metadata from host to device while recording it in the launch-metadata subcounter.
Sourcepub fn exclusive_scan_u32_inplace(
&self,
data: &mut TrackedCudaSlice<u32>,
n: u32,
) -> Result<()>
pub fn exclusive_scan_u32_inplace( &self, data: &mut TrackedCudaSlice<u32>, n: u32, ) -> Result<()>
Compute exclusive prefix sum of u8 mask, returns (prefix_sum_vec, total_count)
This is useful for compaction operations where we need to know:
- The output position for each input element (prefix sum)
- The total number of elements that pass the mask (count)
§Arguments
mask- A slice of u8 values (0 or non-zero)
§Returns
A tuple of:
Vec<u32>containing the exclusive prefix sumu32containing the total count of non-zero mask elements
§Example
let mask = vec![1u8, 0, 1, 1, 0, 1];
let (prefix_sum, count) = provider.prefix_sum_mask(&mask)?;
// prefix_sum = [0, 1, 1, 2, 3, 3]
// count = 4§Note
For small inputs (<=256 elements), a CPU scan is used for efficiency. For larger inputs, a three-phase multi-block GPU scan is used.
§Errors
Returns XlogError::Kernel if kernel execution fails
Sourcepub fn device_row_count(&self, buffer: &CudaBuffer) -> Result<usize>
pub fn device_row_count(&self, buffer: &CudaBuffer) -> Result<usize>
Read a buffer’s logical row count, using the host cache when available and falling back to a metadata-only device-to-host read when needed.
Sourcepub fn validated_logical_row_count(&self, buffer: &CudaBuffer) -> Result<usize>
pub fn validated_logical_row_count(&self, buffer: &CudaBuffer) -> Result<usize>
Read and validate a buffer’s logical row count for outward-facing APIs.
This keeps exported/query-visible lengths tied to the device logical row
count while still rejecting impossible metadata (logical_rows > row_cap).
Sourcepub fn create_empty_buffer(&self, schema: Schema) -> Result<CudaBuffer>
pub fn create_empty_buffer(&self, schema: Schema) -> Result<CudaBuffer>
Sourcepub fn create_zero_arity_buffer(
&self,
schema: Schema,
rows: u32,
) -> Result<CudaBuffer>
pub fn create_zero_arity_buffer( &self, schema: Schema, rows: u32, ) -> Result<CudaBuffer>
Create a zero-arity (nullary) relation buffer carrying rows unit tuples.
A nullary relation holds exactly when it has at least one row; its single
possible tuple is the empty tuple (). create_buffer_from_slices with no
column slices routes to create_empty_buffer (0 rows), which represents the
relation as absent — wrong for an asserted nullary fact. Nullary facts must
use this path so presence is materialized as one row.