XLOG’s deterministic runtime is a host-controlled, GPU data-plane engine. The host executor walks RIR plans, chooses eligible CUDA provider entries, and stores intermediate relations as CUDA buffers. The kernels perform scans, filters, joins, groupby operations, arithmetic expressions, recursive delta steps, and specialized multiway routes.

Host Control, Device Data

xlog-runtime::Executor manages:
  • relation names, relation IDs, generations, and schemas;
  • a RelationStore backed by CudaBuffer values;
  • runtime statistics and join-selectivity observations;
  • persistent build-side join indexes;
  • dispatch counters for optimized routes;
  • recursive SCC state for seed, delta, and merge phases.
xlog-cuda::CudaKernelProvider owns the device-facing operations. It loads CUDA artifacts, allocates tracked slices, launches kernels, and records transfer telemetry where a path needs a no-host-transfer assertion. The executor itself is therefore not GPU-resident; the relation state and kernel workspaces are.

RIR Evaluation

The executor evaluates RIR nodes into relation buffers:
RIR work       GPU behavior
Scan           Return the current relation buffer from the store.
Filter         Build boolean masks with typed comparison, arithmetic, and boolean kernels, then compact selected rows.
Project        Select, reorder, or compute output columns.
Join           Dispatch hash join, nested-loop join, WCOJ, Free Join, or a fallback route depending on shape and runtime gates.
Groupby        Run recorded aggregate kernels for supported key/value widths.
Recursive SCC  Execute semi-naive seed and delta variants until convergence or a configured iteration limit.
Correctness is defined by row-set parity with the ordinary route. Optimized paths are allowed to decline when a shape, width, budget, or gate does not match.
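
The semi-naive loop and the parity rule above can be illustrated with a small host-side model. This is a minimal sketch in plain Python, not XLOG code: the function names, the transitive-closure workload, and the use of Python sets in place of CUDA buffers are all assumptions made for illustration. The naive fixpoint stands in for the "ordinary route", and parity is checked as set equality of rows.

```python
def naive_closure(edges):
    """Baseline stand-in: re-join the full relation every round."""
    reach = set(edges)
    while True:
        derived = {(a, d) for (a, b) in reach for (c, d) in edges if b == c}
        if derived <= reach:          # nothing new: fixpoint reached
            return reach
        reach |= derived


def semi_naive_closure(edges, max_iterations=1000):
    """Seed once, then join only the newest delta against the base edges."""
    by_source = {}
    for src, dst in edges:
        by_source.setdefault(src, set()).add(dst)

    reach = set(edges)                # seed phase
    delta = set(edges)
    for _ in range(max_iterations):
        # Delta phase: extend only rows discovered in the previous round.
        candidates = {(a, d) for (a, b) in delta for d in by_source.get(b, ())}
        delta = candidates - reach    # keep rows that are genuinely new
        if not delta:
            return reach              # converged
        reach |= delta                # merge phase
    raise RuntimeError("iteration limit reached before convergence")


edges = {(1, 2), (2, 3), (3, 4)}
# Row-set parity: both routes must produce the same set of rows.
assert semi_naive_closure(edges) == naive_closure(edges)
```

In this model a route that cannot finish within its limit raises instead of returning a partial answer, which mirrors the idea that an optimized path may decline but may not return different rows.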

Predicate And Arithmetic Masks

Filters lower into a mask pipeline:
  1. Arithmetic expression kernels produce temporary columns.
  2. Typed comparison kernels produce boolean masks.
  3. Boolean mask kernels combine predicates with and, or, and not.
  4. Stream compaction writes the filtered output buffer.
The runtime supports scalar integer, floating-point, boolean, and symbol comparisons according to the types represented in RIR and CUDA provider kernels. Float ordering semantics should be documented only where the corresponding filter tests and kernels cover them.
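
The four-step mask pipeline can be modeled on the host to make the data flow concrete. The sketch below is a plain-Python illustration under stated assumptions: the helper names are hypothetical, columns are Python lists rather than device buffers, and each helper stands in for one kind of kernel. It evaluates the filter (x + y > 6) and not (x > 7).

```python
def arithmetic_add(left, right):
    """Step 1: an arithmetic kernel producing a temporary column."""
    return [a + b for a, b in zip(left, right)]


def compare_gt(column, threshold):
    """Step 2: a typed comparison kernel producing a boolean mask."""
    return [value > threshold for value in column]


def mask_and(a, b):
    """Step 3: boolean mask combination (and)."""
    return [x and y for x, y in zip(a, b)]


def mask_not(a):
    """Step 3: boolean mask combination (not)."""
    return [not x for x in a]


def compact(columns, mask):
    """Step 4: stream compaction, keeping rows whose mask bit is set."""
    return {
        name: [value for value, keep in zip(values, mask) if keep]
        for name, values in columns.items()
    }


columns = {"x": [1, 5, 3, 8], "y": [4, 1, 9, 2]}

total = arithmetic_add(columns["x"], columns["y"])           # temporary column
mask = mask_and(compare_gt(total, 6), mask_not(compare_gt(columns["x"], 7)))
filtered = compact(columns, mask)
print(filtered)  # {'x': [3], 'y': [9]}
```

Keeping the mask separate from the data until the final compaction step means that predicates of any depth reduce to a single boolean column before any rows are moved.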

Joins And Recursion

Ordinary joins remain the baseline path. The runtime can also route specific shapes through specialized kernels:
  • hash joins for normal binary joins;
  • nested-loop joins for small eligible products;
  • WCOJ kernels for recognized triangle, 4-cycle, and clique shapes;
  • Free Join for broader multiway bodies (on main; not yet released as of 0.9.2);
  • factorized recursive-delta routing (on main; not yet released as of 0.9.2).
Each optimized route has a counter. For example, WCOJ, Free Join, fused aggregate, and factorized-delta counters distinguish “the answer matched” from “the optimized route actually fired.”
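
The relationship between declining, falling back, and counting can be sketched as follows. This is an illustrative plain-Python model, not the executor's dispatch code: the Dispatcher class, the route names used as counter keys, the max_product threshold, and the disabled set (a stand-in for a kill switch) are all hypothetical.

```python
from collections import Counter


def hash_join(left, right):
    """Baseline binary join: left's second column equals right's first."""
    index = {}
    for b, c in right:
        index.setdefault(b, []).append(c)
    return {(a, b, c) for a, b in left for c in index.get(b, ())}


def nested_loop_join(left, right, max_product=16):
    """Optimized stand-in: returns None to decline when the product is too large."""
    if len(left) * len(right) > max_product:
        return None
    return {(a, b, c) for a, b in left for b2, c in right if b == b2}


class Dispatcher:
    def __init__(self, disabled=()):
        self.counters = Counter()
        self.disabled = set(disabled)  # hypothetical kill switches

    def join(self, left, right):
        if "nested_loop" not in self.disabled:
            rows = nested_loop_join(left, right)
            if rows is not None:
                # Count only when the optimized route actually produced the result.
                self.counters["nested_loop"] += 1
                return rows
        self.counters["hash_join"] += 1
        return hash_join(left, right)


left, right = [(1, 2), (3, 2)], [(2, 9)]
optimized = Dispatcher()
baseline = Dispatcher(disabled={"nested_loop"})

# Same rows either way; only the counters reveal which route ran.
assert optimized.join(left, right) == baseline.join(left, right)
print(optimized.counters["nested_loop"], baseline.counters["nested_loop"])  # 1 0
```

Both dispatchers return identical rows here, so result equality says nothing about routing; the counter is the only signal that separates the two runs.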

Ingestion And Diagnostics

Large graph ingestion and delta diagnostics are adjacent runtime surfaces rather than the core RIR loop:
  • xlog_gpu::biokg::StreamingGraphRelationLoader streams JSONL, CSV, and N-Triples graph rows into typed edge records with bounded-memory telemetry.
  • DeltaPlannerTelemetry reports cache reuse, fallback decisions, affected SCCs, recomputed SCCs, and estimated versus measured delta behavior.
  • pyxlog exposes planner telemetry through diagnostic result payloads.
Use these surfaces when investigating data loading and incremental behavior. Use dispatch counters and transfer telemetry when investigating GPU execution.

What To Verify

When you need to prove that a workload used the intended GPU path, check:
  • the route counter for the optimized dispatch;
  • kill-switch parity against the fallback route;
  • transfer telemetry for no-host-transfer claims;
  • CUDA-required validation when the claim depends on actual GPU execution.
Final result equality alone is not enough to prove an optimized GPU route fired.
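
The checklist above can be gathered into a single helper. The sketch below is a hypothetical harness in plain Python, not a pyxlog or xlog-runtime API: the function name, its parameters, and the shape of the counter snapshots (plain dicts keyed by route name) are assumptions made for illustration.

```python
def verify_route(route, counters_before, counters_after,
                 optimized_rows, fallback_rows,
                 host_transfers=None, cuda_validated=True):
    """Return a list of failed checks; an empty list means every check passed."""
    failures = []

    # 1. The route counter must advance during the run.
    if counters_after.get(route, 0) <= counters_before.get(route, 0):
        failures.append(f"route counter {route!r} did not advance")

    # 2. Kill-switch parity: same row set as the fallback route.
    if set(optimized_rows) != set(fallback_rows):
        failures.append("row set differs from the fallback route")

    # 3. Transfer telemetry, checked only when a no-host-transfer claim is made.
    if host_transfers is not None and host_transfers != 0:
        failures.append(f"{host_transfers} host transfer(s) recorded")

    # 4. CUDA-required validation for claims about actual GPU execution.
    if not cuda_validated:
        failures.append("run was not validated on a CUDA device")

    return failures


rows = [(1, 2, 3), (2, 3, 1)]

# Counter advanced, rows match, no host transfers: all checks pass.
print(verify_route("wcoj", {"wcoj": 2}, {"wcoj": 3}, rows, rows, host_transfers=0))  # []

# Rows match but the counter did not move: the optimized route never fired.
print(verify_route("wcoj", {"wcoj": 2}, {"wcoj": 2}, rows, rows))
```

The second call shows the case the closing sentence warns about: the rows agree, yet the check fails because the route counter stayed flat.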