This document describes XLOG’s performance benchmarking suite, methodology, and baseline metrics.

Running Benchmarks

Prerequisites

  • CUDA-capable NVIDIA GPU (compute capability 7.0+; development device: RTX PRO 3000 Blackwell, SM120)
  • CUDA Toolkit 13.x
  • Decision-DNNF knowledge compiler (for exact inference benchmarks that use external compilation)
  • Sufficient GPU memory (4GB minimum, 12GB recommended for neural-symbolic training)

Quick Start

# Run all benchmarks
cargo bench

# Run specific benchmark suite
cargo bench -p xlog-gpu    # Transitive closure, joins, aggregation
cargo bench -p xlog-prob   # Exact inference, Monte Carlo
cargo bench -p xlog-stats  # Statistics manager
cargo bench -p xlog-solve  # SAT solver

# Run specific benchmark group
cargo bench -p xlog-gpu -- tc_benches
cargo bench -p xlog-gpu -- join_benches

# Save a named baseline, then compare a later run against it (results land in target/criterion/)
cargo bench -- --save-baseline baseline_name
cargo bench -- --baseline baseline_name

Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| CUDA_VISIBLE_DEVICES | GPU device ordinal | 0 |
| XLOG_BENCH_MEMORY_MB | GPU memory budget | 4096 |
| WCOJ_BENCH_FULL | Run the full WCOJ triangle matrix (adds 100K + 250K row sizes) | 0 |
| XLOG_USE_WCOJ_TRIANGLE_U32 | Force-on the WCOJ triangle dispatch (bypasses adaptive classifier) | unset |
| XLOG_USE_WCOJ_TRIANGLE_ADAPTIVE | Adaptive (skew-classifier-gated) WCOJ dispatch; set to 0/false to opt out of default-on | unset (default-on) |
| XLOG_DISABLE_WCOJ_TRIANGLE | Hard kill switch: pins all WCOJ triangle dispatch off and beats every other flag | unset |
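
The precedence among the three WCOJ flags can be sketched as a pure function. This is an illustration only: the `Dispatch` enum, `resolve_dispatch`, and the value parsing ("1"/"true" as on, "0"/"false" as off) are assumptions made for the sketch, not XLOG's actual code, and the real parser may accept other spellings.

```rust
/// Hypothetical dispatch modes, named after the descriptions in the table.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Dispatch {
    Off,      // binary-join chain only
    Force,    // WCOJ always, classifier bypassed
    Adaptive, // classifier decides per triangle
}

/// Assumption: "1"/"true" mean on and "0"/"false" mean off (case-insensitive).
/// Anything else, including an unset variable, yields None.
fn flag(value: Option<&str>) -> Option<bool> {
    let lowered = value.map(|v| v.trim().to_ascii_lowercase());
    match lowered.as_deref() {
        Some("1") | Some("true") => Some(true),
        Some("0") | Some("false") => Some(false),
        _ => None,
    }
}

/// Resolves the three flags in the order the table implies:
/// kill switch first, then force-on, then the adaptive opt-out.
pub fn resolve_dispatch(
    disable: Option<&str>,
    force: Option<&str>,
    adaptive: Option<&str>,
) -> Dispatch {
    if flag(disable) == Some(true) {
        return Dispatch::Off; // beats every other flag
    }
    if flag(force) == Some(true) {
        return Dispatch::Force; // bypasses the adaptive classifier
    }
    if flag(adaptive) == Some(false) {
        return Dispatch::Off; // explicit opt-out of default-on
    }
    Dispatch::Adaptive // default-on when nothing is set
}
```

In a real process the three arguments would come from `std::env::var(...).ok()`; taking them as parameters keeps the sketch free of process-global state.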

Benchmark Categories

GPU Logic Benchmarks (xlog-gpu)

Location: crates/xlog-gpu/benches/logic_bench.rs

Transitive Closure

Tests recursive query evaluation (semi-naive fixpoint iteration).
| Benchmark | Description | Metric |
| --- | --- | --- |
| tc_chain | Chain graph 0→1→2→…→n | Iteration depth |
| tc_random | Random sparse graph | Rows/sec |
| tc_dense | Complete bipartite K_{n,n} | Output explosion |
Parameters:
  • tc_chain: depth 100, 500, 1000, 2000
  • tc_random: 10K, 100K, 1M edges
  • tc_dense: K_{100,100}, K_{200,200}, K_{500,500}
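
As a rough CPU-side illustration of what the tc_chain case exercises, the sketch below runs semi-naive fixpoint iteration over an edge list. It is not the GPU implementation: the function names are hypothetical, and the nested-loop join stands in for the hash join kernel.

```rust
use std::collections::HashSet;

/// Semi-naive transitive closure over a directed edge list.
/// Returns the closure and the number of fixpoint rounds executed,
/// counting the final round that derives nothing.
pub fn transitive_closure(edges: &[(u32, u32)]) -> (HashSet<(u32, u32)>, usize) {
    let mut closure: HashSet<(u32, u32)> = edges.iter().copied().collect();
    let mut delta: Vec<(u32, u32)> = closure.iter().copied().collect();
    let mut rounds = 0;
    while !delta.is_empty() {
        let mut next = Vec::new();
        // Join only the facts derived in the previous round with the base edges.
        for &(x, y) in &delta {
            for &(y2, z) in edges {
                // `insert` returns true only for facts not seen before.
                if y == y2 && closure.insert((x, z)) {
                    next.push((x, z));
                }
            }
        }
        delta = next;
        rounds += 1;
    }
    (closure, rounds)
}

/// Chain graph 0→1→…→depth, as in the tc_chain description.
pub fn chain(depth: u32) -> Vec<(u32, u32)> {
    (0..depth).map(|i| (i, i + 1)).collect()
}
```

For a chain of depth n, the closure holds n(n+1)/2 pairs and this sketch runs n rounds, which is why chain graphs stress iteration depth rather than per-round throughput.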

Hash Join Throughput

Tests GPU hash join kernel performance.
| Benchmark | Description | Metric |
| --- | --- | --- |
| join_throughput | Varying cardinality | Rows/sec |
| join_selectivity | Varying key range | Output rows/input |
| multiway_join | 3-way join | Intermediate explosion |
Parameters:
  • Cardinalities: 10Kx10K to 1Mx100K
  • Key ranges: 100 (high selectivity: many matches per key) to 100K (low selectivity: few matches per key)
  • Multi-way: 10K, 50K, 100K rows per relation
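
To see why the key range drives output size, here is a minimal CPU sketch of a build-probe hash join that only counts output rows. The helper names and the round-robin key generator are assumptions for illustration; the real benchmark fixtures may distribute keys differently.

```rust
use std::collections::HashMap;

/// Build-probe hash join on a single key column; returns the output row count.
pub fn hash_join_count(left: &[u32], right: &[u32]) -> u64 {
    // Build: count how many left rows carry each key.
    let mut build: HashMap<u32, u64> = HashMap::new();
    for &k in left {
        *build.entry(k).or_insert(0) += 1;
    }
    // Probe: each right row matches every left row with the same key.
    right
        .iter()
        .map(|k| build.get(k).copied().unwrap_or(0))
        .sum()
}

/// Deterministic keys 0, 1, ..., key_range - 1, repeating.
pub fn keys(rows: u32, key_range: u32) -> Vec<u32> {
    (0..rows).map(|i| i % key_range).collect()
}
```

With 10K rows per side, a key range of 100 yields 1,000,000 output rows, while a key range of 10K yields 10,000: same input size, a 100× difference in output.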

Aggregation

Tests GROUP BY with COUNT aggregate.
| Benchmark | Description | Metric |
| --- | --- | --- |
| aggregation | COUNT by group | Groups/sec |
Parameters:
  • 100K rows with 1K groups
  • 100K rows with 10K groups
  • 1M rows with 10K groups
  • 1M rows with 100K groups
Aggregation throughput is tracked in groups/sec, but the repository does not currently publish a single public pass/fail threshold for this case.

WCOJ Triangle (default-on adaptive, xlog-integration)

Location: crates/xlog-integration/benches/wcoj_triangle_bench.rs

Compares the GPU 3-way Worst-Case Optimal Join (WCOJ) dispatch against the existing binary-join chain on identical fixtures, across u32, u64, and a Symbol sanity case. Each cell runs in three modes:
  • Off: binary-join chain
  • Force: WCOJ pipeline always
  • Adaptive (default-on): the classifier runs and dispatches WCOJ on high-skew triangles only

The bench overrides each mode explicitly via RuntimeConfig::with_wcoj_triangle_dispatch[_adaptive] so the measured path does not depend on process-global state. Production callers can pin behavior via the variables listed under Environment Variables.

Run:
# Default matrix (~25 cells; few minutes)
cargo bench -p xlog-integration --bench wcoj_triangle_bench

# Full matrix adds 100K + 250K rows per relation (slow)
WCOJ_BENCH_FULL=1 cargo bench -p xlog-integration --bench wcoj_triangle_bench
| Bench Group | Fixture | Targets |
| --- | --- | --- |
| wcoj_triangle/uniform | Uniform Erdős-Rényi (key range = rows/10) | Average-case baseline |
| wcoj_triangle/superhub | Deterministic super-hub (~50% of edges concentrated on one Y / one X) | Histogram-targetable per-thread workload imbalance |
| wcoj_triangle/empty | Three relations over disjoint key ranges | Count→scan→empty fast path |
| wcoj_triangle/symbol_sanity | One uniform 10K case for Symbol | Symbol shares u32’s physical layout; sanity only |
Methodology:
  • Timed region = Executor::execute_plan only. Driven via b.iter_custom(...) so the per-iteration loop is owned by the harness. Each cell builds ONE long-lived Executor; put_relation uploads + store.remove("tri") cleanup live OUTSIDE the timed region. The long-lived Executor is required so the executor’s cached wcoj_triangle_stream (OnceLock<StreamId>) is acquired exactly once per cell and reused — a fresh Executor per iteration would drain the runtime’s StreamPool (cap 16, grow-only) past iteration 16.
  • Each (width, fixture, size) cell pre-runs an untimed correctness check: gate=Some(false) (binary-join) and gate=Some(true) (WCOJ) must produce identical row sets (host-side dedup of fixtures aligns the two paths to set semantics). Counter delta is also asserted inside iter_custom: gate=true must increment by iters over the loop, gate=false must increment by 0 — a silent fallback anywhere in the hot loop fails the bench.
  • Bench-only: the StreamPool cap is bumped to 1024 in make_provider (production default 16). The bench has many short-lived correctness-check executors that each acquire one stream; production runs at 16 because each long-lived process has one provider with one cached stream.
  • Baseline numbers, adaptive default-on acceptance, phase-timing evidence, and the post-layout-fast-path results are indexed in the WCOJ bench baseline evidence bundle. Default-on adaptive WCOJ for eligible non-recursive triangle rules ships with XLOG_DISABLE_WCOJ_TRIANGLE=1 as the hard kill switch. The WCOJ subsystem now covers triangles, cost-aware planning, recursive/SCC integration, and K-clique coverage; see the WCOJ architecture guide.
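
The timing discipline in the methodology above (setup and cleanup outside the timed region, plus a counter-delta assertion around the loop) can be sketched with the standard library alone. This mirrors what a Criterion `iter_custom` closure would do, but `time_iters`, `timed_with_counter_check`, and the counter are hypothetical names for this sketch, not XLOG's bench code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Runs `iters` iterations and accumulates time for `run` only;
/// `setup` and `cleanup` execute outside the timed region.
pub fn time_iters<S, R, C>(iters: u64, mut setup: S, mut run: R, mut cleanup: C) -> Duration
where
    S: FnMut(),
    R: FnMut(),
    C: FnMut(),
{
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        setup(); // e.g. uploading input relations
        let start = Instant::now();
        run(); // the measured call
        total += start.elapsed();
        cleanup(); // e.g. removing the output relation
    }
    total
}

/// Times the loop and asserts that `counter` advanced by exactly
/// `expected_delta`, so a silent fallback inside the loop fails loudly.
pub fn timed_with_counter_check<R: FnMut()>(
    counter: &AtomicU64,
    iters: u64,
    expected_delta: u64,
    run: R,
) -> Duration {
    let before = counter.load(Ordering::SeqCst);
    let elapsed = time_iters(iters, || {}, run, || {});
    let after = counter.load(Ordering::SeqCst);
    assert_eq!(after - before, expected_delta, "unexpected dispatch count in hot loop");
    elapsed
}
```

A forced-WCOJ cell would call this with `expected_delta == iters`, and a binary-join cell with `expected_delta == 0`.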

Probabilistic Benchmarks (xlog-prob)

Location: crates/xlog-prob/benches/prob_bench.rs

Exact Inference (Decision-DNNF)

Tests knowledge compilation and weighted model counting.
| Benchmark | Description | Metric |
| --- | --- | --- |
| exact_path | Probabilistic path | Circuits/sec |
| exact_grid | Probabilistic grid | Cells/sec |
| exact_bayesian | Bayesian network | Variables/circuit |
| exact_gradients | With gradient computation | Grads/sec |
Parameters:
  • Path lengths: 5, 10, 15, 20, 25 nodes
  • Grid sizes: 3x3, 4x4, 5x5, 6x6
  • Bayesian: 10, 20, 30, 50 variables
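
For readers unfamiliar with weighted model counting, the sketch below evaluates a tiny hand-built circuit bottom-up. The `Node` type and `example_circuit` are hypothetical; XLOG's compiled circuit representation is not shown in this document and is likely quite different.

```rust
/// A tiny NNF circuit (hypothetical representation for this sketch).
pub enum Node {
    Lit { var: usize, positive: bool },
    And(Vec<Node>),
    Or(Vec<Node>),
}

/// Weighted model count: a literal maps to p or 1 - p, AND multiplies, OR adds.
/// The result is a probability only if AND children share no variables
/// (decomposable) and OR children are mutually exclusive (deterministic).
pub fn wmc(node: &Node, prob: &[f64]) -> f64 {
    match node {
        Node::Lit { var, positive } => {
            if *positive {
                prob[*var]
            } else {
                1.0 - prob[*var]
            }
        }
        Node::And(children) => children.iter().map(|c| wmc(c, prob)).product(),
        Node::Or(children) => children.iter().map(|c| wmc(c, prob)).sum(),
    }
}

/// (x0 AND x1) OR (NOT x0 AND x2): a decision on variable 0.
pub fn example_circuit() -> Node {
    Node::Or(vec![
        Node::And(vec![
            Node::Lit { var: 0, positive: true },
            Node::Lit { var: 1, positive: true },
        ]),
        Node::And(vec![
            Node::Lit { var: 0, positive: false },
            Node::Lit { var: 2, positive: true },
        ]),
    ])
}
```

With probabilities 0.5, 0.4, and 0.2 for the three variables, the count is 0.5·0.4 + 0.5·0.2 = 0.3. Evaluation is linear in circuit size, so the expensive part of exact inference is the compilation that produces the circuit.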

Monte Carlo Inference

Tests GPU-accelerated random sampling.
| Benchmark | Description | Metric |
| --- | --- | --- |
| mc_samples | Sample count scaling | Samples/sec |
| mc_vars | Variable count scaling | Worlds/sec |
| mc_path | Probabilistic path | (samples × vars)/sec |
| mc_grid | Probabilistic grid | (samples × cells)/sec |
| mc_bayesian | Bayesian network | (samples × vars)/sec |
Parameters:
  • Sample counts: 1K, 5K, 10K, 50K, 100K
  • AD counts: 10, 50, 100, 500, 1000
  • Path lengths: 10, 25, 50, 100, 200
  • Grid sizes: 5x5, 10x10, 15x15, 20x20
  • Bayesian: 50, 100, 200, 500 variables
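
A single-threaded sketch of the sampling loop shows where the "samples × vars" work unit comes from. Everything here is illustrative: the `Lcg` type, its constants (Knuth's MMIX multiplier and increment, chosen for the sketch), and `mc_path_probability` are assumptions, not the GPU kernel.

```rust
/// 64-bit linear congruential generator with an explicit seed.
/// The constants are Knuth's MMIX parameters, picked for this sketch only.
pub struct Lcg(u64);

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg(seed)
    }

    /// Uniform f64 in [0, 1), built from the top 53 bits of the state.
    pub fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Estimates P(every edge of a path is present) by sampling possible worlds.
pub fn mc_path_probability(edge_probs: &[f64], samples: u32, seed: u64) -> f64 {
    let mut rng = Lcg::new(seed);
    let mut hits = 0u32;
    for _ in 0..samples {
        // Draw every edge, so each world costs one draw per variable.
        let mut all_present = true;
        for &p in edge_probs {
            if rng.next_f64() >= p {
                all_present = false;
            }
        }
        if all_present {
            hits += 1;
        }
    }
    hits as f64 / samples as f64
}
```

For three independent edges with probability 0.5 each, the estimate should approach 0.5³ = 0.125 as the sample count grows, and repeating the call with the same seed returns the same estimate.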

Statistics Manager Benchmarks (xlog-stats)

Location: crates/xlog-stats/benches/stats_bench.rs

Tests relation registration, cardinality tracking, join estimation.

Solver Benchmarks (xlog-solve)

Location: crates/xlog-solve/benches/solver_bench.rs

Tests SAT solving, gradient computation, state management.

Methodology

Measurement Approach

XLOG benchmarks use Criterion.rs for statistically rigorous performance measurement.
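| Setting | Value | Rationale |
| --- | --- | --- |
| Sample size | 10-100 | GPU warmup + noise reduction |
| Warm-up | 3 iterations | JIT compilation, caching |
| Significance level | 0.1 | p-value threshold for reporting a change (p < 0.10) |
| Noise threshold | 0.05 | Changes under 5% are treated as noise |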

Throughput Calculation

Throughput = Elements / Time

Elements vary by benchmark type:
- Transitive closure: input edges
- Joins: total input rows (left + right)
- Aggregation: input rows
- Exact inference: circuit variables
- MC inference: samples × variables
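
As a small worked instance of the formula, the sketch below computes throughput from an element count and an elapsed time. The helper names are hypothetical; in the benches themselves Criterion performs this division once `Throughput::Elements` is set.

```rust
use std::time::Duration;

/// Throughput = Elements / Time, in elements per second.
pub fn throughput(elements: u64, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}

/// Element count for a join: total input rows (left + right).
pub fn join_elements(left_rows: u64, right_rows: u64) -> u64 {
    left_rows + right_rows
}

/// Element count for Monte Carlo inference: samples × variables.
pub fn mc_elements(samples: u64, variables: u64) -> u64 {
    samples * variables
}
```

For example, 100K input edges processed in 12.456 ms give about 8.03 million elements per second, and a 1M × 100K join counts as 1.1M elements.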

Warm-up Protocol

GPU benchmarks include warm-up to ensure:
  1. PTX modules are compiled and cached
  2. Memory pools are initialized
  3. CUDA context is established

Reproducibility

All random data generation uses deterministic seeding:
  • LCG with fixed seed (no system entropy)
  • Same seed produces identical graphs
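
A minimal sketch of seeded graph generation follows. The generator constants (Knuth's MMIX multiplier and increment), the `Lcg` type, and `random_edges` are assumptions made for illustration; the document does not state which LCG parameters or seed the benches use.

```rust
/// 64-bit linear congruential generator; no system entropy involved.
/// The constants are Knuth's MMIX parameters, chosen for this sketch only.
pub struct Lcg {
    state: u64,
}

impl Lcg {
    pub fn new(seed: u64) -> Self {
        Lcg { state: seed }
    }

    /// Advances the state and returns its high 32 bits,
    /// which are better distributed than the low bits in an LCG.
    pub fn next_u32(&mut self) -> u32 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.state >> 32) as u32
    }
}

/// Generates `edges` directed edges over node ids 0..nodes.
/// The modulo introduces a slight bias, which is acceptable for a fixture sketch.
pub fn random_edges(nodes: u32, edges: usize, seed: u64) -> Vec<(u32, u32)> {
    let mut rng = Lcg::new(seed);
    (0..edges)
        .map(|_| {
            let src = rng.next_u32() % nodes;
            let dst = rng.next_u32() % nodes;
            (src, dst)
        })
        .collect()
}
```

Two calls with the same arguments return identical edge lists on any machine, which is what makes runs comparable across baselines.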

Baseline Metrics

Development hardware: NVIDIA RTX PRO 3000 Blackwell Generation Laptop GPU (12 GB, SM120, compute capability 12.0, driver 591.59).
Status of the tables below (audited 2026-06-10): the Transitive Closure, Hash Join, Exact Inference, and Monte Carlo tables are aspirational targets, not measured results. No published in-repo run backs them: the Criterion harnesses exist (crates/xlog-gpu/benches/, crates/xlog-prob/benches/), but their output is git-ignored and no baseline has been committed. Do not cite these numbers as evidence.

Measured, source-backed results in this repo are:
  • the WCOJ super-hub speedups (10.5×–33.8×, docs/evidence/2026-05-01-wcoj-bench-baseline/)
  • the neural-symbolic cache ablation below (2.74×, CI-backed, measured 2026-02-18)
Throughput on desktop-class GPUs (e.g. RTX 4090, RTX 5090) will differ due to higher memory bandwidth and SM count.

Transitive Closure (targets — unmeasured)

| Configuration | Target | Notes |
| --- | --- | --- |
| 100K random edges | >1M rows/sec | Sparse graph |
| 1M random edges | >5M rows/sec | Medium graph |
| K_{500,500} bipartite | >10M rows/sec | Dense output |

Hash Join (targets — unmeasured)

| Configuration | Target | Notes |
| --- | --- | --- |
| 100K × 100K | >50M rows/sec | Medium cardinality |
| 1M × 100K | >100M rows/sec | Large left relation |
| High selectivity | >20M rows/sec | Many output rows |

Exact Inference (targets — unmeasured)

| Configuration | Target | Notes |
| --- | --- | --- |
| 20-variable path | <100ms | Small circuit |
| 50-variable Bayesian | <500ms | Medium complexity |
| With gradients | <2× base | Backward pass overhead |

Monte Carlo (targets — unmeasured)

| Configuration | Target | Notes |
| --- | --- | --- |
| 100K samples, 100 vars | >10M worlds/sec | Throughput mode |
| 10K samples, 500 vars | >5M worlds/sec | Complexity mode |

Neural-Symbolic Training

Measured on development hardware with 01_minimal (MNIST addition, 512 images, 5 epochs, batch_size=64).
| Metric | Value | Notes |
| --- | --- | --- |
| PTX JIT (cold) | 0.02 s | Cubin loading (1750x speedup from ~35s) |
| first_epoch_sec | ~75 s | Cold-start (Decision-DNNF compile + verify), warm-starts drop to 0.26s |
| steady_epoch_sec_mean | ~0.25 s | Epochs 2-5 after warmup (Batched evaluation) |
| per_query_ms | ~1.0 ms | Per-query forward+backward through circuit |
| Cache speedup | 2.74x | Circuit caching vs no caching (95% CI: [2.29, 3.18]) |
Evidence: examples/neural/results/evidence/cache_ablation_20260218.json

Interpreting Results

Criterion Output

tc_random/edges/100K_edges
                        time:   [12.345 ms 12.456 ms 12.567 ms]
                        thrpt:  [7.9567 Melem/s 8.0283 Melem/s 8.1003 Melem/s]
                 change: [-2.5% -1.2% +0.1%] (p = 0.12 > 0.10)
                        No change in performance detected.
| Field | Meaning |
| --- | --- |
| time | [lower bound, estimate, upper bound] at 95% CI |
| thrpt | Throughput in million elements per second |
| change | Comparison vs baseline |
| p | Statistical significance |

Performance Regression Detection

A benchmark is flagged as a regression only if both conditions hold:
  1. the lower bound of the change interval exceeds +5% (the noise threshold)
  2. p < 0.10 (the significance level)
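
The two-part rule can be written as a small predicate. This is a sketch under stated assumptions: the `Change` struct and `is_regression` are hypothetical names, changes are expressed as fractions of the baseline time (positive means slower), and Criterion's own classification logic is not reproduced here.

```rust
/// Relative change in time versus the baseline, as fractions (0.05 = +5%).
pub struct Change {
    pub lower: f64,
    pub estimate: f64,
    pub upper: f64,
    pub p_value: f64,
}

/// Flags a regression only when the whole change interval sits above the
/// noise threshold and the p-value is below the significance level.
pub fn is_regression(change: &Change, noise_threshold: f64, significance_level: f64) -> bool {
    change.lower > noise_threshold && change.p_value < significance_level
}
```

Applied to the sample output above (change [-2.5% -1.2% +0.1%], p = 0.12) with thresholds 0.05 and 0.10, the predicate returns false, matching "No change in performance detected."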

Common Issues

| Symptom | Cause | Solution |
| --- | --- | --- |
| High variance | GPU thermal throttling | Cool-down period |
| First run slow | JIT compilation | Ignore first sample |
| OOM errors | Large input | Reduce memory budget |
| Missing benchmarks | No CUDA device | Check GPU availability |

CI Integration

GitHub Actions Workflow

# .github/workflows/bench.yml
name: Benchmarks

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Run benchmarks
        run: cargo bench --no-fail-fast -- --save-baseline pr_${{ github.sha }}

      - name: Compare to main
        if: github.event_name == 'pull_request'
        run: |
          git fetch origin main
          git checkout origin/main
          cargo bench --no-fail-fast -- --save-baseline main_baseline
          git checkout -
          cargo bench --no-fail-fast -- --baseline main_baseline --load-baseline pr_${{ github.sha }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: target/criterion/

Regression Alerts

CI fails if any benchmark shows both:
  • a regression of 10% or more vs the main branch
  • statistical significance of p < 0.05

Benchmark History

Historical results are stored in:
  • target/criterion/ (local)
  • GitHub Actions artifacts (CI)

Contributing Benchmarks

Adding a New Benchmark

  1. Create benchmark function:
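use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

// Placeholder for the code under test; replace with the real operation.
fn operation(size: u64) -> u64 {
    (0..size).sum()
}

fn bench_new_feature(c: &mut Criterion) {
    let mut group = c.benchmark_group("new_feature");
    group.sample_size(10);

    for size in [100u64, 1_000, 10_000] {
        // Report results as elements/sec for this input size.
        group.throughput(Throughput::Elements(size));
        group.bench_with_input(BenchmarkId::new("size", size), &size, |b, &size| {
            // black_box keeps the compiler from optimizing the call away.
            b.iter(|| black_box(operation(black_box(size))));
        });
    }

    group.finish();
}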
  2. Add to criterion group:
criterion_group!(
    name = my_benches;
    config = Criterion::default();
    targets = bench_new_feature
);

criterion_main!(my_benches);
  3. Add to Cargo.toml:
[[bench]]
name = "my_bench"
harness = false

[dev-dependencies]
criterion = "0.5"

Benchmark Guidelines

| Guideline | Rationale |
| --- | --- |
| Use black_box() | Prevent dead code elimination |
| Handle GPU errors gracefully | CI may lack GPU |
| Use deterministic data | Reproducibility |
| Document expected performance | Regression detection |
| Keep sample size reasonable | CI time budget |

Review Checklist

  • Benchmark measures meaningful operation
  • Throughput metric is appropriate
  • Parameters cover realistic range
  • Handles missing GPU gracefully
  • Documentation updated

See Also