Benchmarks - XLOG

This document describes XLOG’s performance benchmarking suite, methodology, and baseline metrics.

Running Benchmarks

Prerequisites

CUDA-capable NVIDIA GPU (compute capability 7.0+; development device: RTX PRO 3000 Blackwell, SM120)
CUDA Toolkit 13.x
Decision-DNNF knowledge compiler (for exact inference benchmarks that use external compilation)
Sufficient GPU memory (4GB minimum, 12GB recommended for neural-symbolic training)

Quick Start

# Run all benchmarks
cargo bench

# Run specific benchmark suite
cargo bench -p xlog-gpu    # Transitive closure, joins, aggregation
cargo bench -p xlog-prob   # Exact inference, Monte Carlo
cargo bench -p xlog-stats  # Statistics manager
cargo bench -p xlog-solve  # SAT solver

# Run specific benchmark group
cargo bench -p xlog-gpu -- tc_benches
cargo bench -p xlog-gpu -- join_benches

# Generate HTML report
cargo bench -- --save-baseline baseline_name
cargo bench -- --baseline baseline_name

Environment Variables

Variable	Description	Default
`CUDA_VISIBLE_DEVICES`	GPU device ordinal	`0`
`XLOG_BENCH_MEMORY_MB`	GPU memory budget	`4096`
`WCOJ_BENCH_FULL`	Run the full WCOJ triangle matrix (adds 100K + 250K row sizes)	`0`
`XLOG_USE_WCOJ_TRIANGLE_U32`	Force-on the WCOJ triangle dispatch (bypasses adaptive classifier)	unset
`XLOG_USE_WCOJ_TRIANGLE_ADAPTIVE`	Adaptive (skew-classifier-gated) WCOJ dispatch — set to `0`/`false` to opt out of default-on	unset (default-on)
`XLOG_DISABLE_WCOJ_TRIANGLE`	Hard kill switch — pins all WCOJ triangle dispatch off, beats every other flag	unset

Benchmark Categories

GPU Logic Benchmarks (`xlog-gpu`)

Location: crates/xlog-gpu/benches/logic_bench.rs

Transitive Closure

Tests recursive query evaluation (semi-naive fixpoint iteration).

Benchmark	Description	Metric
`tc_chain`	Chain graph 0→1→2→…→n	Iteration depth
`tc_random`	Random sparse graph	Rows/sec
`tc_dense`	Complete bipartite `K_{n,n}`	Output explosion

Parameters:

tc_chain: depth 100, 500, 1000, 2000
tc_random: 10K, 100K, 1M edges
tc_dense: K_{100,100}, K_{200,200}, K_{500,500}

Hash Join Throughput

Tests GPU hash join kernel performance.

Benchmark	Description	Metric
`join_throughput`	Varying cardinality	Rows/sec
`join_selectivity`	Varying key range	Output rows/input
`multiway_join`	3-way join	Intermediate explosion

Parameters:

Cardinalities: 10Kx10K to 1Mx100K
Key ranges: 100 (high selectivity) to 100K (low selectivity)
Multi-way: 10K, 50K, 100K rows per relation

Aggregation

Tests GROUP BY with COUNT aggregate.

Benchmark	Description	Metric
`aggregation`	COUNT by group	Groups/sec

Parameters:

100K rows with 1K groups
100K rows with 10K groups
1M rows with 10K groups
1M rows with 100K groups

Aggregation throughput is tracked in groups/sec, but the repository does not currently publish a single public pass/fail threshold for this case.

WCOJ Triangle (default-on adaptive, `xlog-integration`)

Location: crates/xlog-integration/benches/wcoj_triangle_bench.rs Compares the GPU 3-way Worst-Case Optimal Join dispatch against the existing binary-join chain on identical fixtures, across u32, u64, and a Symbol sanity case. Three modes per cell — Off (binary), Force (WCOJ pipeline always), Adaptive (default-on: classifier runs and dispatches WCOJ on high-skew triangles only). The bench overrides each mode explicitly via RuntimeConfig::with_wcoj_triangle_dispatch[_adaptive] to keep the measured path process-global-free. Production callers can pin behavior via the env vars in the table at the top of this file. Run:

# Default matrix (~25 cells; few minutes)
cargo bench -p xlog-integration --bench wcoj_triangle_bench

# Full matrix adds 100K + 250K rows per relation (slow)
WCOJ_BENCH_FULL=1 cargo bench -p xlog-integration --bench wcoj_triangle_bench

Bench Group	Fixture	Targets
`wcoj_triangle/uniform`	Uniform Erdős-Rényi (key range = rows/10)	Average-case baseline
`wcoj_triangle/superhub`	Deterministic super-hub (~50% of edges concentrated on one Y / one X)	Histogram-targetable per-thread workload imbalance
`wcoj_triangle/empty`	Three relations over disjoint key ranges	Count→scan→empty fast path
`wcoj_triangle/symbol_sanity`	One uniform 10K case for `Symbol`	Symbol shares u32’s physical layout — sanity only

Methodology:

Timed region = Executor::execute_plan only. Driven via b.iter_custom(...) so the per-iteration loop is owned by the harness. Each cell builds ONE long-lived Executor; put_relation uploads + store.remove("tri") cleanup live OUTSIDE the timed region. The long-lived Executor is required so the executor’s cached wcoj_triangle_stream (OnceLock<StreamId>) is acquired exactly once per cell and reused — a fresh Executor per iteration would drain the runtime’s StreamPool (cap 16, grow-only) past iteration 16.
Each (width, fixture, size) cell pre-runs an untimed correctness check: gate=Some(false) (binary-join) and gate=Some(true) (WCOJ) must produce identical row sets (host-side dedup of fixtures aligns the two paths to set semantics). Counter delta is also asserted inside iter_custom: gate=true must increment by iters over the loop, gate=false must increment by 0 — a silent fallback anywhere in the hot loop fails the bench.
Bench-only: the StreamPool cap is bumped to 1024 in make_provider (production default 16). The bench has many short-lived correctness-check executors that each acquire one stream; production runs at 16 because each long-lived process has one provider with one cached stream.
Baseline numbers, adaptive default-on acceptance, phase-timing evidence, and the post-layout-fast-path results are indexed in the WCOJ bench baseline evidence bundle. Default-on adaptive WCOJ for eligible non-recursive triangle rules ships with XLOG_DISABLE_WCOJ_TRIANGLE=1 as the hard kill switch; the WCOJ subsystem now covers triangles, cost-aware planning, recursive/SCC integration, and K-clique coverage — see the WCOJ architecture guide.

Probabilistic Benchmarks (`xlog-prob`)

Location: crates/xlog-prob/benches/prob_bench.rs

Exact Inference (Decision-DNNF)

Tests knowledge compilation and weighted model counting.

Benchmark	Description	Metric
`exact_path`	Probabilistic path	Circuits/sec
`exact_grid`	Probabilistic grid	Cells/sec
`exact_bayesian`	Bayesian network	Variables/circuit
`exact_gradients`	With gradient computation	Grads/sec

Parameters:

Path lengths: 5, 10, 15, 20, 25 nodes
Grid sizes: 3x3, 4x4, 5x5, 6x6
Bayesian: 10, 20, 30, 50 variables

Monte Carlo Inference

Tests GPU-accelerated random sampling.

Benchmark	Description	Metric
`mc_samples`	Sample count scaling	Samples/sec
`mc_vars`	Variable count scaling	Worlds/sec
`mc_path`	Probabilistic path	(samples × vars)/sec
`mc_grid`	Probabilistic grid	(samples × cells)/sec
`mc_bayesian`	Bayesian network	(samples × vars)/sec

Parameters:

Sample counts: 1K, 5K, 10K, 50K, 100K
AD counts: 10, 50, 100, 500, 1000
Path lengths: 10, 25, 50, 100, 200
Grid sizes: 5x5, 10x10, 15x15, 20x20
Bayesian: 50, 100, 200, 500 variables

Statistics Manager Benchmarks (`xlog-stats`)

Location: crates/xlog-stats/benches/stats_bench.rs Tests relation registration, cardinality tracking, join estimation.

Solver Benchmarks (`xlog-solve`)

Location: crates/xlog-solve/benches/solver_bench.rs Tests SAT solving, gradient computation, state management.

Methodology

Measurement Approach

XLOG benchmarks use Criterion.rs for statistically rigorous performance measurement.

Setting	Value	Rationale
Sample size	10-100	GPU warmup + noise reduction
Warm-up	3 iterations	JIT compilation, caching
Significance level	0.1	Detect 10% regressions
Noise threshold	0.05	Ignore `<5%` variance

Throughput Calculation

Throughput = Elements / Time

Elements vary by benchmark type:
- Transitive closure: input edges
- Joins: total input rows (left + right)
- Aggregation: input rows
- Exact inference: circuit variables
- MC inference: samples × variables

Warm-up Protocol

GPU benchmarks include warm-up to ensure:

PTX modules are compiled and cached
Memory pools are initialized
CUDA context is established

Reproducibility

All random data generation uses deterministic seeding:

LCG with fixed seed (no system entropy)
Same seed produces identical graphs

Baseline Metrics

Development hardware: NVIDIA RTX PRO 3000 Blackwell Generation Laptop GPU (12 GB, SM120, compute capability 12.0, driver 591.59).

Status of the tables below (audited 2026-06-10): the Transitive Closure, Hash Join, Exact Inference, and Monte Carlo tables are aspirational targets, not measured results. No published in-repo run backs them — the Criterion harnesses exist (crates/xlog-gpu/benches/, crates/xlog-prob/benches/) but their output is git-ignored and no baseline has been committed. Do not cite these numbers as evidence. Measured, source-backed results in this repo are: the WCOJ super-hub speedups (10.5×–33.8×, docs/evidence/2026-05-01-wcoj-bench-baseline/) and the neural-symbolic cache ablation below (2.74×, CI-backed, measured 2026-02-18).

Throughput on desktop-class GPUs (e.g. RTX 4090, RTX 5090) will differ due to higher memory bandwidth and SM count.

Transitive Closure (targets — unmeasured)

Configuration	Target	Notes
100K random edges	>1M rows/sec	Sparse graph
1M random edges	>5M rows/sec	Medium graph
`K_{500,500}` bipartite	>10M rows/sec	Dense output

Hash Join (targets — unmeasured)

Configuration	Target	Notes
100K × 100K	>50M rows/sec	Medium cardinality
1M × 100K	>100M rows/sec	Large left relation
High selectivity	>20M rows/sec	Many output rows

Exact Inference (targets — unmeasured)

Configuration	Target	Notes
20-variable path	`<100ms`	Small circuit
50-variable Bayesian	`<500ms`	Medium complexity
With gradients	`<2× base`	Backward pass overhead

Monte Carlo (targets — unmeasured)

Configuration	Target	Notes
100K samples, 100 vars	>10M worlds/sec	Throughput mode
10K samples, 500 vars	>5M worlds/sec	Complexity mode

Neural-Symbolic Training

Measured on development hardware with 01_minimal (MNIST addition, 512 images, 5 epochs, batch_size=64).

Metric	Value	Notes
`PTX JIT (cold)`	0.02 s	Cubin loading (1750x speedup from ~35s)
`first_epoch_sec`	~75 s	Cold-start (Decision-DNNF compile + verify), warm-starts drop to 0.26s
`steady_epoch_sec_mean`	~0.25 s	Epochs 2-5 after warmup (Batched evaluation)
`per_query_ms`	~1.0 ms	Per-query forward+backward through circuit
Cache speedup	2.74x	Circuit caching vs no caching (95% CI: [2.29, 3.18])

Evidence: examples/neural/results/evidence/cache_ablation_20260218.json

Interpreting Results

Criterion Output

tc_random/edges/100K_edges
                        time:   [12.345 ms 12.456 ms 12.567 ms]
                        thrpt:  [7.9567 Melem/s 8.0283 Melem/s 8.1003 Melem/s]
                 change: [-2.5% -1.2% +0.1%] (p = 0.12 > 0.10)
                        No change in performance detected.

Field	Meaning
`time`	[lower bound, estimate, upper bound] at 95% CI
`thrpt`	Throughput in million elements per second
`change`	Comparison vs baseline
`p`	Statistical significance

Performance Regression Detection

A benchmark is flagged as a regression if:

change lower bound > +5%
p < 0.10

Common Issues

Symptom	Cause	Solution
High variance	GPU thermal throttling	Cool-down period
First run slow	JIT compilation	Ignore first sample
OOM errors	Large input	Reduce memory budget
Missing benchmarks	No CUDA device	Check GPU availability

CI Integration

GitHub Actions Workflow

# .github/workflows/bench.yml
name: Benchmarks

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  benchmark:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4

      - name: Run benchmarks
        run: cargo bench --no-fail-fast -- --save-baseline pr_${{ github.sha }}

      - name: Compare to main
        if: github.event_name == 'pull_request'
        run: |
          git fetch origin main
          git checkout origin/main
          cargo bench --no-fail-fast -- --save-baseline main_baseline
          git checkout -
          cargo bench --no-fail-fast -- --baseline main_baseline --load-baseline pr_${{ github.sha }}

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: target/criterion/

Regression Alerts

CI fails if any benchmark shows:

10% regression vs main branch
Statistical significance p < 0.05

Benchmark History

Historical results are stored in:

target/criterion/ (local)
GitHub Actions artifacts (CI)

Contributing Benchmarks

Adding a New Benchmark

Create benchmark function:

fn bench_new_feature(c: &mut Criterion) {
    let mut group = c.benchmark_group("new_feature");
    group.sample_size(10);

    for size in [100, 1000, 10000].iter() {
        group.throughput(Throughput::Elements(*size as u64));
        group.bench_with_input(
            BenchmarkId::new("size", size),
            size,
            |b, &size| {
                b.iter(|| {
                    // Benchmark code here
                    black_box(operation(size))
                });
            },
        );
    }

    group.finish();
}

Add to criterion group:

criterion_group!(
    name = my_benches;
    config = Criterion::default();
    targets = bench_new_feature
);

criterion_main!(my_benches);

Add to Cargo.toml:

[[bench]]
name = "my_bench"
harness = false

[dev-dependencies]
criterion = "0.5"

Benchmark Guidelines

Guideline	Rationale
Use `black_box()`	Prevent dead code elimination
Handle GPU errors gracefully	CI may lack GPU
Use deterministic data	Reproducibility
Document expected performance	Regression detection
Keep sample size reasonable	CI time budget

Review Checklist

Benchmark measures meaningful operation
Throughput metric is appropriate
Parameters cover realistic range
Handles missing GPU gracefully
Documentation updated

​Running Benchmarks

​Prerequisites

​Quick Start

​Environment Variables

​Benchmark Categories

​GPU Logic Benchmarks (xlog-gpu)

​Transitive Closure

​Hash Join Throughput

​Aggregation

​WCOJ Triangle (default-on adaptive, xlog-integration)

​Probabilistic Benchmarks (xlog-prob)

​Exact Inference (Decision-DNNF)

​Monte Carlo Inference

​Statistics Manager Benchmarks (xlog-stats)

​Solver Benchmarks (xlog-solve)

​Methodology

​Measurement Approach

​Throughput Calculation

​Warm-up Protocol

​Reproducibility

​Baseline Metrics

​Transitive Closure (targets — unmeasured)

​Hash Join (targets — unmeasured)

​Exact Inference (targets — unmeasured)

​Monte Carlo (targets — unmeasured)

​Neural-Symbolic Training

​Interpreting Results

​Criterion Output

​Performance Regression Detection

​Common Issues

​CI Integration

​GitHub Actions Workflow

​Regression Alerts

​Benchmark History

​Contributing Benchmarks

​Adding a New Benchmark

​Benchmark Guidelines

​Review Checklist

​See Also

Running Benchmarks

Prerequisites

Quick Start

Environment Variables

Benchmark Categories

GPU Logic Benchmarks (`xlog-gpu`)

Transitive Closure

Hash Join Throughput

Aggregation

WCOJ Triangle (default-on adaptive, `xlog-integration`)

Probabilistic Benchmarks (`xlog-prob`)

Exact Inference (Decision-DNNF)

Monte Carlo Inference

Statistics Manager Benchmarks (`xlog-stats`)

Solver Benchmarks (`xlog-solve`)

Methodology

Measurement Approach

Throughput Calculation

Warm-up Protocol

Reproducibility

Baseline Metrics

Transitive Closure (targets — unmeasured)

Hash Join (targets — unmeasured)

Exact Inference (targets — unmeasured)

Monte Carlo (targets — unmeasured)

Neural-Symbolic Training

Interpreting Results

Criterion Output

Performance Regression Detection

Common Issues

CI Integration

GitHub Actions Workflow

Regression Alerts

Benchmark History

Contributing Benchmarks

Adding a New Benchmark

Benchmark Guidelines

Review Checklist

See Also