Probabilistic Engines

xlog-prob is XLOG’s probabilistic reasoning tier. It consumes a probabilistic .xlog program (probabilistic facts, annotated disjunctions, evidence, and probabilistic queries) and evaluates query probabilities either:

Exactly via GPU-native knowledge compilation (prob_engine=exact_ddnnf): PIR → GPU CNF → GPU Decision-DNNF compiler → XGCF circuit → GPU weighted model counting + gradients.
Approximately via Monte Carlo sampling (prob_engine=mc): GPU sampling of probabilistic leaves + deterministic evaluation per sampled world, with uncertainty reporting.

Language Surface (Probabilistic Profile)

The probabilistic surface is parsed by the xlog-logic frontend and represented in the AST. Supported constructs:

Probabilistic facts: p::atom(...). (Bernoulli)
Annotated disjunctions (AD): p1::a1(...); p2::a2(...). (categorical; optional “none” outcome if probabilities sum to < 1)
Evidence: evidence(atom(...), true|false).
Queries: query(atom(...)).
Engine selection: #pragma prob_engine = exact_ddnnf|mc (API overrides take precedence in Python)

Engine Selection and Contracts

`prob_engine=exact_ddnnf` (exact)

Primary goal: compute exact conditional probabilities P(Q|E) and (optionally) gradients w.r.t. probabilistic leaf log-weights on GPU. Current semantic constraints (enforced by implementation):

Negation is fully supported via NNF transformation and Well-Founded Semantics:
- Stratified negation: automatic layer detection and two-valued evaluation
- Non-monotone (cyclic) negation: WFS three-valued semantics (True/False/Undefined)
- Gradients flow through negated literals with correct sign flip
Recursive programs are supported; provenance is constructed as an acyclic PIR by semi-naive, iteration-indexed unrolling.
Aggregation is not supported in exact inference rule bodies.

If a program uses aggregation, use prob_engine=mc.

`prob_engine=mc` (approximate Monte Carlo)

Primary goal: provide a robust, explicit escape hatch for:

non-monotone recursion (cycles through not and/or aggregates), and
probabilistic programs that are not supported by the exact provenance compiler.

Outputs are marked approximate and include uncertainty metadata (standard error + confidence interval, sample counts, and seed). Non-monotone SCC semantics: xlog_prob::mc::NONMONOTONE_SEMANTICS (also surfaced to Python results).

Engine split + fail-closed contract: the production MC path is the GPU-resident megakernel engine, which rejects negation, aggregates, and other unbounded constructs with a typed ResidentRejection. Programs in that fragment do NOT run on the GPU: McProgram::evaluate fails closed with the rejection unless the caller explicitly sets McEvalConfig::allow_cpu_oracle_fallback (CLI: --allow-cpu-oracle; Python: evaluate(..., allow_cpu_oracle=True)), in which case the CPU oracle evaluates the program and the result is labeled McResult::engine = McEngine::CpuOracle (mc_engine: "cpu-oracle" in CLI JSON / Python metadata). CPU-oracle results are never valid GPU-native or zero-host evidence.

Epistemic Integration Contract

The epistemic integration contract treats accepted world views as the only valid source of epistemic probabilistic evidence. Raw generated guesses must not bypass world-view validation and enter the PIR or circuit layers as hidden rewrites. The bounded fixture API lives in xlog_prob::epistemic and documents the current contract:

EpistemicAssumption maps epistemic choices to compiler-facing evidence literals such as know:rain/0=true.
AcceptedWorldViewEvidence couples evidence assumptions to a non-empty accepted EpistemicWorldView; callers use it after semantic validation has already accepted the world view.
EpistemicCircuit keeps a compiled circuit fingerprint, active epistemic evidence, and deterministic query probability for fixture-scale integration tests.
KnowledgeCompilerAdapter::gpu_d4() represents the existing GPU Decision-DNNF/XGCF path and supports incremental evidence updates.
KnowledgeCompilerAdapter::external_ddnnf_text(...) records an alternative external Decision-DNNF text adapter design. It consumes DIMACS CNF and emits Decision-DNNF text, but is explicitly DesignOnly in this slice.
conditional_probability_from_logs normalizes log P(Q and E) and log P(E) with EPISTEMIC_PROBABILITY_TOLERANCE = 1e-12, clamping only values within that documented tolerance.

When an adapter supports incremental evidence, changing an epistemic assumption updates active evidence without changing the circuit fingerprint or incrementing the compile count. Adapters that do not support incremental evidence must report a full rebuild. This preserves coherent query semantics for the semantic oracle. The production-facing bridge is xlog_prob::epistemic_production::EpistemicProbProductionAdapter. It is a thin adapter over ExactDdnnfProgram: callers must provide accepted world-view evidence before source or parsed programs are compiled through the existing GPU-native exact/provenance path. Its trace reports GPU exact compiles, accepted-evidence consumption, optional GPU gradient evaluations, and hard-zero CPU-only probability recomputation and fixture-circuit counters. The corrected production gate is stricter: accepted world-view evidence must flow into the existing GPU-native exact/provenance path without CPU-only probability recomputation in the accepted execution path. The current xlog_prob::epistemic fixtures do not by themselves close that gate, and the production adapter is partial reuse evidence until accepted runtime world views are wired through it end to end.

Exact Path (`ExactDdnnfProgram`)

Pipeline Overview

Parse + stratify (xlog_logic::parse_program, then provenance lowering).
Provenance extraction → PIR graph:
- probabilistic facts become PIR leaf literals with probabilities (leaf_probs)
- annotated disjunctions become a chain of Bernoulli decision variables (choice_probs)
- derived tuples map to PIR formulas (tuple_formulas)
GPU PIR → CNF (encode_cnf_gpu) with a device-resident var map.
GPU Decision-DNNF compile + verify: CNF → device-resident XGCF with cache storage (compile_gpu_d4_and_verify_cached).
GPU evaluation via cache-aware kernels:
- forward pass computes log WMC(...) in log-space
- backward pass computes gradients w.r.t. leaf log-weights

Conditional probability

For each query variable Q and evidence E:

log Z_E = log WMC(E)
log Z_EQ = log WMC(E ∧ Q)
log P(Q|E) = log Z_EQ − log Z_E

This is implemented by ExactDdnnfProgram::evaluate and ExactDdnnfProgram::evaluate_gpu_with_grads.

GPU state and caching

ExactDdnnfProgram compiles CNF on the GPU, invokes GPU Decision-DNNF + GPU CDCL verification, and stores the resulting circuit in a device-resident GpuCircuitCache. The program holds a cache handle and CUDA provider in GpuExactState; evaluations reuse the cached slot and run cache-aware XGCF kernels with no CPU Decision-DNNF invocation and no CNF/DDNNF host materialization.

Orchestration boundary (honest residency statement)

XGCF forward evaluation launches one kernel per circuit level from a host loop (eval_log_wmc_device_inplace). On the GPU Decision-DNNF-native path the loop reads no per-level data back from the device (level sizing uses device-resident arrays; the host-side level_offsets mirror is populated only for host-uploaded circuits), and results stay on device until the caller downloads O(1) scalars (logZ, per-query gradients) after evaluation. Unlike the MC resident megakernel, exact inference is therefore GPU-accelerated with host-orchestrated level launches, not a single-launch device-resident engine — do not cite it under the MC engine’s no-host-interaction measured contract.

CPU Decision-DNNF compilation (removed)

The codebase does not include a CPU Decision-DNNF compilation pipeline:

No vendored Decision-DNNF compiler snapshot.
No shell-out to an external d4 binary.

Decision-DNNF parsing remains available for tests/fixtures only; production exact inference uses the GPU-native compiler + verifier and device-resident circuits. The GPU-native encoder (encode_cnf_gpu) produces a device-resident GpuCnf for the GPU Decision-DNNF/CDCL pipeline and is wired into ExactDdnnfProgram via compile_gpu_d4_and_verify_cached with a device-resident GpuCircuitCache.

GPU-Native Compilation + Verification

XLOG’s target architecture is a 100% GPU-native compilation + verification path (GPU Decision-DNNF + GPU CDCL verifier) with zero data-plane host transfers. This path is integrated into ExactDdnnfProgram:

xlog_prob::compilation::validate_equivalence_gpu proves φ ≡ C by solving two UNSAT queries on GPU:
- UNSAT(φ ∧ ¬C)
- UNSAT(C ∧ ¬φ)

Verifier contract:

Zero device→host reads in the production verifier path (the host never downloads SAT/UNSAT status).
Fail-fast on mismatch: GPU-side assertion/validation kernels trap; the host observes only CUDA success/failure.
Capacity-safe CNF handling: CNF buffers may be allocated with capacity greater than exact size; all index math uses device-resident GpuCnf::{num_vars,num_clauses,num_lits}.

Cache + integration is implemented: a GPU-resident circuit cache, cache-aware XGCF evaluation, GPU-only exact compilation (compile_gpu_d4_and_verify_cached), and guardrails enforcing no device→host reads in the cache path.

Monte Carlo Path (`McProgram`)

Sampling plan

McProgram compiles probabilistic leaves into a flat Bernoulli vector (bernoulli_probs: Vec<f32>):

each probabilistic fact is a direct Bernoulli variable
each annotated disjunction is encoded as a chain of conditional Bernoulli decisions (matching the exact/provenance lowering):
- for categorical probabilities (p0, p1, …, pk) the chain samples k-1 Bernoullis with conditional probabilities p_i / remaining
- if the probabilities sum to < 1, an explicit “none” outcome is represented by the remaining mass

This compilation is implemented by compile_sampling_plan.

GPU sampling

Sampling uses CudaKernelProvider::sample_bernoulli_matrix_device(...), which calls a dedicated sampling kernel to generate a row-major [samples][num_vars] matrix of 0/1 bytes on the GPU. Sampling is deterministic given seed.

GPU-resident execution engine (device-only counts)

The MC execution path is the GPU-resident Datalog engine, exposed via:

McProgram::evaluate_gpu_device(...) -> McDeviceResult
McProgram::evaluate_gpu_device_with_provider(...) -> McDeviceResult (caller-supplied provider)
McProgram::evaluate_resident_with_provider(...) -> McResidentResult (raw device result + no-host stats)

It evaluates all sampled worlds in a single megakernel launch and returns device-resident count buffers (query_counts, evidence_count) plus a per-world iter_trace. The sample/world id is the CUDA grid dimension. Sampled facts, derived relations, evidence flags, query counts, sparse row counts, sparse world offsets, and fixpoint state are all device-resident. The current sparse/WCOJ resident engine stores each world in a preallocated columnar arena (slot, arg0, arg1, arg2) with kernel-populated device row counters, device world offsets, convergence flags, and sparse-arena overflow flags. A dense bounded bitset remains only as a device-side membership index for duplicate suppression and query/evidence membership checks; it is no longer the only execution proof. Generic positive joins are evaluated from the sparse row arena by binding variables from matching body rows and appending derived heads with device atomic cursors. Recursive programs converge in the same megakernel through a device-side double-buffered fixpoint. The default path uses one CUDA block per world; setting XLOG_MC_RESIDENT_BLOCKS_PER_WORLD>1 before setup switches the same kernel to a cooperative multi-block-per-world launch with grid synchronization, explicit device fences around cooperative barriers, atomic reads of device change/continue flags, and device-written block participation counters. The final per-world convergence state is written to device memory, and no host read drives control flow. The supported production fragment remains bounded (arity <= 3, body <= 3, <= 8 variables). Programs that would exceed a configured deterministic resident arena budget fail closed before device allocation with XlogError::ResourceExhausted (resident_resource_budget, bound_bytes, budget_bytes); there is no CPU or host-sizing fallback.

No host interaction in the measured region

The predecessor no-host-loop implementation only removed tracked data-plane host-to-device/device-to-host transfers from a host loop that still orchestrated per sample: a Rust for sample_idx loop, per-sample kernel launches, and per-sample untracked dtoh_scalar_untracked row-count reads. That is not the guarantee here. The resident engine’s measured region is a single megakernel launch (+ one post-launch sync). It has zero host interaction: 0 tracked host-to-device, 0 tracked device-to-host, 0 untracked metadata reads (dtoh_scalar_untracked is never called in-region — a dedicated untracked_metadata_dtoh_count provider counter proves it), 0 host loop iterations, 0 per-sample host launches. This is measured via McNoHostStats (snapshots around the launch) and proven constant across N=128 and N=1024 (nothing scales per-sample on the host). Static setup (the seeded sample matrix, force arrays, plan upload, buffer allocation) happens before the region; final aggregates are read only after it. There is no host-orchestrated fallback: programs outside the supported fragment (bounded-domain positive Datalog — arity ≤ 3, body ≤ 3, ≤ 8 vars, bounded universe) fail closed before execution with a typed ResidentRejection (kind + construct + context). The CPU path (evaluate_cpu) is a seed-matched test oracle only — never accepted execution. Acceptance scope. GPU-native no-host evidence comes from the resident-engine test suite (seven exact sparse resident pilots, including ternary-relation input, with device sparse row-count evidence plus device-written offsets/convergence/overflow diagnostics and a cooperative two-block-per-world recursion pilot, recursion with a non-base derived tuple and more than one fixpoint iteration, plus fail-closed negative tests) and the device-count test gates. A full cargo test -p xlog-prob run includes CPU-oracle/host-output tests and is not no-host evidence.

Evidence conditioning and uncertainty reporting

mc uses rejection sampling for evidence:

Only samples satisfying evidence(...) are counted (evidence_samples).
If evidence is present and never satisfied, evaluation fails with a deterministic error.

For each query, mc estimates P(Q|E) as a binomial proportion and reports:

prob, log_prob
standard error (stderr)
two-sided confidence interval (ci_low, ci_high) for the configured confidence
samples, evidence_samples, seed
non-monotone SCC diagnostics (nonmonotone_sccs, nonmonotone_cycles, nonmonotone_iteration_limit_hits)

Host outputs and CPU validation (feature-gated)

Host-readable probability outputs are behind --features host-io:

McProgram::evaluate(...) -> McResult (and its alias evaluate_gpu(...)) run the GPU-native device loop and then materialize the result on the host by downloading the final query/evidence counts after the hot loop. This is host-result materialization, not a hot-loop transfer — the _gpu suffix marks GPU compute, it does not imply a zero-host result.
McProgram::evaluate_cpu(...) -> McResult is a CPU oracle/debug path that downloads the full sample matrix and evaluates worlds on the host. It is never GPU-native and never release evidence; it exists only as a seed-matched oracle.

Python API (`pyxlog`)

The PyO3 extension exposes two entry points:

Program.compile(source, device=0, memory_mb=32768, prob_engine=None) -> CompiledProgram
- CompiledProgram.evaluate(return_grads=False, ...) -> EvalResult
- outputs probabilities as DLPack tensors (prob, log_prob)
- exact engine optionally returns per-query gradients (grad_true, grad_false)
- MC engine returns uncertainty metadata and sets approx=True
LogicProgram.compile(source, device=0, memory_mb=32768) -> CompiledLogicProgram
- CompiledLogicProgram.evaluate(dlpack_inputs={...}) -> LogicEvalResult

All GPU table interchange is via DLPack capsules (framework-agnostic). The repository’s examples/python/ directory contains end-to-end scripts. For training workloads with neural predicates, pyxlog uses the GPU neural fast-path:

neural outputs are imported as CUDA tensors via DLPack (no .tolist()),
AD-chain weights and probability gradients are computed on GPU,
Torch receives device-resident gradients via output.backward(grad).
The strict GPU-native entrypoint is CompiledProgram.forward_backward_tensor(query) -> torch.Tensor which returns a CUDA scalar loss (no device→host reads required). The legacy forward_backward(query) -> f64 helper reads back a single scalar for convenience.

Provenance Primitives (Rust-only)

xlog-prob exposes retained provenance metadata so external Rust consumers can resolve PIR nodes back to their source atoms and annotated-disjunction metadata. All data is retained inline during extraction — no new passes or post-hoc reconstruction.

Types

ChoiceSource — captures explicit annotated-disjunction heads (with probabilities), the Bernoulli decision stage index (choice_index), and an optional AD identity (source_id, None in the initial API).

Fields on `Provenance`

leaf_atoms: BTreeMap<LeafId, GroundAtom> — one entry per probabilistic fact leaf
choice_sources: BTreeMap<ChoiceVarId, ChoiceSource> — one entry per Bernoulli decision variable

Accessors

leaf_atom(LeafId) -> Option<&GroundAtom> — resolve a PIR leaf to its source atom
choice_source(ChoiceVarId) -> Option<&ChoiceSource> — resolve a decision node to AD metadata
atoms_with_formulas() -> impl Iterator<Item = (&GroundAtom, PirNodeId)> — iterate all atoms with provenance formulas (exposes tuple_formulas read-only)

Re-exports

xlog-prob re-exports key types at crate root:

pub use pir::{ChoiceVarId, LeafId, PirGraph, PirNode, PirNodeId};
pub use provenance::{ChoiceSource, GroundAtom, Provenance, Value};

Use CHANGELOG.md and the crate-level API docs as the authoritative references for the provenance API.

Reproducibility

# Rust (workspace)
cargo test --workspace --all-targets --exclude pyxlog --release

# Rust (`xlog-prob` focused)
cargo test -p xlog-prob --all-targets --release

# CUDA certification suite (release)
cargo test -p xlog-cuda-tests --test certification_suite --release -- --nocapture

# Python examples (local wheel)
python scripts/install_pyxlog_for_python.py --python /usr/local/bin/python --user
python examples/python/03_prob_mc_nonmonotone_torch.py

​Language Surface (Probabilistic Profile)

​Engine Selection and Contracts

​prob_engine=exact_ddnnf (exact)

​prob_engine=mc (approximate Monte Carlo)

​Epistemic Integration Contract

​Exact Path (ExactDdnnfProgram)

​Pipeline Overview

​Conditional probability

​GPU state and caching

​Orchestration boundary (honest residency statement)

​CPU Decision-DNNF compilation (removed)

​GPU-Native Compilation + Verification

​Monte Carlo Path (McProgram)

​Sampling plan

​GPU sampling

​GPU-resident execution engine (device-only counts)

​No host interaction in the measured region

​Evidence conditioning and uncertainty reporting

​Host outputs and CPU validation (feature-gated)

​Python API (pyxlog)

​Provenance Primitives (Rust-only)

​Types

​Fields on Provenance

​Accessors

​Re-exports

​Reproducibility

​See Also