Module budget


GlobalDeviceBudget — per-runtime byte-limit decorator.

Wraps a DeviceMemoryResource and enforces a single byte limit across all allocations that flow through it. Designed to be the per-runtime singleton replacement for the v0.5 per-provider GpuMemoryManager (which had no way to enforce a coherent budget across parallel tests, multiple providers, or Python callers sharing one physical GPU).

§Accounting model

GlobalDeviceBudget keeps reserved_bytes strictly equal to inner.bytes_outstanding() at every quiescent moment. This is the “live + retired-but-not-yet-freed” view from the trait — exactly the bytes the budget should be guarding.

To keep that invariant under both synchronous and stream-ordered async inners, every public method is serialized through a single Mutex<BudgetState>, and the inner call is made while the lock is held. The lock is therefore held for the duration of the inner’s CUDA call, which is the dominant cost in any case, so the decorator adds little per-call overhead beyond what the inner already imposes.

§Allocate

  1. Lock state.
  2. If reserved + bytes > limit: return ResourceError::OutOfBudget { requested, remaining }.
  3. Optimistically reserve: reserved += bytes.
  4. Call inner.allocate(bytes, ..) under the lock. The inner’s own bookkeeping moves bytes from “free” to “live”.
  5. If inner returned Err, roll back the reservation: reserved -= bytes. Forward the error.
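The steps above can be sketched as follows. This is a minimal illustration, not the crate's implementation: the `Inner` trait, the `Budget` and `FakeInner` types, the `u64` handle, and the choice to keep the inner inside the mutex are all assumptions made to keep the example self-contained. Only `ResourceError::OutOfBudget { requested, remaining }` and the reserve / roll-back order come from the text. The limit check is written as `bytes > remaining` rather than `reserved + bytes > limit` so the comparison cannot overflow.

```rust
use std::sync::Mutex;

#[derive(Debug, PartialEq)]
enum ResourceError {
    OutOfBudget { requested: usize, remaining: usize },
    // Hypothetical variant standing in for any error the inner reports.
    InnerFailed(String),
}

// Hypothetical, reduced stand-in for the inner resource.
trait Inner {
    fn allocate(&mut self, bytes: usize) -> Result<u64, ResourceError>;
}

struct BudgetState<I> {
    reserved: usize,
    inner: I,
}

struct Budget<I> {
    limit: usize,
    state: Mutex<BudgetState<I>>,
}

impl<I: Inner> Budget<I> {
    fn new(limit: usize, inner: I) -> Self {
        Budget { limit, state: Mutex::new(BudgetState { reserved: 0, inner }) }
    }

    fn allocate(&self, bytes: usize) -> Result<u64, ResourceError> {
        // 1. Lock state; the guard is held until this function returns.
        let mut st = self.state.lock().unwrap();
        // 2. Reject requests that would exceed the limit.
        let remaining = self.limit.saturating_sub(st.reserved);
        if bytes > remaining {
            return Err(ResourceError::OutOfBudget { requested: bytes, remaining });
        }
        // 3. Optimistically reserve.
        st.reserved += bytes;
        // 4. Call the inner while still holding the lock.
        match st.inner.allocate(bytes) {
            Ok(handle) => Ok(handle),
            // 5. Roll back the reservation and forward the error.
            Err(e) => {
                st.reserved -= bytes;
                Err(e)
            }
        }
    }

    fn reserved(&self) -> usize {
        self.state.lock().unwrap().reserved
    }
}

// Test double: succeeds with increasing handles, or always fails.
struct FakeInner {
    always_fail: bool,
    next_handle: u64,
}

impl Inner for FakeInner {
    fn allocate(&mut self, _bytes: usize) -> Result<u64, ResourceError> {
        if self.always_fail {
            return Err(ResourceError::InnerFailed("simulated failure".to_string()));
        }
        self.next_handle += 1;
        Ok(self.next_handle)
    }
}
```

With a limit of 100 bytes, a 60-byte allocation succeeds and a following 50-byte request is rejected with `remaining: 40`, leaving the reservation untouched. A failing inner leaves `reserved` at zero after the rollback.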

§Deallocate / Reap

For both methods we sample inner.bytes_outstanding() before and after the inner call (under the lock), and decrement reserved by the observed delta. The pattern handles both backends without branching:

  • Synchronous inner (DirectCudaResource): bytes_outstanding drops by the block’s bytes on deallocate, so the delta is block.bytes. reap_pending is a no-op (delta zero).
  • Stream-ordered async inner (AsyncCudaResource): deallocate moves bytes from “live” to “pending”; bytes_outstanding stays the same, so the delta is zero — the budget is not released yet. reap_pending drains the pending bytes whose queued cuMemFreeAsync has completed; bytes_outstanding drops by the drained total and the budget releases that same total.

Because the inner call and the before/after samples happen under the same lock, no concurrent budget op can perturb the inner’s bytes_outstanding between our reads — the delta strictly reflects this call’s effect on the inner.
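The before/after sampling can be illustrated with two toy inners. Everything in this sketch is hypothetical: the reduced `Inner` trait, `SyncInner`, `AsyncInner`, and the `release_by_delta` helper are stand-ins, and the async fake assumes every queued free has completed by the time `reap_pending` runs. It is meant only to show that one code path yields the right release for both behaviours described above.

```rust
use std::sync::Mutex;

// Hypothetical, reduced view of the inner resource.
trait Inner {
    fn bytes_outstanding(&self) -> usize;
    fn deallocate(&mut self, bytes: usize);
    fn reap_pending(&mut self);
}

// Synchronous stand-in: bytes are freed immediately on deallocate.
struct SyncInner {
    live: usize,
}

impl Inner for SyncInner {
    fn bytes_outstanding(&self) -> usize { self.live }
    fn deallocate(&mut self, bytes: usize) { self.live -= bytes; }
    fn reap_pending(&mut self) {}
}

// Stream-ordered stand-in: deallocate only moves bytes to "pending".
struct AsyncInner {
    live: usize,
    pending: usize,
}

impl Inner for AsyncInner {
    fn bytes_outstanding(&self) -> usize { self.live + self.pending }
    fn deallocate(&mut self, bytes: usize) {
        self.live -= bytes;
        self.pending += bytes;
    }
    // Assumption: all queued frees have completed when this is called.
    fn reap_pending(&mut self) { self.pending = 0; }
}

struct BudgetState<I> {
    reserved: usize,
    inner: I,
}

struct Budget<I> {
    state: Mutex<BudgetState<I>>,
}

impl<I: Inner> Budget<I> {
    fn new(inner: I) -> Self {
        // Start with reserved equal to what the inner already holds.
        let reserved = inner.bytes_outstanding();
        Budget { state: Mutex::new(BudgetState { reserved, inner }) }
    }

    // Run `op` on the inner and release whatever bytes_outstanding dropped by.
    fn release_by_delta(st: &mut BudgetState<I>, op: impl FnOnce(&mut I)) {
        let before = st.inner.bytes_outstanding();
        op(&mut st.inner);
        let after = st.inner.bytes_outstanding();
        let delta = before.saturating_sub(after);
        st.reserved = st.reserved.saturating_sub(delta);
    }

    fn deallocate(&self, bytes: usize) {
        let mut st = self.state.lock().unwrap();
        Self::release_by_delta(&mut st, |inner| inner.deallocate(bytes));
    }

    fn reap_pending(&self) {
        let mut st = self.state.lock().unwrap();
        Self::release_by_delta(&mut st, |inner| inner.reap_pending());
    }

    fn reserved(&self) -> usize {
        self.state.lock().unwrap().reserved
    }
}
```

With `SyncInner`, `deallocate` releases the budget at once and `reap_pending` changes nothing. With `AsyncInner`, `deallocate` leaves `reserved` unchanged and the release happens only in `reap_pending`.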

§Composition

GlobalDeviceBudget is a normal DeviceMemoryResource, so it plugs into [XlogDeviceRuntime::with_resource] and can be stacked either under or over [LoggingResource]. Recommended ordering for production: GlobalDeviceBudget(LoggingResource(AsyncCudaResource)). With this ordering the budget does its accounting atomically at the outermost layer, the logger sees only the calls that are actually forwarded (so OutOfBudget errors do not get double-logged), and the underlying allocator is reached last. Tests can stack either way.
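The effect of the recommended ordering can be sketched with toy decorators. None of the types below are the crate's: `Resource`, `Allocator`, `Logging`, `Budget`, and `production_stack` are hypothetical stand-ins for DeviceMemoryResource, AsyncCudaResource, LoggingResource, GlobalDeviceBudget, and the runtime wiring, reduced to a single `allocate` method. The sketch assumes the logger records each call it forwards.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, PartialEq)]
enum ResourceError {
    OutOfBudget { requested: usize, remaining: usize },
}

// Hypothetical one-method version of the resource trait.
trait Resource {
    fn allocate(&self, bytes: usize) -> Result<(), ResourceError>;
}

// Stand-in for the underlying allocator; always succeeds.
struct Allocator;

impl Resource for Allocator {
    fn allocate(&self, _bytes: usize) -> Result<(), ResourceError> {
        Ok(())
    }
}

// Stand-in logger: records every call that reaches it.
struct Logging<R> {
    inner: R,
    log: Arc<Mutex<Vec<String>>>,
}

impl<R: Resource> Resource for Logging<R> {
    fn allocate(&self, bytes: usize) -> Result<(), ResourceError> {
        let result = self.inner.allocate(bytes);
        self.log
            .lock()
            .unwrap()
            .push(format!("allocate({}) -> {:?}", bytes, result));
        result
    }
}

// Stand-in budget: checks and reserves before forwarding.
struct Budget<R> {
    inner: R,
    limit: usize,
    reserved: Mutex<usize>,
}

impl<R: Resource> Resource for Budget<R> {
    fn allocate(&self, bytes: usize) -> Result<(), ResourceError> {
        let mut reserved = self.reserved.lock().unwrap();
        let remaining = self.limit.saturating_sub(*reserved);
        if bytes > remaining {
            // Rejected here: the logger and allocator are never called.
            return Err(ResourceError::OutOfBudget { requested: bytes, remaining });
        }
        *reserved += bytes;
        let result = self.inner.allocate(bytes);
        if result.is_err() {
            *reserved -= bytes;
        }
        result
    }
}

// Budget outermost, logger in the middle, allocator innermost.
fn production_stack(
    limit: usize,
    log: Arc<Mutex<Vec<String>>>,
) -> Budget<Logging<Allocator>> {
    Budget {
        inner: Logging { inner: Allocator, log },
        limit,
        reserved: Mutex::new(0),
    }
}
```

In this arrangement a request rejected by the budget produces no log entry, because it never reaches the logger. Reversing the order would make the logger record the rejection as well.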

Structs§

GlobalDeviceBudget
Per-runtime byte-limit decorator.