GlobalDeviceBudget — per-runtime byte-limit decorator.
Wraps a DeviceMemoryResource and enforces a single byte limit
across all allocations that flow through it. Designed to be the
per-runtime singleton replacement for the v0.5 per-provider
GpuMemoryManager (which had no way to enforce a coherent budget
across parallel tests, multiple providers, or Python callers
sharing one physical GPU).
§Accounting model
GlobalDeviceBudget keeps reserved_bytes strictly equal to
inner.bytes_outstanding() at every quiescent moment. This is the
“live + retired-but-not-yet-freed” view from the trait — exactly
the bytes the budget should be guarding.
To keep that invariant under both synchronous and stream-ordered
async inners, every public method is serialized through a single
Mutex<BudgetState> and the inner call is invoked inside the
lock. The lock window is bounded by the inner’s CUDA call, which
is in any case the dominant cost — the budget decorator does not
add hot-path overhead beyond what the inner already imposes.
§Allocate
- Lock state.
- If reserved + bytes > limit: return ResourceError::OutOfBudget { requested, remaining }.
- Optimistically reserve: reserved += bytes.
- Call inner.allocate(bytes, ..) under the lock. The inner’s own bookkeeping moves bytes from “free” to “live”.
- If inner returned Err, roll back the reservation: reserved -= bytes. Forward the error.
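The allocate steps can be sketched roughly as follows. This is a minimal illustration, not the crate's actual code: the trait is reduced to a single allocate(bytes) method, the ResourceError::Inner variant and the MockInner test double are hypothetical, and the inner resource is assumed to live inside the mutex-guarded state so the inner call happens under the lock.

```rust
use std::sync::Mutex;

/// Only `OutOfBudget { requested, remaining }` is named in the docs;
/// `Inner` is an assumed variant for errors forwarded from the inner resource.
#[derive(Debug, PartialEq)]
pub enum ResourceError {
    OutOfBudget { requested: usize, remaining: usize },
    Inner(String),
}

/// Reduced, assumed shape of the trait (the real `allocate` takes more arguments).
pub trait DeviceMemoryResource {
    fn allocate(&mut self, bytes: usize) -> Result<(), ResourceError>;
}

struct BudgetState<R> {
    reserved: usize,
    inner: R, // assumption: the inner resource is owned by the locked state
}

pub struct GlobalDeviceBudget<R> {
    limit: usize,
    state: Mutex<BudgetState<R>>,
}

impl<R: DeviceMemoryResource> GlobalDeviceBudget<R> {
    pub fn new(inner: R, limit: usize) -> Self {
        Self { limit, state: Mutex::new(BudgetState { reserved: 0, inner }) }
    }

    pub fn allocate(&self, bytes: usize) -> Result<(), ResourceError> {
        // Step 1: lock state.
        let mut st = self.state.lock().unwrap();
        // Step 2: `reserved + bytes > limit`, written so the sum cannot overflow.
        let remaining = self.limit.saturating_sub(st.reserved);
        if bytes > remaining {
            return Err(ResourceError::OutOfBudget { requested: bytes, remaining });
        }
        // Step 3: optimistically reserve.
        st.reserved += bytes;
        // Step 4: call the inner resource while still holding the lock.
        let result = st.inner.allocate(bytes);
        // Step 5: roll back the reservation on failure, then forward the error.
        if result.is_err() {
            st.reserved -= bytes;
        }
        result
    }

    pub fn reserved_bytes(&self) -> usize {
        self.state.lock().unwrap().reserved
    }
}

/// Hypothetical test double: fails once its own capacity would be exceeded.
pub struct MockInner {
    capacity: usize,
    live: usize,
}

impl MockInner {
    pub fn with_capacity(capacity: usize) -> Self {
        Self { capacity, live: 0 }
    }
}

impl DeviceMemoryResource for MockInner {
    fn allocate(&mut self, bytes: usize) -> Result<(), ResourceError> {
        if self.live + bytes > self.capacity {
            return Err(ResourceError::Inner("device out of memory".to_string()));
        }
        self.live += bytes;
        Ok(())
    }
}
```

With a 100-byte limit, a second 64-byte request in this sketch is rejected with remaining equal to 36 and never reaches the inner resource. With a generous limit but a failing inner, the reservation is rolled back, so reserved_bytes() stays at the value it had before the failed call.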
§Deallocate / Reap
For both methods we sample inner.bytes_outstanding() before and
after the inner call (under the lock), and decrement reserved
by the observed delta. The pattern handles both backends without
branching:
- Synchronous inner (DirectCudaResource): bytes_outstanding drops by the block’s bytes on deallocate, so the delta is block.bytes. reap_pending is a no-op (delta zero).
- Stream-ordered async inner (AsyncCudaResource): deallocate moves bytes from “live” to “pending”; bytes_outstanding stays the same, so the delta is zero and the budget is not released yet. reap_pending drains the pending bytes whose queued cuMemFreeAsync has completed; bytes_outstanding drops by the drained total and the budget releases that same total.
Because the inner call and the before/after samples happen under
the same lock, no concurrent budget op can perturb the inner’s
bytes_outstanding between our reads — the delta strictly
reflects this call’s effect on the inner.
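The before/after sampling can be illustrated with two stand-in inners. Everything below is a simplified sketch under stated assumptions: SyncInner and AsyncInner are hypothetical mocks (not DirectCudaResource or AsyncCudaResource), deallocate takes a byte count rather than a block, the mock reap_pending treats every queued free as completed, and adopt is a test-only shortcut that seeds reserved from the inner.

```rust
use std::sync::Mutex;

/// Reduced, assumed shape of the trait for this sketch.
pub trait DeviceMemoryResource {
    fn deallocate(&mut self, bytes: usize);
    fn reap_pending(&mut self);
    fn bytes_outstanding(&self) -> usize;
}

/// Mock synchronous inner: memory is freed immediately on `deallocate`.
pub struct SyncInner {
    pub live: usize,
}

impl DeviceMemoryResource for SyncInner {
    fn deallocate(&mut self, bytes: usize) {
        self.live -= bytes;
    }
    fn reap_pending(&mut self) {} // nothing is ever pending
    fn bytes_outstanding(&self) -> usize {
        self.live
    }
}

/// Mock stream-ordered inner: `deallocate` only queues the free.
pub struct AsyncInner {
    pub live: usize,
    pub pending: usize,
}

impl DeviceMemoryResource for AsyncInner {
    fn deallocate(&mut self, bytes: usize) {
        self.live -= bytes;
        self.pending += bytes; // still counted as outstanding
    }
    fn reap_pending(&mut self) {
        self.pending = 0; // assumption: every queued free has completed
    }
    fn bytes_outstanding(&self) -> usize {
        self.live + self.pending
    }
}

struct BudgetState<R> {
    reserved: usize,
    inner: R,
}

pub struct GlobalDeviceBudget<R> {
    state: Mutex<BudgetState<R>>,
}

impl<R: DeviceMemoryResource> GlobalDeviceBudget<R> {
    /// Test-only shortcut (hypothetical): start with `reserved` equal to
    /// whatever the inner already reports as outstanding.
    pub fn adopt(inner: R) -> Self {
        let reserved = inner.bytes_outstanding();
        Self { state: Mutex::new(BudgetState { reserved, inner }) }
    }

    /// Run `op` on the inner under the lock and release the observed delta.
    fn release_by_delta(&self, op: impl FnOnce(&mut R)) {
        let mut st = self.state.lock().unwrap();
        let before = st.inner.bytes_outstanding();
        op(&mut st.inner);
        let after = st.inner.bytes_outstanding();
        st.reserved -= before.saturating_sub(after);
    }

    pub fn deallocate(&self, bytes: usize) {
        self.release_by_delta(|inner| inner.deallocate(bytes));
    }

    pub fn reap_pending(&self) {
        self.release_by_delta(|inner| inner.reap_pending());
    }

    pub fn reserved_bytes(&self) -> usize {
        self.state.lock().unwrap().reserved
    }
}
```

In this sketch the same release_by_delta helper serves both methods and both backends. With the synchronous mock the budget drops on deallocate; with the async mock it drops only once reap_pending drains the queue.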
§Composition
GlobalDeviceBudget is a normal DeviceMemoryResource, so it
plugs into [XlogDeviceRuntime::with_resource] and stacks under
/ over [LoggingResource]. Recommended ordering for production:
GlobalDeviceBudget(LoggingResource(AsyncCudaResource)). With that
ordering the budget’s accounting stays atomic, the logger sees only
the eventually-applied call (so OutOfBudget errors do not get
double-logged), and the underlying allocator is reached last.
Tests can stack either way.
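To make the effect of stacking order concrete, here is a toy model of the two orderings. It is a sketch only: the one-method trait, the FakeDevice stand-in for AsyncCudaResource, the public fields, the String error type, and the log format are all assumptions made for illustration, and locking is omitted for brevity.

```rust
/// Reduced, assumed shape of the trait for this sketch.
pub trait DeviceMemoryResource {
    fn allocate(&mut self, bytes: usize) -> Result<(), String>;
}

/// Hypothetical stand-in for the real allocator; always succeeds.
pub struct FakeDevice;

impl DeviceMemoryResource for FakeDevice {
    fn allocate(&mut self, _bytes: usize) -> Result<(), String> {
        Ok(())
    }
}

/// Toy logger: records every call that reaches it, with its outcome.
pub struct LoggingResource<R> {
    pub inner: R,
    pub log: Vec<String>,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for LoggingResource<R> {
    fn allocate(&mut self, bytes: usize) -> Result<(), String> {
        let result = self.inner.allocate(bytes);
        let outcome = if result.is_ok() { "ok" } else { "err" };
        self.log.push(format!("allocate({}) -> {}", bytes, outcome));
        result
    }
}

/// Toy budget: same reserve / roll-back logic, without the mutex.
pub struct GlobalDeviceBudget<R> {
    pub inner: R,
    pub limit: usize,
    pub reserved: usize,
}

impl<R: DeviceMemoryResource> DeviceMemoryResource for GlobalDeviceBudget<R> {
    fn allocate(&mut self, bytes: usize) -> Result<(), String> {
        if bytes > self.limit.saturating_sub(self.reserved) {
            return Err("OutOfBudget".to_string()); // rejected before the inner is called
        }
        self.reserved += bytes;
        let result = self.inner.allocate(bytes);
        if result.is_err() {
            self.reserved -= bytes;
        }
        result
    }
}
```

Because every layer implements the same trait, either nesting type-checks. In this model, with the budget outermost a rejected request never reaches the logger, so its log holds only admitted calls; with the logger outermost, the rejection is recorded as well.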
Structs§
- GlobalDeviceBudget: Per-runtime byte-limit decorator.