All Posts

Per-User Token Budgets: The Backend Layer That Survives Billing Changes

A backend budget-enforcement layer for AI products: meter per-user token spend, step models down, and degrade gracefully so a vendor credit cap never takes your product offline.

10 min read

A per-user token budget is a small backend control plane that decides, before every model call, how much a given user is allowed to spend and what should happen when they reach the ceiling. It is the layer most AI products skip — and the one that determines whether a vendor billing change is a config edit or an outage. As of today, June 15, Anthropic's Claude Agent SDK draws from a separate metered credit pool that does not roll over, and when the pool empties, automated requests stop. I wrote last week about why that change forces a rearchitecture, and before that about the tiered-memory design that keeps per-call context flat. This is the third piece, and the most concrete: the budget-enforcement layer itself — the backend code that meters spend per user, steps the model down as the budget depletes, and degrades gracefully instead of going dark. I run a version of it in front of chays.ai and Freya Coach.

Why this matters now

The June 15 cap did not change what your agent costs. It changed who notices. Until today, an agent running on a flat-rate subscription could re-send its entire context on every turn and the plan absorbed it. Now that spend draws against a metered pool with no rollover and no automatic fallback, so the cost that was always there becomes a hard ceiling that takes the product offline when it is hit.

The reason to own a budget layer, rather than just watch the meter, is that token cost is not something you can predict per user. The Stanford Digital Economy Lab's study of agentic token consumption found that agentic tasks burn on the order of a thousand times more tokens than ordinary chat, that the input side rather than the output drives the bill, and — the part that should worry any CTO — that the same task run twice can differ by up to 30x in total tokens, while the models themselves systematically underestimate their own usage. Translation: you cannot cap a user by guessing, and you cannot trust the model to self-limit. A heavy user on an unbounded loop can quietly consume thirty times what a light user spends on the identical feature, and nothing in the stack will warn you until the invoice or the credit cap does. A flat subscription hid that variance. The metered pool makes it legible and gives you a deadline to put a controller between your users and the meter.

What a budget layer actually is

A per-user token budget is the rule that turns "spend whatever it takes" into "spend up to X, then do Y" — enforced in your backend, not hoped for in a prompt. It sits as middleware around the model call. Before the call, it checks how much the user has spent this period and decides how to behave. After the call, it records what was actually consumed. It needs exactly three things: a ledger that tracks spend per user per billing period, a pre-flight estimate of what the next call will cost, and a policy that says what to do at each threshold.

The pre-flight estimate is the part teams assume they have to approximate. They do not. Anthropic's token counting endpoint returns an exact input-token count for the same payload you are about to send — system prompt, tools, messages, images — without running inference and without being billed for it. That means you can know the cost of a call before you commit to it, which is what makes a hard cap enforceable instead of advisory.

The four-state degradation model

The instinct is to build budgeting as an on/off switch: under budget, run; over budget, stop. That is the failure mode, not the design. Model it as four states instead, because most of your savings happen before anyone is blocked:

  • Full — comfortably under the soft cap. Best model, full context, every tool available.
  • Reduced — approaching the soft cap. Step the model down one tier and trim optional context. The user notices nothing but a small latency change.
  • Minimal — at the soft cap. Cheapest capable model, summarized context only, expensive tools disabled.
  • Blocked — hard cap reached. No model call at all. Return a deterministic, helpful response that names the limit and offers an upgrade path.

The model step-down is a real lever, not a rounding error. At Anthropic's list prices, Claude Opus 4.8 is $5 per million input tokens and $25 output; Sonnet 4.6 is $3 and $15; Haiku 4.5 is $1 and $5. Dropping a verbose, low-stakes turn from Opus to Haiku cuts that call's input and output cost five-fold without touching context size at all. Pair the step-down with prompt caching, where a cache read costs a tenth of standard input, and the Reduced and Minimal states do most of the work while Blocked stays the rare exception it should be.

Here is the wrapper that enforces it — one function that picks the model, holds the cap, and records the spend:

// One wrapper decides the model, enforces the cap, and records the spend.
async function runWithBudget(user: User, request: AgentRequest) {
  const remaining = await ledger.remainingTokens(user.id)        // this billing period
  const estimate = await client.messages.countTokens(request)    // free, exact, pre-flight

  const state = classify(remaining, estimate.input_tokens)
  if (state === "blocked") {
    return gracefulLimitResponse(user)   // deterministic, no model call, offers an upgrade
  }

  const policy = POLICIES[state]         // model + context budget per state
  const response = await client.messages.create({
    ...request,
    model: policy.model,                 // opus -> sonnet -> haiku as budget depletes
    system: policy.trimContext(request.system),
  })

  // Record real usage. Cached input bills at ~0.1x, so count it on its own line.
  const { input_tokens, output_tokens, cache_read_input_tokens } = response.usage
  await ledger.record(user.id, { input_tokens, output_tokens, cache_read_input_tokens })

  // Bill the customer for what they actually used.
  await stripe.billing.meterEvents.create({
    event_name: "agent_tokens",
    payload: { stripe_customer_id: user.stripeId, value: String(input_tokens + output_tokens) },
  })

  return response
}

function classify(remaining: number, needed: number): BudgetState {
  if (remaining <= 0 || remaining < needed) return "blocked"  // can't afford this call
  if (remaining < SOFT_CAP * 0.1) return "minimal"
  if (remaining < SOFT_CAP * 0.3) return "reduced"
  return "full"
}

What this buys a founder or CTO

A budget layer answers three questions you otherwise cannot. What does my most expensive user actually cost me — the ledger has it. What is my margin per plan tier — budget against price, per user, not averaged across everyone. What happens at the cap — a defined product experience, written by you, instead of an outage written by your vendor's default. It also inverts the billing change: the cap becomes yours, set where your margin needs it, with a downgrade path you designed, rather than Anthropic's flat "requests stop." And the same instrumentation makes usage-based pricing possible. The Stripe meter event that enforces the budget can also bill the customer, so heavy users pay for what they use instead of being quietly subsidized by the light ones.

The five-step rollout

  1. Meter first, enforce later. Ship the ledger and count_tokens logging in shadow mode and watch real cost-per-active-user for a week before you set a single threshold. You will be wrong about who your expensive users are.
  2. Set budgets from data, not vibes. Use the shadow numbers to set soft and hard caps per plan tier, with a margin you can defend to a board.
  3. Build the state machine, not a switch. Implement Full, Reduced, and Minimal so the overwhelming majority of savings land before anyone is ever blocked.
  4. Make Blocked a real product surface. Write the limit-reached response with the same care as your onboarding — it is the exact moment a paying user decides to upgrade or churn.
  5. Wire the meter to billing. Emit the same usage to your billing provider so the budget that protects your margin also bills the user.

My perspective

I will say the contrarian part plainly: the most important AI feature you ship this quarter is one your users will never see. It is the budget layer. I learned that the expensive way on Freya Coach, where the original conversation loop was economical right up until per-call pricing made it not, and I had to rebuild the memory pattern after the fact. When I built chays.ai, a CTO agent that runs across many clients' projects, I put the budget layer in from day one — because that workload is exactly what Stanford describes, stochastic and input-heavy and impossible to predict per call, and the only way to run it profitably is to cap it per user and degrade, not to hope the model behaves. Choosing a cheaper model feels like the fix, but it is a delay. The cap will find you again at the next billing change, and the one after that. The budget layer is the thing that makes every future change a config edit instead of a fire drill.

Recommended action this quarter

This week, put count_tokens and a usage ledger in shadow mode on one production loop and enforce nothing — just watch. This month, read the shadow data, set soft and hard caps per plan, and ship the Reduced and Minimal states; the model step-down alone usually pays for the work. By the end of the quarter, ship the Blocked state as a real upgrade surface and wire usage to billing. If you want the layer reviewed before the next vendor change tests it for you, a fractional CTO engagement is the fastest way to get a second set of eyes on where your tokens are actually going.

Will the next billing change be a config edit or an outage?

If you cannot say what your heaviest user costs you, that is the first thing to fix. Book a time — thirty minutes is usually enough to map where your tokens go and whether you need a budget layer or a redesign.