All Posts

Anthropic's Agent SDK Billing Splits on June 15 — Rearchitect Before the Cap Hits

Anthropic moves the Claude Agent SDK to a separate metered credit pool on June 15, 2026 — a five-step rearchitecture I'm running on chays.ai and Freya Coach before the cap kicks in.

9 min read
AI

Anthropic announced on May 14, 2026 that the Claude Agent SDK, the claude -p CLI, GitHub Actions, and any third-party app authenticating through a Claude subscription will move off the shared subscription pool and into a separate metered credit pool on June 15, 2026. Pro plans get $20 of monthly Agent SDK credit, Max 5x gets $100, Max 20x gets $200, billed at standard API rates with no rollover. Interactive use of claude.ai, the Claude Code CLI, and Cowork stays inside the existing subscription. If you are running a production AI loop on a Claude subscription instead of a metered API key, that loop is about to hit a hard ceiling. The fix is not to upgrade the plan — it is to rearchitect the agent loop so re-sent context stops accounting for the majority of the bill.

Why this matters now

The billing change is one week away. Most AI product teams I have spoken to in the last fortnight have not modeled what their current agent loop will cost under the new accounting, because their previous bills were absorbed inside the flat-rate subscription. That changes overnight on June 15. According to research from the Stanford Digital Economy Lab, surfaced in Beam.ai's breakdown of the change, re-sent context accounts for 62% of total agent inference bills — meaning the majority of what teams are about to start paying for is content they have already paid to compute once. The teams that miss this date will not pay a small fine. They will pay a structural one — every agent run will route 62 cents of every billed dollar through context the loop should have cached.

There is also a separate market signal. Stripe shipped AI token billing to general preview in March 2026, with a 30% markup over the upstream LLM rate and per-call metering of OpenAI, Anthropic, and Google. Once the buy-side of token economics standardizes around metered consumption, the sell-side has to too. Anthropic is just first.

What the change actually meters

Anthropic's policy draws a hard line between programmatic and interactive usage. Interactive — claude.ai chats, the Claude Code CLI you type into, Cowork — keeps drawing from the existing subscription pool. Programmatic — the Agent SDK, claude -p, GitHub Actions, and third-party apps that authenticate through your Claude subscription — moves to the new monthly credit pool. Credits are per-user, refresh monthly with the billing cycle, and do not roll over. A one-time opt-in is required.

The unit economics are unchanged. Each call is billed at the standard API rate for the model you invoke. The Agent SDK overview is explicit that this is "standard API rates" — there is no subsidy or discount. What changed is visibility. Before June 15, an agent that re-sent 50,000 tokens of context on every loop was free, because your monthly subscription absorbed it. After June 15, that same agent draws against your credit pool until the pool runs out, at which point your agent stops.

The Anthropic prompt-caching documentation describes a five-minute and one-hour cache TTL with a write cost of 1.25x and 2.0x standard input respectively, and a read cost of 0.1x — a 90% discount on cached input. This is the lever. Most agent loops today re-send the same system prompt, the same tool definitions, and most of the conversation history on every call. With caching, the first call pays a small write premium, and every subsequent call inside the TTL pays a tenth of input cost. The architectural change is not "add caching" — it is "make the loop cache-shaped."

The business implication

There are two ways to read this change from the founder's seat. The first is "Anthropic is making agents more expensive." That read is wrong. The bill was the same; you just did not see it. The second is "Anthropic is putting a meter on what was always free, and the meter is going to surface every loop that wastes tokens." That read is correct, and it has a single business consequence: every team running a Claude-backed agent has roughly one quarter to make their loop cache-shaped or watch their cost-per-user climb past the line where their pricing still works.

I have walked through this math twice in the last fortnight with founders running RAG-style agents on subscription credentials. In both cases the loop was re-sending six-figure token counts per session and paying for it inside a flat-rate plan that hid the bleeding. Once the cap binds, those products stop scaling — not because Claude got more expensive, but because the architecture was already wrong and the bill was disguised.

A five-step rearchitecture playbook

This is the sequence I am running on chays.ai, my CTO automation tooling with tiered memory, and on Freya Coach, the AI pregnancy coaching platform I shipped last year. It works for any Claude-backed agent loop.

  1. Audit context per call. Log the token count of every system prompt, every tool definition, every conversation turn for one full week of production traffic. The agents I have audited carry 30,000 to 200,000 tokens of repeated content per call. You cannot fix what you have not measured.
  2. Separate the agent from the chat. If the same Claude subscription powers both a human chat surface and an automated loop, split them. Move the loop onto a dedicated API key so its consumption draws from the metered pool deliberately, not from whatever subscription happens to be wired in. This also gives you per-loop billing telemetry, which the subscription pool obscures.
  3. Make the loop cache-shaped. Pin the parts of the prompt that do not change per call — system prompt, tool definitions, retrieved context that survives several turns — at the front of the message array, and mark them as cacheable using Anthropic's cache_control blocks. The five-minute TTL is enough for most interactive loops; the one-hour TTL pays for itself if you are running long agent sessions where the same retrieval context spans dozens of turns.
  4. Cap and instrument. Set a per-user, per-day token budget enforced by your backend, not by Anthropic's pool exhaustion. The pool running out is a worst-case failure mode for a paying customer. A budgeted backend that degrades gracefully — falling back to Haiku, summarizing context, or returning a "session limit reached" message — preserves the product experience. Instrument every call with model name, cached vs uncached input count, output count, and cost so a single dashboard shows you the actual per-user economics.
  5. Decide where the loop should live. If your product is consumer-facing and your unit economics depend on tight cost-per-active-user, the loop probably needs to live behind an API key billed at standard API rates rather than on a subscription pool. Stripe's AI token billing infrastructure makes the pass-through clean — meter the call, mark it up, charge the customer. If your product is internal tooling and the loop runs once per user per day, the subscription pool is fine. The point is to choose, not default.

My perspective

The architectural lesson here is not specific to Anthropic. It is that any product whose unit economics are quietly subsidized by a flat-rate vendor plan is one billing change away from being a different product. I shipped Freya Coach on a subscription-style assumption two years ago and had to rebuild the conversation memory pattern when per-call pricing made the original loop uneconomic. On chays.ai, where the explicit value proposition is persistent memory across projects, the entire system is designed around cache-shaped recall — long-lived context lives in tiered memory layers that are summarized, not re-sent, on every call. That design choice is paying off this month in a way that surprises no one who has done this work before.

The contrarian thing I will say plainly is this: if you treat the June 15 change as a billing problem, you will solve it with a credit-card top-up and a Max plan. If you treat it as an architecture problem, you will fix the loop, and the loop fix will outlive the billing change. The teams that learn this in the next two quarters will ship AI products with defensible unit economics. The teams that do not will spend the second half of 2026 explaining to investors why their gross margin slipped six points.

Recommended action this quarter

This week, audit every Claude-backed loop you ship for context bloat and cache-shape. This week or next, opt in to the Agent SDK credit on every plan that touches a production agent, and split the credentials so the human and programmatic surfaces stop sharing a pool. By end of quarter, ship the rearchitecture: cache-shaped prompts, backend-enforced budgets, dashboards that show per-user token economics in real time. After that, the architecture is durable enough that the next vendor billing change — and there will be a next one — becomes a configuration shift, not a rebuild. If you want a second opinion as you walk that audit, a fractional CTO engagement is the fastest way to get one.

Running a Claude-backed product into the June 15 cap?

If you are not sure whether your agent loop is cache-shaped enough to survive the billing change, book a time. Thirty minutes is usually enough to see whether the fix is a config tweak or a full rebuild.