The Memory Architecture Behind chays.ai: Tiered State, Not a Growing Prompt

The tiered memory architecture behind chays.ai — core, recall, and archival layers that summarize context instead of re-sending it, and the token math that makes it survive the Agent SDK cap.

Tiered memory is the difference between an agent that remembers and an agent that re-reads. Most production agents have no memory architecture — they have a prompt that grows until it hits a wall. chays.ai, my CTO-automation tooling, is built the other way: long-lived context lives in distinct memory tiers that are summarized and recalled on demand, not re-sent on every call. That was an engineering preference a year ago. As of June 15, when Anthropic moves the Claude Agent SDK to a metered credit pool — I wrote up that change last week — it is the difference between a product that scales and one that stalls. This post is the architecture under that playbook: how the tiers are shaped, what they cost, and the four-step pattern to add them to an agent you already run.

Why this matters now

The cap binds on June 15 — two days from this writing. Until now, an agent running on a Claude subscription paid nothing visible for re-sending its context, because the flat-rate plan absorbed it. After June 15, every re-sent token draws against a metered pool that does not roll over, and when the pool empties the agent stops. The cost did not change; the visibility did.

Research from the Stanford Digital Economy Lab puts the problem in sharp relief. An agent reads the task, gets a response, then re-reads the original prompt plus that response before its next action, then re-reads all of that plus the new response before the action after — a context snowball in which a ten-turn loop sends roughly fifty times the tokens of a single call. The input side, not the output, is where agent bills are made. A growing prompt is not a memory system. It is a bill that compounds every turn, and the meter is about to make it legible.
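
The snowball is easy to check with a toy model. The sketch below assumes every turn adds one equal-sized chunk to the context, which is a simplification since real prompts and responses vary in size, and counts how many chunks a loop sends in total. `cumulativeChunksSent` is an illustrative name, not part of any SDK.

```typescript
// Toy model of the context snowball: call k re-sends the k chunks
// accumulated so far (the original prompt plus every prior response).
// Assumption: every chunk is the same size.
function cumulativeChunksSent(turns: number): number {
  let total = 0
  for (let call = 1; call <= turns; call++) {
    total += call // call k carries k chunks of context
  }
  return total
}

// A single call sends 1 chunk; a ten-turn loop sends 1 + 2 + ... + 10.
const snowballRatio = cumulativeChunksSent(10) / cumulativeChunksSent(1)
console.log(snowballRatio) // 55
```

Under that assumption the total grows as n(n+1)/2, so doubling the number of turns roughly quadruples the input bill.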

What “tiered memory” actually means

The pattern comes from the MemGPT paper out of UC Berkeley, which framed the LLM as an operating system managing a limited context window the way an OS manages limited RAM. Letta, the framework that grew out of that work, implements it as three tiers, and chays.ai uses the same shape:

  • Core memory lives in-context on every call — it is the agent’s RAM. Identity, the current project’s invariants, the user’s standing preferences. Small, always present, never retrieved.
  • Recall memory is the conversation history, searchable on demand rather than replayed wholesale. The agent queries it when a turn needs prior detail, instead of carrying every prior turn forward.
  • Archival memory is an external vector store the agent searches explicitly — long documents, past architecture decisions, anything that does not fit in context and is not needed every turn.

The agent decides what to promote into core memory and what to flush to archival, using tool calls — the model is its own memory controller. The consequence for cost is that the only tokens riding on every call are core memory and the system prompt. Everything else is fetched when needed and summarized when stored, so per-call context stays roughly flat instead of growing with session length.
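
To make the controller idea concrete, here is a minimal sketch of the promote and flush operations as plain methods that a tool-call handler could invoke. Every name here (`TieredMemory`, `promote`, `flush`, the character budget) is hypothetical and comes from neither Letta nor chays.ai, and the archival tier is a plain array standing in for a vector store.

```typescript
// Hypothetical shapes for the tiers; illustrative names only.
interface MemoryItem {
  key: string
  text: string
}

class TieredMemory {
  // Core: in-context on every call, so it is capped by a size budget.
  private core = new Map<string, string>()
  // Archival: stands in for an external vector store.
  private archival: MemoryItem[] = []
  private readonly coreBudgetChars: number

  constructor(coreBudgetChars: number) {
    this.coreBudgetChars = coreBudgetChars
  }

  private coreSize(): number {
    let size = 0
    this.core.forEach((text) => { size += text.length })
    return size
  }

  // Tool the model calls to pin a fact into core memory.
  // Returns false when the item would exceed the core budget.
  promote(item: MemoryItem): boolean {
    const existing = this.core.get(item.key)?.length ?? 0
    if (this.coreSize() - existing + item.text.length > this.coreBudgetChars) {
      return false
    }
    this.core.set(item.key, item.text)
    return true
  }

  // Tool the model calls to move a core item out to archival.
  flush(key: string): boolean {
    const text = this.core.get(key)
    if (text === undefined) return false
    this.core.delete(key)
    this.archival.push({ key, text })
    return true
  }

  // Only this string rides on every call.
  serializeCore(): string {
    const lines: string[] = []
    this.core.forEach((text, key) => lines.push(`${key}: ${text}`))
    return lines.join('\n')
  }

  archivalCount(): number {
    return this.archival.length
  }
}

const memory = new TieredMemory(200)
memory.promote({ key: 'identity', text: 'CTO agent for Acme' })
memory.promote({ key: 'adr-12', text: 'Chose Postgres over DynamoDB' })
memory.flush('adr-12') // no longer paid for on every call
console.log(memory.serializeCore()) // identity: CTO agent for Acme
```

The hard budget on core is the design choice that matters: it forces the promote-or-flush decision instead of letting core quietly grow into another snowball.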

That stable front-of-prompt is also what makes the loop cache-shaped. Anthropic’s prompt caching discounts cached input by 90%: a cache read costs 0.1x standard input, against a write premium of 1.25x for the five-minute TTL. Pinning the tool definitions, the system prompt, and core memory at the front of the request therefore means the expensive, unchanging part of every call is written once and read back cheaply, as long as successive calls arrive within the TTL. Here is the message-assembly shape chays.ai uses:

// Assemble a cache-shaped request: stable context first, volatile last.
async function buildAgentRequest(session: Session, userTurn: string) {
  const stable = [
    { type: 'text', text: SYSTEM_PROMPT },                 // never changes
    { type: 'text', text: serializeCoreMemory(session) },  // small, slow-changing
  ]

  return {
    model: 'claude-sonnet-4-6',
    // Mark the end of the stable prefix as a cache breakpoint.
    system: stable.map((block, i) => ({
      ...block,
      ...(i === stable.length - 1
        ? { cache_control: { type: 'ephemeral' } }
        : {}),
    })),
    messages: [
      // Recall is searched, not replayed: pull only the turns this one needs.
      ...(await recall.search(session.id, userTurn, { limit: 6 })),
      { role: 'user', content: userTurn },
    ],
    max_tokens: 1024,
  }
}
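
The `recall` object in that snippet is not shown, so here is one hypothetical in-memory stand-in with the same `search(sessionId, query, { limit })` shape. It is a sketch under loud assumptions: scoring is naive keyword overlap, where a real store would more likely use full-text or embedding search, and `Turn`, `rankTurns`, and `InMemoryRecall` are names invented for illustration.

```typescript
type Turn = { role: 'user' | 'assistant'; content: string }

// Lowercase, split on non-alphanumerics, and drop very short words.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 2)
}

// Keep the `limit` best-matching turns, returned in chronological order.
function rankTurns(turns: Turn[], query: string, limit: number): Turn[] {
  const queryWords = new Set(tokenize(query))
  return turns
    .map((turn, index) => ({
      turn,
      index,
      score: tokenize(turn.content).filter((w) => queryWords.has(w)).length,
    }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score || b.index - a.index) // best first, then newest
    .slice(0, limit)
    .sort((a, b) => a.index - b.index) // restore conversation order
    .map((hit) => hit.turn)
}

class InMemoryRecall {
  private turnsBySession = new Map<string, Turn[]>()

  append(sessionId: string, turn: Turn): void {
    const turns = this.turnsBySession.get(sessionId) ?? []
    turns.push(turn)
    this.turnsBySession.set(sessionId, turns)
  }

  async search(sessionId: string, query: string, opts: { limit: number }): Promise<Turn[]> {
    return rankTurns(this.turnsBySession.get(sessionId) ?? [], query, opts.limit)
  }
}
```

Whatever the scoring method, the property that matters is the hard `limit`: the number of historical turns per call is bounded by retrieval, not by session length. Retrieved turns may not alternate roles cleanly, so a real implementation might fold them into a single summary block instead.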

The token math

Put numbers on it. Take a support agent with a 30,000-token system-plus-tools preamble and a session that runs twenty turns. Naively, that preamble rides on all twenty calls: 600,000 input tokens for the unchanging prefix alone. On Claude Sonnet 4.6 at $3 per million input tokens, that prefix costs about $1.80 per session, before the conversation-history snowball on top.

Cache-shape the same loop and the first call writes the prefix at 1.25x (about $0.11 for 30k tokens) while the remaining nineteen calls read it at 0.1x (about $0.009 each, or about $0.17 in total). The comparison, per twenty-turn session:

  • Unchanging 30k-token prefix: naive, about $1.80; cache-shaped, about $0.28. Roughly an 84% cut on that line item.
  • Conversation history: naive, the per-call size grows with every turn, unbounded; tiered, it is searched on demand and stays roughly flat.
  • Behavior at the credit cap: naive, the agent stops; budgeted and tiered, it degrades gracefully.

The numbers move with your model and your traffic, but the shape does not: the unchanging part of the prompt should be paid for once, and the changing part should be bounded by retrieval, not accumulation.
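
The prefix arithmetic can be reproduced in a few lines, which makes it easy to swap in your own numbers. This sketch uses the prices quoted above ($3 per million input tokens, 1.25x cache write, 0.1x cache read) and assumes every call after the first hits the cache, meaning successive calls arrive inside the five-minute TTL. `prefixCostUsd` is an illustrative helper, not an SDK function.

```typescript
// Prices taken from the text above; adjust for your model.
const INPUT_USD_PER_MTOK = 3
const CACHE_WRITE_MULTIPLIER = 1.25
const CACHE_READ_MULTIPLIER = 0.1

function prefixCostUsd(prefixTokens: number, turns: number) {
  const perCall = (prefixTokens / 1_000_000) * INPUT_USD_PER_MTOK
  const naive = perCall * turns                                     // prefix re-sent every call
  const cacheWrite = perCall * CACHE_WRITE_MULTIPLIER               // first call only
  const cacheReads = perCall * CACHE_READ_MULTIPLIER * (turns - 1)  // every later call
  const cached = cacheWrite + cacheReads
  return { naive, cacheWrite, cacheReads, cached, savings: 1 - cached / naive }
}

const cost = prefixCostUsd(30_000, 20)
console.log(cost.naive.toFixed(2))           // 1.80
console.log(cost.cacheWrite.toFixed(2))      // 0.11
console.log(cost.cacheReads.toFixed(2))      // 0.17
console.log(cost.cached.toFixed(2))          // 0.28
console.log((cost.savings * 100).toFixed(0)) // 84
```

If calls are spaced further apart than the TTL, each one pays the write premium again, so the saving depends on traffic shape as much as on prompt shape.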

A four-step pattern for adding tiered memory

  1. Name your tiers. Decide what must be in-context every call (core), what can be searched (recall), and what lives in a vector store (archival). Most agents discover that the bulk of what they carry forward is recall, not core.
  2. Pin and cache the stable prefix. Put the tool definitions, system prompt, and core memory at the front of the request and mark the end of that prefix cacheable. This is the single highest-leverage change, and it is a config change, not a rewrite.
  3. Replace replay with retrieval. Stop appending the full history to every call. Summarize completed turns into recall and search it for the few turns the current step actually needs.
  4. Budget and instrument per user. Enforce a per-user token budget in your backend so you control the failure mode, and log cached-versus-uncached input on every call so one dashboard shows real cost-per-active-user. This is step four of the playbook I published last week, made concrete.
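
Step four can be sketched as a small guard that sits in front of the model call. The `Usage` fields mirror the usage counters reported on Anthropic Messages API responses; everything else (`TokenBudget`, the flat per-user limit, counting raw tokens rather than weighting them by price) is an assumption made for illustration.

```typescript
// Usage counters as reported on a Messages API response.
interface Usage {
  input_tokens: number
  cache_creation_input_tokens: number
  cache_read_input_tokens: number
  output_tokens: number
}

// Hypothetical per-user budget guard; names and limits are illustrative.
class TokenBudget {
  private spentByUser = new Map<string, number>()
  private readonly limitPerUser: number

  constructor(limitPerUser: number) {
    this.limitPerUser = limitPerUser
  }

  remaining(userId: string): number {
    return this.limitPerUser - (this.spentByUser.get(userId) ?? 0)
  }

  // Check before calling the model, so the backend picks the failure mode.
  canSpend(userId: string, estimatedTokens: number): boolean {
    return this.remaining(userId) >= estimatedTokens
  }

  // Record actual usage after the call and return one log line's worth of data.
  record(userId: string, usage: Usage) {
    const uncached = usage.input_tokens + usage.cache_creation_input_tokens
    const cached = usage.cache_read_input_tokens
    const total = uncached + cached + usage.output_tokens
    this.spentByUser.set(userId, (this.spentByUser.get(userId) ?? 0) + total)
    return { userId, uncached, cached, cachedShare: cached / Math.max(1, uncached + cached) }
  }
}

const budget = new TokenBudget(1_000)
const logLine = budget.record('user-1', {
  input_tokens: 100,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 300,
  output_tokens: 50,
})
console.log(logLine.cachedShare)        // 0.75
console.log(budget.remaining('user-1')) // 550
```

When `canSpend` returns false, the backend can choose what happens next, such as a shorter answer, a smaller model, or a clear message to the user, instead of the loop simply stopping at the cap.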

My perspective

I will say the unpopular part plainly: most teams do not need a bigger context window or a cheaper model. They need to stop re-sending things they have already paid to process. I built chays.ai around persistent memory across projects because the whole value proposition — a CTO agent that remembers last quarter’s architecture decision — is impossible if every call re-reads the entire project history. That constraint forced a tiered design from day one. When I shipped Freya Coach two years ago I did not have that discipline, and I had to rebuild its conversation-memory pattern when per-call pricing made the original loop uneconomic. The lesson cost me a rebuild then. The June 15 change is about to teach the same lesson to every team that skipped it — except now the tuition is a hard credit cap instead of a line on an invoice you could ignore.

Recommended action this quarter

This week, instrument one production loop and measure how many of its per-call tokens are unchanging — that number is your caching opportunity, and it is usually larger than the team guesses. This month, pin and cache the stable prefix and move conversation history behind retrieval; both are reversible, low-risk changes. By end of quarter, draw the tier boundaries explicitly and put a per-user budget in front of the loop so the cap never surprises a paying customer. If you want a second set of eyes on where your agent’s tokens are actually going, a fractional CTO engagement is the fastest way to get a memory architecture reviewed before the meter does it for you.

Is your agent’s memory an architecture or just a growing prompt?

If you are not sure how much of your agent’s bill is re-sent context, book a time. Thirty minutes is usually enough to tell whether you need a config change or a memory redesign.