Memory & State Management in LLM Agents
LLM agents are only as capable as their memory architecture. This guide breaks down the four memory tiers — in-context, external retrieval, episodic, and procedural — with implementation patterns and trade-off analysis for production systems.
A stateless LLM can answer a question. An agent needs to pursue a goal — across multiple steps, possibly multiple sessions, and often alongside other agents. That requires memory.
Memory is not one thing in agent systems. It is a stack of complementary mechanisms, each with a different scope, cost, and retrieval latency. Choosing the wrong tier for a use-case is one of the most common sources of unreliability in production agents. This guide maps the full landscape, shows when to use each tier, and provides concrete implementation patterns you can apply today.
Important
This article builds directly on the architecture concepts introduced in What Is Agent Engineering? If you haven't read that piece, start there — it establishes the perception → reasoning → action loop that memory plugs into.
Why Memory Matters More Than Context Window Size
A common reflex when an agent runs out of context is to reach for a model with a larger window. GPT-5 supports up to 128,000 tokens in its baseline tier; Claude Opus 4.6 and Claude Sonnet 4.6 each support 200,000 tokens; Google's Gemini 2.5 Pro reaches 2,000,000 tokens. These are genuinely large, but they are still finite, they are expensive (cost scales linearly with tokens), and they are ephemeral — every new conversation starts from zero.
The real question is not "how much context can the model see right now?" but "what does the agent need to know, and for how long does it need to know it?" Those are architectural questions that context window size cannot answer.
Academic research has formalized this intuition. The Cognitive Architectures for Language Agents (CoALA) framework (Sumers et al., 2023) proposes a four-tier memory taxonomy directly analogous to human cognitive science: working memory, episodic memory, semantic memory, and procedural memory. That taxonomy maps cleanly onto the four implementation tiers below.
The Four Memory Tiers
Tier 1 — In-Context Memory (Working Memory)
What it is: Everything currently loaded into the model's active context window: the system prompt, conversation history, tool results, and any injected documents.
Characteristics:
- Zero retrieval latency — the model attends to it directly
- Hard upper bound (the model's context limit)
- Completely lost at the end of a session unless explicitly persisted elsewhere
- Costs scale with token count at inference time
When to use it:
- Short-lived tasks that fit within a single session
- Information that must be coherently attended to together (e.g., a code file being reviewed end-to-end)
- Temporary scratchpad for intermediate reasoning steps
Implementation pattern — sliding-window trimming:

```python
MAX_HISTORY_TOKENS = 4_000  # reserve the rest of the window for tools + output

def trim_history(messages: list[dict], tokenizer) -> list[dict]:
    """Keep the system prompt; evict the oldest turns until under budget.

    A production variant would summarize evicted turns instead of
    dropping them outright.
    """
    total = sum(len(tokenizer.encode(m["content"])) for m in messages)
    while total > MAX_HISTORY_TOKENS and len(messages) > 2:
        # Remove the oldest non-system message (index 0 is the system prompt)
        messages.pop(1)
        total = sum(len(tokenizer.encode(m["content"])) for m in messages)
    return messages
```
The limitation here is information loss. Once a message is evicted, the agent cannot recall it. That eviction problem motivates the next two tiers.
Tier 2 — External Memory (Retrieval / Vector Store)
What it is: Information stored outside the model, indexed for semantic retrieval. At query time, the agent converts a query to an embedding, searches the index for the most relevant chunks, and injects them into the active context.
This is the foundation of Retrieval-Augmented Generation (RAG), introduced in Lewis et al. (2020) and now the standard architecture for knowledge-grounded agents.
Characteristics:
- Effectively unlimited storage
- Retrieval latency of roughly 50–200 ms for a managed vector database
- Quality is bounded by chunking strategy and embedding model quality
- Information can be updated without retraining the base model
When to use it:
- Large or frequently updated knowledge bases (product docs, legal corpora, codebases)
- User-specific facts that must persist across sessions (preferences, prior decisions)
- Any information that would overflow the context window if loaded in full
Retrieval as a tool call:
Rather than treating retrieval as a preprocessing step that happens before the model is called, the modern pattern — consistent with tool-use architecture — is to expose retrieval as an explicit tool the agent can invoke:
```python
tools = [
    {
        "name": "search_knowledge_base",
        "description": "Retrieve relevant documentation chunks for a given query.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "A focused search query. Be specific."
                },
                "top_k": {
                    "type": "integer",
                    "description": "Number of chunks to return. Default 5.",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
]
```
This keeps retrieval visible in the agent's reasoning trace, which is critical for debugging and evaluation — a topic covered in a later piece in this series.
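To make the dispatch side of this pattern concrete, here is a minimal sketch of routing a model-emitted tool call to a handler. The names `vector_search`, `TOOL_HANDLERS`, and `dispatch_tool_call` are illustrative, and the search itself is stubbed with keyword overlap so the example is self-contained; a real implementation would query a vector store:

```python
# Sketch: the dispatch side of retrieval-as-a-tool. `vector_search` is a
# hypothetical stand-in for a real vector-store query, stubbed with word
# overlap over a toy corpus so this runs stand-alone.

def vector_search(query: str, top_k: int = 5) -> list[str]:
    """Stub: rank a toy corpus by word overlap with the query."""
    corpus = [
        "pgvector stores embeddings inside PostgreSQL",
        "Pinecone is a managed vector database",
        "Episodic memory is the agent's durable event log",
    ]
    terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda doc: -len(terms & set(doc.lower().split())))
    return scored[:top_k]

TOOL_HANDLERS = {
    "search_knowledge_base": lambda args: vector_search(
        args["query"], args.get("top_k", 5)
    ),
}

def dispatch_tool_call(name: str, arguments: dict) -> list[str]:
    """Route a model-emitted tool call to its handler; results go back into context."""
    if name not in TOOL_HANDLERS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOL_HANDLERS[name](arguments)
```

Because the call and its arguments appear in the transcript, retrieval problems (vague queries, a too-small `top_k`) are visible when you replay a trace.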
Popular vector store options:
| Store | Best for | Hosting |
|---|---|---|
| Pinecone | Managed, production scale | Cloud only |
| Weaviate | Hybrid (keyword + vector) search | Self-host or cloud |
| pgvector | Teams already on PostgreSQL | Self-host |
| Chroma | Local development, prototyping | Embedded or self-host |
| Qdrant | High-throughput, Rust-native | Self-host or cloud |
For most teams, pgvector is the lowest-friction path to production if you already run Postgres. Pinecone is the right choice when you need a fully managed solution that scales independently of your application database.
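As a sketch of what the pgvector path looks like (table name and embedding dimension are illustrative; `<=>` is pgvector's cosine-distance operator):

```sql
-- One-time setup: enable the extension and create an embedding column.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(384)  -- must match your embedding model's output dimension
);

-- HNSW index for approximate nearest-neighbor search under cosine distance.
CREATE INDEX ON doc_chunks USING hnsw (embedding vector_cosine_ops);

-- Query time: $1 is the query embedding.
SELECT content
FROM doc_chunks
ORDER BY embedding <=> $1
LIMIT 5;
```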
Tip
Use a re-ranker between retrieval and injection. First-pass embedding similarity is fast but imprecise. A lightweight cross-encoder re-ranker (such as cross-encoder/ms-marco-MiniLM-L-6-v2) running over the top-20 candidates can surface the top-5 most relevant chunks with dramatically better precision — at a cost of 10–30 ms of added latency.
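The two-stage pattern is simple to wire up. The sketch below takes a generic `score_pair` callable in place of a real cross-encoder (with sentence-transformers you would load the model via `CrossEncoder(...)` and score pairs with its `predict` method); the toy `overlap_score` here is word overlap so the example runs stand-alone:

```python
def rerank(query: str, candidates: list[str], score_pair, top_n: int = 5) -> list[str]:
    """Re-score first-pass retrieval candidates with a (query, doc) scorer.

    In production, `score_pair` would wrap a cross-encoder's predict() call.
    """
    scored = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return scored[:top_n]

def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query terms present in the document."""
    terms = set(query.lower().split())
    return len(terms & set(doc.lower().split())) / max(len(terms), 1)
```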
Tier 3 — Episodic Memory (Event Log)
What it is: A structured, time-ordered record of what the agent did — tool calls made, results received, decisions taken, and user interactions. Unlike in-context memory (which is the live working set), episodic memory is the durable log that survives session boundaries.
The term comes directly from cognitive science: episodic memory is how humans recall specific events ("I ran a database migration on Tuesday") as opposed to general facts ("PostgreSQL requires a primary key").
Characteristics:
- Durable across sessions — stored in a database, not in the context window
- Can be retrieved selectively (e.g., "what did this agent do yesterday?") or summarized
- Constitutes the audit trail, which is essential for debugging and compliance
- Can be used to bootstrap future sessions without replaying the full history
When to use it:
- Any agent that must maintain continuity across sessions (personal assistants, long-running workflows)
- Multi-agent systems that need a shared history of what each agent did
- Regulated industries where an audit trail is a compliance requirement
LangGraph's checkpointer model:
LangGraph, one of the frameworks profiled on this site, implements episodic memory as checkpoints — snapshots of the full graph state at each node execution. This is a first-class feature, not an afterthought:
```python
import os

from langchain_core.messages import HumanMessage
from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver

# Every state transition is persisted atomically
checkpointer = PostgresSaver.from_conn_string(os.environ["DATABASE_URL"])

graph = StateGraph(AgentState)
# ... define nodes/edges ...
app = graph.compile(checkpointer=checkpointer)

# Resume a prior run by thread_id
config = {"configurable": {"thread_id": "user-abc-session-42"}}
result = await app.ainvoke(
    {"messages": [HumanMessage(content="Continue where we left off")]},
    config,
)
```
LangGraph ships an in-memory checkpointer (MemorySaver) and a SQLite backend (SqliteSaver), both suitable for development, with PostgreSQL (PostgresSaver) as the primary production backend and Redis available via the langgraph-checkpoint-redis package.
Session summarization pattern:
For long-running agents, the event log grows unbounded. The standard mitigation is progressive summarization: at regular intervals (e.g., every 20 turns or every hour), run a summarizer agent over the oldest un-summarized events and replace them with a compact summary that is stored as a special episodic record.
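A minimal compaction sketch, assuming a `summarize` callable (in practice an LLM call; here it is a hypothetical stand-in passed by the caller):

```python
def compact_log(events: list[dict], summarize, keep_recent: int = 20) -> list[dict]:
    """Replace all but the most recent events with a single summary record."""
    if len(events) <= keep_recent:
        return events
    old, recent = events[:-keep_recent], events[-keep_recent:]
    summary = {
        "type": "summary",
        "content": summarize([e["content"] for e in old]),
        "covers": len(old),  # how many raw events this record replaces
    }
    return [summary] + recent
```

Run this on a schedule (every N turns, or hourly) and the log's token footprint stays bounded while older history survives in compressed form.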
Tier 4 — Procedural Memory (Fine-tuned / Baked-in Knowledge)
What it is: Knowledge encoded directly into the model's weights through pre-training or fine-tuning. The model "knows" how to write Python, what HIPAA requires, or how your internal API is structured — without needing to retrieve that information at runtime.
Characteristics:
- Zero retrieval cost at inference time
- Expensive and slow to update — requires a training run
- No "source of truth" link — the model cannot cite the document it learned from
- Best for stable, high-frequency knowledge that would otherwise bloat every context window
When to use it:
- High-volume production agents where retrieval latency is a bottleneck
- Stable domain knowledge that changes infrequently (legal definitions, medical coding standards)
- Behavioral instructions that should be unconditionally followed (tone, output format, safety rules)
When not to use it:
- Knowledge that changes more than monthly — fine-tuning cadence is too slow
- Factual knowledge where citations matter — fine-tuned models cannot self-cite
- Early in development — the iteration cycle for prompts is hours, for fine-tuning it's days
Warning
Fine-tuning can overwrite general capabilities in unexpected ways. Teams that fine-tune agents for narrow domains frequently report degraded performance on adjacent tasks. Always maintain a regression evaluation suite before and after a fine-tuning run.
Choosing the Right Tier: A Decision Framework
The four tiers are not mutually exclusive — most production agents use all four simultaneously. The decision is which tier owns a given piece of information.
| Question | Recommended tier |
|---|---|
| Does this information need to be coherently attended to all at once? | In-context |
| Is this information too large to fit in context? | External retrieval |
| Does the agent need to recall what it did across sessions? | Episodic log |
| Is this high-frequency, stable knowledge where retrieval latency is unacceptable? | Procedural (fine-tune) |
| Does this information change frequently? | External retrieval (never fine-tune) |
| Is an audit trail required? | Episodic log |
A practical starting point for most teams is this hierarchy: start with in-context, add external retrieval when context overflows, add episodic logging when multi-session continuity is needed, and consider fine-tuning only when you have proven retrieval cannot meet latency requirements.
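The table above can be collapsed into a first-pass routing function. This is a sketch with hypothetical boolean flags, not a substitute for judgment:

```python
def choose_tier(
    fits_in_context: bool,
    needs_cross_session_recall: bool,
    changes_frequently: bool,
    latency_critical: bool,
) -> str:
    """First-pass tier routing for a piece of information, per the table above."""
    if needs_cross_session_recall:
        return "episodic log"
    if changes_frequently or not fits_in_context:
        return "external retrieval"  # never fine-tune fast-changing knowledge
    if latency_critical:
        return "procedural (fine-tune)"
    return "in-context"
```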
The MemGPT Insight: Treating the OS as a Model for Agent Memory
One of the most influential ideas in agent memory architecture came from a 2023 paper out of UC Berkeley: MemGPT: Towards LLMs as Operating Systems (Packer et al., 2023). The system — now rebranded as Letta — draws a direct analogy between OS memory hierarchies (L1/L2 cache, RAM, disk) and LLM agent memory.
The key insight: the agent itself should be responsible for deciding what to page in and out of context. Rather than having an external controller manage memory, MemGPT/Letta gives the agent tools to explicitly read from and write to a persistent memory store, treating the context window as a managed cache.
```
┌─────────────────────────────────────────────┐
│            Context Window (RAM)             │
│ ┌──────────────┐ ┌──────────────────────┐   │
│ │ System Prompt│ │   Working Memory     │   │
│ │   (fixed)    │ │  (agent-managed KV)  │   │
│ └──────────────┘ └──────────────────────┘   │
└─────────────────────────────────────────────┘
                 ↕ paging
┌─────────────────────────────────────────────┐
│          External Storage (Disk)            │
│  - Archival memory (vector store)           │
│  - Conversation history (episodic log)      │
└─────────────────────────────────────────────┘
```
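In practice, the paging idea reduces to giving the agent explicit read/write tools over its own memory. A toy sketch follows; the method names are illustrative, loosely modeled on MemGPT's core/archival split, and the search is naive substring matching rather than embedding retrieval:

```python
class AgentMemory:
    """Toy MemGPT-style store: a small in-context section the agent edits,
    plus an out-of-context archive it can search. Names are illustrative."""

    def __init__(self) -> None:
        self.core: list[str] = []     # stays pinned in the context window
        self.archive: list[str] = []  # paged out; retrieved on demand

    def core_memory_append(self, fact: str) -> None:
        """Agent-invoked tool: pin a fact into the always-visible section."""
        self.core.append(fact)

    def archival_insert(self, text: str) -> None:
        """Agent-invoked tool: write text to out-of-context storage."""
        self.archive.append(text)

    def archival_search(self, query: str) -> list[str]:
        """Agent-invoked tool: naive substring search (a real system embeds)."""
        return [t for t in self.archive if query.lower() in t.lower()]

    def render_context(self) -> str:
        """What gets injected into the prompt on every turn."""
        return "\n".join(self.core)
```

Exposing these as tools is what makes the context window behave like a managed cache: the agent, not an external controller, decides what stays resident.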
This architecture is production-proven: OpenAI's ChatGPT memory feature (launched February 2024) implements a similar model, where the assistant maintains a structured key-value memory store and explicitly decides when to read from or update it.
State Management in Multi-Agent Systems
When multiple agents collaborate — as covered in the Multi-Agent Orchestration guide — memory architecture becomes a coordination problem. There are two primary models:
Shared State
All agents read from and write to a single state object. The orchestrating framework (LangGraph's StateGraph, for example) manages concurrent access with reducers that specify how values from different agents are merged.
```python
from operator import add
from typing import Annotated, TypedDict

class SharedAgentState(TypedDict):
    messages: Annotated[list, add]        # messages from all agents are appended
    research_notes: Annotated[list, add]  # each agent contributes to the shared pool
    final_answer: str                     # last-writer-wins (no reducer)
```
Advantage: Agents can see each other's work without explicit message passing.
Risk: Write conflicts and unintended state overrides. Use reducers on every field that multiple agents write to.
Isolated State with Handoffs
Each agent maintains its own private state. When control transfers from one agent to another, a structured message (the handoff payload) carries only what the receiving agent needs. This is the model used by OpenAI Swarm and Pydantic AI's agent handoff API.
Advantage: Easier to reason about, easier to test individual agents in isolation.
Trade-off: The handoff payload must be carefully designed — omitting context the receiving agent needs is a common source of bugs.
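One way to keep handoffs honest is to make the payload an explicit schema rather than free text. The field names below are hypothetical, a sketch of the shape such a contract might take:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffPayload:
    """Explicit contract for what crosses an agent boundary (fields illustrative)."""
    task: str                                         # what the receiver must do
    facts: list[str] = field(default_factory=list)    # conclusions, not raw transcript
    open_questions: list[str] = field(default_factory=list)
    source_agent: str = "unknown"

def build_handoff(task: str, notes: list[str], source: str) -> HandoffPayload:
    """Carry only distilled conclusions forward, never full private state."""
    return HandoffPayload(task=task, facts=notes, source_agent=source)
```

A schema makes omissions reviewable: if the receiving agent keeps failing, you can diff what the payload carried against what the task needed.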
Tip
For most teams, start with isolated state and explicit handoffs. Shared state feels elegant in design but is difficult to debug when two agents write conflicting values. Move to shared state only for agents that genuinely need real-time visibility into each other's work.
Common Pitfalls
1. Treating retrieval as preprocessing instead of a tool
Running retrieval before the LLM call and blindly stuffing results into context means the agent cannot reason about retrieval quality. Expose retrieval as a tool; let the agent decide when to use it and whether the results are sufficient.
2. Ignoring context window cost at scale
At frontier model pricing of roughly $0.002–$0.005 per 1K input tokens (across GPT-5, Claude Sonnet 4.6, and Gemini 2.5 Pro tiers as of Q1 2026), a 50,000-token active context window costs $0.10–$0.25 per call. For an agent making 10 calls per task handling 10,000 tasks per month, that is $10,000–$25,000/month in context costs alone — before output tokens. Memory design is a cost optimization problem.
3. No compression strategy for long-running episodic logs
Without progressive summarization, episodic logs grow indefinitely. After two weeks of active use, a personal assistant agent can accumulate 500K+ tokens of history. Define a compaction schedule before going to production.
4. Writing to all tiers simultaneously
Some implementations persist to external storage on every agent turn. For high-frequency agents, the write latency to a vector database or PostgreSQL checkpoint can become the bottleneck. Batch writes or write asynchronously when consistency requirements allow.
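A minimal batching sketch: events accumulate in memory and a single `flush_fn` (hypothetical; in practice a bulk INSERT or checkpoint write) runs every `batch_size` events, trading durability of the last partial batch for lower write latency:

```python
class BatchedWriter:
    """Buffer episodic events and flush them in batches instead of per turn."""

    def __init__(self, flush_fn, batch_size: int = 10) -> None:
        self.flush_fn = flush_fn    # e.g. a bulk INSERT into the event log
        self.batch_size = batch_size
        self.buffer: list[dict] = []

    def write(self, event: dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Call on shutdown too, or the trailing partial batch is lost."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
```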
5. Conflating memory tiers in debugging
When an agent produces wrong output, the root cause may be in any tier: stale data in the vector store, corrupted episodic state, or a fine-tuned behavior that overrides the intended prompt. Always instrument each tier separately in your observability stack.
Key Takeaways
- Memory is a stack, not a single mechanism. In-context, external retrieval, episodic, and procedural memory each serve a distinct role.
- In-context memory is the cache. Fast and coherent, but finite and ephemeral.
- External retrieval (RAG) is the long-term store. Scalable and updatable, at the cost of retrieval latency and chunking complexity.
- Episodic memory is the audit trail and continuity layer. Essential for multi-session agents and regulated industries.
- Procedural memory (fine-tuning) is the last resort. Only warranted when retrieval latency is genuinely unacceptable and knowledge is stable.
- State management in multi-agent systems is a coordination problem. Start with isolated state and explicit handoffs; add shared state only when necessary.
Frequently Asked Questions
What is in-context memory in an LLM agent?
In-context memory is everything currently loaded into the model's active context window — the system prompt, conversation history, and injected documents. It is the agent's working memory: immediately accessible, but limited by the model's context window size and lost at the end of a session.
What is the difference between RAG and in-context memory?
In-context memory is what the model currently sees. RAG (Retrieval-Augmented Generation) is a technique for populating that context by retrieving relevant chunks from an external vector store. RAG solves the scale problem: instead of loading an entire knowledge base into context, the agent retrieves only the most relevant fragments at query time.
How do LLM agents remember things across sessions?
Session-persistent memory requires an explicit external store. The most common patterns are: (1) an episodic log — a database of past interactions that can be summarized and injected into future sessions, and (2) an external key-value or vector store where the agent writes facts it should remember. LangGraph's checkpointer API and architectures like MemGPT/Letta implement this at the framework level.
When should I fine-tune a model for an agent instead of using RAG?
Fine-tuning is appropriate when: the knowledge is stable (changes less than monthly), retrieval latency is a hard constraint, and you have enough labeled data to train on. For most teams, RAG is the better starting point because it is faster to update, supports citations, and requires no training infrastructure.
What vector database should I use for agent memory?
Start with pgvector if you already run PostgreSQL — it requires the least additional infrastructure. Use Pinecone for a fully managed solution that scales independently of your application. Use Chroma for local development. Evaluate Weaviate if you need hybrid keyword + vector search.
How do multi-agent systems share memory?
Two primary models: (1) shared state, where all agents read/write a single state object managed by the orchestration framework (e.g., LangGraph's StateGraph with reducers), and (2) isolated state with handoffs, where each agent maintains private state and transfers context via structured messages when passing control. Start with isolated state — it is easier to test and debug.
What to Read Next
- RAG for Agents: Retrieval as a First-Class Tool — the next piece in this series (Day 3), diving deep into chunking, embedding, and re-ranking for agent-specific retrieval workloads
- Multi-Agent Orchestration: Patterns and Trade-offs — how coordination patterns interact with the shared vs. isolated state models described above
- Tool Use in LLM Agents: Patterns, Pitfalls, and Best Practices — the tool-call pattern that underlies retrieval-as-a-tool
References:
- Sumers et al. (2023), "Cognitive Architectures for Language Agents," arXiv:2309.02427.
- Packer et al. (2023), "MemGPT: Towards LLMs as Operating Systems," arXiv:2310.08560.
- Lewis et al. (2020), "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," NeurIPS 2020.