Context Window
The maximum number of tokens a language model can attend to in a single inference call, encompassing the system prompt, conversation history, retrieved documents, tool results, and the model's own generated output.
Definition
The context window is the total token budget available to a language model during a single inference call. Every token that the model can "see" and reason about must fit within this limit. Exceeding it requires truncation, summarization, or retrieval-based strategies.
Modern frontier models have dramatically extended context windows — from GPT-3's 4K tokens to 128K (GPT-4o), 200K (Claude 3), and beyond — but the context window remains a fundamental design constraint in every agent system.
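To make the budget concrete, here is a minimal sketch using the `tiktoken` library with the `cl100k_base` encoding (an approximation; the exact tokenizer, and therefore the count, varies by model and provider). It checks whether a prompt plus a reserved output allowance fits inside a given window; the window size and reserve values are illustrative.

```python
import tiktoken

# cl100k_base is used by several OpenAI chat models; counts for other models differ.
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_window(prompt: str, window_size: int, reserved_for_output: int = 1_000) -> bool:
    """Return True if the prompt plus a reserved output budget fits in the window."""
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + reserved_for_output <= window_size

prompt = "You are a helpful assistant. Summarize the attached report..."
print(fits_in_window(prompt, window_size=128_000))
```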
What Counts Against the Limit?
| Component | Typical Token Cost |
|---|---|
| System prompt | 500 – 2,000 tokens |
| Conversation history | Grows linearly with turns |
| Retrieved RAG chunks | 500 – 2,000 per retrieval |
| Tool schemas | 100 – 500 per tool |
| Tool call results | Varies widely |
| Model output (completion) | 500 – 4,000 tokens typical |
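The table suggests a simple accounting exercise: add up the fixed components and see how much room is left for retrieval. The sketch below is illustrative only; the component names and token figures are placeholder assumptions, not measurements.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    """Rough per-component accounting against a single context window."""
    window_size: int
    system_prompt: int
    tool_schemas: int
    history: int
    reserved_output: int

    def remaining_for_retrieval(self) -> int:
        # Whatever is not spent on fixed components is available for RAG chunks.
        used = self.system_prompt + self.tool_schemas + self.history + self.reserved_output
        return max(0, self.window_size - used)

budget = ContextBudget(
    window_size=128_000,
    system_prompt=1_500,    # instructions and persona
    tool_schemas=2_000,     # e.g. five tools at ~400 tokens each
    history=40_000,         # accumulated conversation turns
    reserved_output=4_000,  # headroom for the model's completion
)
print(budget.remaining_for_retrieval())  # tokens left for retrieved chunks
```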
Context Window vs. Memory
The context window is not the same as agent memory. The context window is ephemeral — it exists only for a single inference call. Agent memory systems (vector stores, key-value caches, episode summarization) persist information across calls and selectively load relevant pieces into the context window when needed.
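To make the distinction concrete, here is a minimal sketch of a memory that persists across calls and loads only relevant items into the (ephemeral) context window. The `EpisodicMemory` class and its keyword-overlap scoring are illustrative stand-ins; a real system would use embeddings and a vector index.

```python
class EpisodicMemory:
    """Persistent store that outlives any single inference call."""

    def __init__(self) -> None:
        self.episodes: list[str] = []

    def remember(self, text: str) -> None:
        self.episodes.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Keyword overlap stands in for real relevance scoring (e.g. embeddings).
        query_words = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda ep: len(query_words & set(ep.lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = EpisodicMemory()
memory.remember("User prefers metric units.")
memory.remember("User's deployment target is AWS Lambda.")

# Only the relevant memories are serialized into the context window for this call.
relevant = memory.recall("What region should the Lambda function run in?")
prompt = "Relevant facts:\n" + "\n".join(relevant) + "\n\nUser question: ..."
```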
Strategies for Managing Context
- Sliding window — drop the oldest messages when the window fills up (see the sketch after this list).
- Summarization — periodically compress prior turns into a compact summary.
- RAG / selective retrieval — store long documents externally and retrieve only relevant chunks.
- Tool result truncation — trim or summarize verbose tool outputs before passing them back to the model.
- Structured state — maintain agent state as a typed object rather than raw conversation history; serialize only what the model needs.
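As an example of the sliding-window strategy, here is a minimal sketch. It assumes OpenAI-style message dicts with a `content` field, treats the first message as the system prompt, and uses `tiktoken` as an approximate token counter; per-message formatting overhead is ignored.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # approximation; the real tokenizer is model-specific

def message_tokens(msg: dict) -> int:
    # Ignores per-message formatting overhead, which varies by provider.
    return len(enc.encode(msg["content"]))

def sliding_window(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the system prompt, then the most recent messages that still fit."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - message_tokens(system)
    kept: list[dict] = []
    for msg in reversed(rest):  # walk backwards from the newest turn
        cost = message_tokens(msg)
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    return [system] + list(reversed(kept))
```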
Performance Implications
Inference cost scales roughly linearly with total tokens processed (input plus output), since providers bill per token. Latency also grows with context length: time-to-first-token increases because the model must process every prompt token during prefill before emitting its first output token. For production agents, context management is therefore one of the primary levers for controlling cost and speed.
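Because billed cost is linear in tokens, trimming the context has a direct, predictable effect on spend. A back-of-envelope sketch; the per-million-token prices below are hypothetical placeholders, not any provider's actual rates.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_price_per_m: float = 3.00,
                      output_price_per_m: float = 15.00) -> float:
    """Back-of-envelope request cost; prices are placeholder assumptions."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# A bloated 90K-token context vs. a trimmed 20K-token one, same 1K-token answer.
print(estimate_cost_usd(90_000, 1_000))  # ~0.285
print(estimate_cost_usd(20_000, 1_000))  # ~0.075
```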