Key takeaways
- A context window is the maximum number of tokens a large language model can process in a single forward pass — everything the model “sees” at once: system prompt, user message, retrieved documents, and the generated response.
- As of 2026, practical limits vary by vendor: GPT-4o at 128K tokens, GPT-4.1 up to 1M, Claude at 200K (1M beta for enterprise), Gemini 1.5/2.0 Pro at 2M, and Llama 4 advertising up to 10M.
- Standard self-attention scales as O(n²) in sequence length — doubling the window roughly quadruples compute and memory, which is why long context is expensive.
- Techniques like sparse attention, sliding-window attention, YaRN, and ALiBi extend effective context without a full quadratic cost. Retrieval-augmented generation offers an orthogonal path: keep the window small, fetch only what matters.
- Published limits describe capacity, not comprehension. The “needle in a haystack” test and follow-ups show that models often miss detail in the middle of long prompts — so application design still matters even with million-token windows.
What a context window actually is
At inference time, a transformer reads its input as a sequence of tokens and produces output one token at a time. The context window is the ceiling on that sequence: every token of the system prompt, user message, prior conversation turn, retrieved document, and generated response counts against it. When the window fills up, something has to be dropped — usually the oldest turns of a chat, or the least-relevant retrieved chunks.

Tokens are not words. A typical English word tokenizes to roughly 1.3 tokens in modern vocabularies; code, numbers, and non-English scripts are less efficient. A 200K-token window holds about 150,000 English words — roughly a 500-page novel. A 2M-token window holds a mid-sized codebase or several hours of transcribed audio. For the mechanics of how text becomes tokens, see the tokenization explainer.
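The back-of-envelope arithmetic above can be wrapped in a small helper. The ratios here (1.3 tokens per English word, ~300 words per printed page) are the rough rules of thumb from this section, not exact figures:

```python
TOKENS_PER_WORD = 1.3   # rough ratio for English prose
WORDS_PER_PAGE = 300    # typical printed page

def window_capacity(window_tokens: int) -> dict:
    """Estimate how much English prose a context window holds."""
    words = int(window_tokens / TOKENS_PER_WORD)
    return {"tokens": window_tokens,
            "approx_words": words,
            "approx_pages": words // WORDS_PER_PAGE}

print(window_capacity(200_000))    # ~150K words, ~500 pages
```

Running it for 2,000,000 tokens gives roughly 1.5M words, which is where the "mid-sized codebase" intuition comes from.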
Input vs output budgets
Most vendors split the context window between input and output. Claude’s 200K window, for example, is shared — a long prompt leaves less room for a long answer. GPT-4o reserves a separate output cap (currently 16K tokens) inside its 128K window. When budgeting, treat the window as a single pool and leave explicit headroom for the response.
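A minimal budgeting check makes the two regimes concrete; the window and cap figures below are illustrative, matching the examples in this section:

```python
from typing import Optional

def fits_budget(input_tokens: int, max_output_tokens: int,
                window: int, output_cap: Optional[int] = None) -> bool:
    """True if a request fits: input and output share one window, and the
    output also respects a separate cap when the vendor imposes one."""
    if output_cap is not None and max_output_tokens > output_cap:
        return False
    return input_tokens + max_output_tokens <= window

# Shared 200K window: a 190K prompt leaves at most 10K for the answer.
print(fits_budget(190_000, 10_000, window=200_000))                     # True
print(fits_budget(190_000, 16_000, window=200_000))                     # False
# 128K window with a separate 16K output cap.
print(fits_budget(100_000, 16_000, window=128_000, output_cap=16_000))  # True
```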
Current limits across major models
Context-window sizes have grown roughly 100× in three years. A snapshot of where frontier APIs sit in 2026:
- OpenAI GPT-4o — 128K tokens input, 16K output cap.
- OpenAI GPT-4.1 — 1M tokens, positioned for long-document and code-base work.
- Anthropic Claude (Sonnet/Opus 4.x) — 200K tokens standard; a 1M-token beta is available to enterprise tier-4 accounts.
- Google Gemini 1.5/2.0 Pro — 2M tokens, the largest readily available window on a frontier model.
- Meta Llama 4 — advertised at up to 10M tokens for the Scout variant, though practical usage at that length is largely untested.
These numbers move quarterly; always check the vendor’s current pricing and capability page before designing around a specific figure.
Why limits differ
Larger windows cost more to train, serve, and evaluate. A vendor’s published ceiling reflects a trade-off between how much data they trained the positional embeddings on, how much GPU memory their serving stack allocates per request, and how aggressively they are willing to price long-context inference.
Why attention is expensive: the O(n²) problem
Vanilla self-attention, the core operation in transformers, compares every token to every other token. For a sequence of length n, that is n² pairwise comparisons, and the intermediate attention matrix requires memory proportional to n² as well. Doubling the window from 32K to 64K tokens roughly quadruples both compute and memory for the attention step.
This is why the jump from 4K to 128K windows required architectural changes, not just bigger machines. Naïvely running attention on a 2M-token sequence would require moving petabytes of activations through GPU memory per layer. Real long-context systems use a combination of engineering tricks and algorithmic approximations.
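The quadratic growth is easy to see with a little arithmetic. This sketch sizes just the fp16 attention-score matrix for a single head in a single layer, ignoring activations, KV cache, and everything else:

```python
def attn_matrix_bytes(n_tokens: int, bytes_per_elem: int = 2) -> int:
    """Memory for one n x n attention-score matrix (fp16 by default)."""
    return n_tokens * n_tokens * bytes_per_elem

GIB = 1024 ** 3
for n in (32_000, 64_000, 128_000):
    print(f"{n:>7} tokens: {attn_matrix_bytes(n) / GIB:6.1f} GiB per head per layer")
```

Doubling the sequence length quadruples the figure, which is the O(n²) scaling in miniature.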
Sparse and sliding-window attention
Sparse attention restricts each token to attending only to a subset of other tokens — for example, all tokens within a fixed window plus a handful of “global” tokens (a pattern popularized by Longformer and BigBird). Sliding-window attention is the common special case: every token attends to the previous k tokens. Complexity drops from O(n²) to O(n·k), trading some modelling power for tractable scaling.
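A sliding-window mask is simple to construct. This toy sketch marks which positions each token may attend to (itself plus the previous k-1 tokens); each row has at most k allowed entries, which is where the O(n·k) cost comes from:

```python
def sliding_window_mask(n: int, k: int) -> list[list[bool]]:
    """Causal sliding-window mask: token i attends to tokens
    max(0, i - k + 1) .. i."""
    return [[i - k < j <= i for j in range(n)] for i in range(n)]

for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))
```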
Position extension: YaRN, ALiBi, and friends
Transformers need a way to represent where each token sits in the sequence. Rotary position embeddings (RoPE), used in Llama and Qwen, encode position as a rotation applied to token embeddings. YaRN (Yet another RoPE extensioN) lets a model trained on, say, 4K tokens operate on 128K by rescaling those rotations with a modest amount of continued training.
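The core idea behind RoPE rescaling can be illustrated with plain linear position interpolation. Real YaRN goes further, scaling different frequency bands by different amounts and adjusting attention temperature, so treat this as the simplest possible sketch:

```python
def rope_angles(pos: int, dim: int, base: float = 10_000.0,
                scale: float = 1.0) -> list[float]:
    """Rotation angles for one position. scale > 1 compresses positions
    so longer sequences reuse the rotation range seen in training."""
    return [(pos / scale) * base ** (-2 * i / dim) for i in range(dim // 2)]

# Trained on 4K positions, served at 128K: scaling by 32 maps position
# 128000 onto the same angles that position 4000 had during training.
assert rope_angles(128_000, 64, scale=32.0) == rope_angles(4_000, 64)
```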
ALiBi (Attention with Linear Biases) takes a different route: it drops learned positional embeddings entirely and adds a linear distance penalty to attention scores. Models trained with ALiBi extrapolate to longer sequences than they saw in training, because the bias generalizes naturally.
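ALiBi's bias is just a distance penalty added to the attention scores before softmax. A causal sketch (slope value chosen arbitrarily; real models use a fixed geometric series of slopes across heads):

```python
def alibi_bias(n: int, slope: float) -> list[list[float]]:
    """Causal ALiBi bias: 0 on the diagonal, slope * (j - i) (negative)
    for earlier tokens, -inf for masked-out future tokens."""
    return [[slope * (j - i) if j <= i else float("-inf") for j in range(n)]
            for i in range(n)]

for row in alibi_bias(4, 0.5):
    print(row)
```

Because the penalty is defined for any distance, nothing breaks when a sequence runs longer than anything seen in training, which is the extrapolation property described above.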
RAG as context extension
Retrieval-augmented generation reframes the problem: instead of stuffing a million tokens into every call, keep a small index of documents and retrieve only the passages relevant to the current query. The context window handles the working set; the index holds the long-term memory.
RAG is cheaper, more cache-friendly, and lets the knowledge base update without retraining or reloading the full corpus. The cost is complexity — a retriever, an embedding model, a vector store, and a ranking strategy. It also introduces a new failure mode: the model cannot reason about what was not retrieved.
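A toy illustration of the retrieval step, using word overlap in place of a real embedding model and vector store:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k.
    Production retrievers use embeddings and a vector store instead."""
    q = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

docs = [
    "Prompt caching reduces repeated input cost.",
    "Retrieval keeps the context window small by fetching relevant passages.",
    "Sparse attention restricts which tokens attend to each other.",
]
print(retrieve("how does retrieval keep the context window small", docs, k=1))
```

Only the winning passage enters the prompt; the other documents never consume window space, which is the whole point.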
Long context vs RAG: when each wins
Long context is simpler and better when the full material genuinely matters end-to-end — a single contract, a whole codebase during a refactor, a long technical spec under review. RAG is better when the knowledge base is large, updates often, or most of it is irrelevant to any given query. Many production systems combine both: retrieve the top-k most relevant documents and feed them into a long-context model that can then reason across them.
The needle in a haystack test
A published context window says nothing about comprehension. The “needle in a haystack” test measures whether a model can retrieve a specific fact planted somewhere in a very long input. An early version asked, “What is the best thing to do in San Francisco?” with the answer buried inside an unrelated long document.
Gemini 1.5 Pro reports >99.7% recall up to 1M tokens on this test. Claude and GPT-4o score similarly on text. But newer evaluations — multi-needle, multi-hop, and “reasoning over middle” variants — show the real picture: models degrade as needles move toward the middle of the prompt, as more needles are required, or as the task requires combining information rather than retrieving it. Published 2M windows are not 2M tokens of reliable reasoning.
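Running a needle test on your own workload takes little code. A minimal harness, where `ask_model` is a placeholder for whatever function wraps your LLM API:

```python
def build_haystack(filler: str, needle: str, depth: float) -> str:
    """Plant the needle at a relative depth (0.0 = start, 1.0 = end)."""
    cut = int(len(filler) * depth)
    return filler[:cut] + " " + needle + " " + filler[cut:]

def needle_recall(ask_model, filler, needle, question, answer, depths):
    """Sweep depths and record whether the answer string came back."""
    return {d: answer.lower() in
               ask_model(build_haystack(filler, needle, d)
                         + "\n\n" + question).lower()
            for d in depths}
```

Sweep depths such as 0.0, 0.25, 0.5, 0.75, 1.0 across several fillers and lengths; recall dropping in the middle depths is exactly the degradation pattern described above.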
Practical reading of the results
Treat the advertised window as a hard ceiling, not an operating range. Keep critical instructions at the very start or very end of the prompt (where recall is best). Break long tasks into stages. Test your own prompts empirically — vendors benchmark on synthetic haystacks, but your workload is real documents with real redundancy, which behaves differently.
Practical implications for application design
Prompt design within a token budget
Every token costs money and occupies space that could hold more relevant content. Prune verbose system prompts. Compress retrieved documents before inclusion. Use structured formats (JSON, Markdown tables) that tokenize efficiently. Log actual token counts per request; what feels like a short prompt can surprise you when a chat history accumulates.
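One common pattern is trimming the oldest turns until the history fits a budget. A sketch with a crude chars/4 estimator standing in for a real token counter; a production version would pin the system prompt rather than let it fall off:

```python
def trim_history(messages: list[str], est_tokens, budget: int) -> list[str]:
    """Drop the oldest messages until the estimated total fits the budget."""
    kept = list(messages)
    while kept and sum(est_tokens(m) for m in kept) > budget:
        kept.pop(0)
    return kept

est = lambda s: max(1, len(s) // 4)   # crude ~4-chars-per-token estimate
history = ["system: be concise", "user: " + "x" * 400, "user: latest question"]
print(trim_history(history, est, budget=60))   # only the latest turn survives
```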
Cost tradeoffs
Long-context inference is priced per token, and some vendors charge a premium above a threshold. A 1M-token Gemini call can cost several dollars per request; running that at scale blows up monthly bills fast. Prompt caching (available on Anthropic, OpenAI, and Google) cuts repeated input costs by up to 90% when the same context prefix is reused across calls — essential for any agent or chat product with stable system prompts.
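The caching math is worth doing up front. A sketch with deliberately hypothetical numbers (the 90% discount and $3/Mtok price are placeholders; substitute your vendor's actual rates):

```python
def monthly_input_cost(tokens_per_call: int, calls: int, usd_per_mtok: float,
                       cached_fraction: float = 0.0,
                       cache_discount: float = 0.9) -> float:
    """Input-token spend, assuming cached tokens bill at a flat discount."""
    full = tokens_per_call * calls * usd_per_mtok / 1e6
    return full * (1 - cached_fraction * cache_discount)

# Hypothetical workload: 100K-token prompts, 10K calls/month, $3 per Mtok.
print(monthly_input_cost(100_000, 10_000, 3.0))                       # 3000.0
print(monthly_input_cost(100_000, 10_000, 3.0, cached_fraction=0.9))  # ~570
```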
Architectural choice: window size as a design variable
The window is not just a capacity limit; it is a design parameter. A chat product with a 32K working window, aggressive summarization, and good retrieval can outperform a naïve 2M-token firehose — cheaper, faster, more predictable. Pick the smallest window that cleanly fits the task, and let cost and latency settle the rest.
Frequently asked questions
Does a bigger context window always mean a better answer?
No. Larger windows let a model consider more material, but they also dilute attention. Empirical studies show that for many tasks, a well-curated 32K prompt outperforms the same content padded to 200K. More context also means more opportunity for distracting or contradictory passages to derail the model.
Should I use a long context window or a RAG system?
Use long context when the full material matters and is stable — a single legal document, one codebase, one research paper. Use RAG when the knowledge base is large, grows over time, or most of it is irrelevant to any given query. Many production systems combine both: retrieve the top-k relevant passages, then feed them into a long-context model that reasons across them.
How do I count tokens accurately?
Use the tokenizer that ships with your target model — OpenAI’s tiktoken, Anthropic’s token counting endpoint, Google’s Gemini token counting API. A rough rule of thumb is 1 token per 4 English characters, or about 0.75 words per token, but this varies substantially across languages and content types. Code, numbers, and non-English text all tokenize less efficiently than English prose.
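For quick budgeting without a tokenizer dependency, the rules of thumb can be wrapped in a helper. Treat its output as an estimate for English prose only; the model's own tokenizer remains the authority for anything touching billing or hard limits:

```python
def estimate_tokens(text: str) -> int:
    """Average the two rules of thumb: ~4 characters per token and
    ~0.75 words per token. Rough estimate only."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)

print(estimate_tokens("The context window is the model's working set."))
```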