cost optimizationtokenstutorialprompt caching

LLM Context Window Management: RAG vs Compression vs Full Context

When to use RAG, when to summarize conversation history, and when a 1M token context window is actually worth the cost. A practical decision guide for developers.

TTokenCost Editorial·LLM Cost Research·Updated 2026-05-276 min read

Context window limits determine what your LLM can "see" in a single request. In 2026, frontier models offer 128k to 2M token windows — but larger context means higher cost per request. This guide explains how to manage context strategically: what to include, what to compress, and when to reach for a long-context model vs a RAG approach.

Context Window Sizes in 2026

ModelContext Window~WordsInput /1MCost for Full Context
Gemini 3 Ultra2M~1.5M$10$20.00
Claude Sonnet 4.61M~750k$3$3.00
GPT-4.11M~750k$2$2.00
Gemini 2.5 Pro1M~750k$1.25$1.25

"Cost for Full Context" shows what a single request costs if you fill the entire context window. Filling a 1M token window with Claude Sonnet costs $3.00 — for input alone, before output tokens. At scale, context management is essential.

The Three Strategies for Long Context

RAG
Retrieve only relevant chunks. Best for large static knowledge bases with targeted queries.
Cost: Lowest
Compression
Summarize old turns or distant context. Best for long conversations and iterative workflows.
Cost: Low
Full Context
Send everything. Best when relationships between distant parts matter (full codebases, legal docs).
Cost: High

Strategy 1: RAG (Retrieval-Augmented Generation)

RAG is the most cost-effective approach for knowledge-base applications. Instead of inserting an entire document corpus into the prompt, you embed documents as vectors and retrieve only the top-k relevant chunks at query time. A knowledge base with 10,000 pages becomes a 2,000-token prompt instead of a 5,000,000-token prompt.

1. Chunk documents
Split into 500–1,000 token chunks with overlap (~10%). Smaller chunks = more precise retrieval but more chunks to manage.
2. Embed chunks
Use a cheap embedding model (OpenAI text-embedding-3-small at $0.02/1M tokens). Store vectors in a vector DB (Pinecone, Weaviate, pgvector).
3. Retrieve top-k at query time
k=3 to 5 chunks is typically optimal. Re-rank by semantic relevance if precision is critical.
4. Insert into prompt with citation numbers
Number each chunk [1], [2], [3] so the model can cite sources. Keep metadata in a system-level header, not per chunk.

RAG limitation: it fails when the answer requires synthesizing information spread across many non-contiguous parts of a large document. For those cases, use full-context models.

Strategy 2: Context Compression

For long conversations or multi-step agent workflows, context grows with every turn. Two compression techniques manage this without losing important information:

Sliding Window

Keep only the last N turns in full detail. Discard turns older than the window. Works well for customer support (recent context is most relevant) but loses historical context that may matter for complex tasks.

Progressive Summarization

When the conversation exceeds a token threshold, call the model to summarize the oldest N turns into a compact summary, then replace those turns with the summary. A 20-turn conversation that would fill 8,000 tokens becomes a 300-token summary plus the recent 5 turns.

Summarization prompt pattern:
"Summarize the following conversation history in 3–5 bullet points,
preserving key decisions, facts, and user preferences."

Strategy 3: When to Use Full Long-Context Models

Some tasks genuinely require the full context. The following cases justify a long-context model over RAG:

Whole-codebase refactoring
The model needs to see all files simultaneously to reason about cross-file dependencies
Legal contract review
Clause references span the entire document — retrieving chunks risks missing critical cross-references
Long research paper synthesis
Arguments build across the paper; out-of-order retrieval breaks the reasoning chain
Full conversation replay
Re-analyzing an entire historical conversation for patterns or decisions

Prompt Caching: The Best Tool for Repeated Long Context

When the same long context is sent repeatedly — same system prompt, same document, same codebase — prompt caching reduces the cost of the repeated portion by 50–90% depending on provider.

ProviderCache DiscountCache TTLMin Cache Size
Anthropic Claude90%5 min (extendable)1,024 tokens
OpenAI GPT50%5–10 min1,024 tokens
Google Gemini75%1 hour (default)4,096 tokens

Structure your prompts to put the static context (system prompt, documents) at the beginning and variable content (user query) at the end. The cached prefix must be identical across requests — even a single token difference invalidates the cache.

Bottom Line

Default to RAG for knowledge-base applications — it's 10–100x cheaper than full context. Use progressive summarization for long conversations. Reserve full long-context models for tasks where cross-document relationships genuinely matter. And always enable prompt caching on the static prefix of your prompts — it's the single highest-leverage cost reduction for repeated context.

Related: Prompt Caching Guide · Token Optimization: 7 Techniques · Cost Routing Tool →

Related Articles

Cheapest LLM API in 2026: Full Price Comparison
We compared 26 LLM models across 8 providers to find the cheapest API for every use case — from bulk processing to complex reasoning.
8 min read
7 Ways to Reduce Your OpenAI API Cost by 80%
Practical techniques to dramatically cut your OpenAI API bill: prompt caching, model routing, batch API, and token optimization strategies.
6 min read
Prompt Caching: Save Up to 90% on LLM API Costs
Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.
5 min read