cost optimizationtokenstutorialprompt caching

LLM Context Window Management: RAG vs Compression vs Full Context

When to use RAG, when to summarize conversation history, and when a 1M token context window is actually worth the cost. A practical decision guide for developers.

TTokenCost Editorial·LLM Cost Research·Updated 2026-05-276 min read

Context window limits determine what your LLM can "see" in a single request. In 2026, frontier models offer 128k to 2M token windows — but larger context means higher cost per request. This guide explains how to manage context strategically: what to include, what to compress, and when to reach for a long-context model vs a RAG approach.

Context Window Sizes in 2026

Model	Context Window	~Words	Input /1M	Cost for Full Context
Gemini 3 Ultra	2M	~1.5M	$10	$20.00
Claude Sonnet 4.6	1M	~750k	$3	$3.00
GPT-4.1	1M	~750k	$2	$2.00
Gemini 2.5 Pro	1M	~750k	$1.25	$1.25

"Cost for Full Context" shows what a single request costs if you fill the entire context window. Filling a 1M token window with Claude Sonnet costs $3.00 — for input alone, before output tokens. At scale, context management is essential.

The Three Strategies for Long Context

RAG

Retrieve only relevant chunks. Best for large static knowledge bases with targeted queries.

Cost: Lowest

Compression

Summarize old turns or distant context. Best for long conversations and iterative workflows.

Cost: Low

Full Context

Send everything. Best when relationships between distant parts matter (full codebases, legal docs).

Cost: High

Strategy 1: RAG (Retrieval-Augmented Generation)

RAG is the most cost-effective approach for knowledge-base applications. Instead of inserting an entire document corpus into the prompt, you embed documents as vectors and retrieve only the top-k relevant chunks at query time. A knowledge base with 10,000 pages becomes a 2,000-token prompt instead of a 5,000,000-token prompt.

1. Chunk documents

Split into 500–1,000 token chunks with overlap (~10%). Smaller chunks = more precise retrieval but more chunks to manage.

2. Embed chunks

Use a cheap embedding model (OpenAI text-embedding-3-small at $0.02/1M tokens). Store vectors in a vector DB (Pinecone, Weaviate, pgvector).

3. Retrieve top-k at query time

k=3 to 5 chunks is typically optimal. Re-rank by semantic relevance if precision is critical.

4. Insert into prompt with citation numbers

Number each chunk [1], [2], [3] so the model can cite sources. Keep metadata in a system-level header, not per chunk.

RAG limitation: it fails when the answer requires synthesizing information spread across many non-contiguous parts of a large document. For those cases, use full-context models.

Strategy 2: Context Compression

For long conversations or multi-step agent workflows, context grows with every turn. Two compression techniques manage this without losing important information:

Sliding Window

Keep only the last N turns in full detail. Discard turns older than the window. Works well for customer support (recent context is most relevant) but loses historical context that may matter for complex tasks.

Progressive Summarization

When the conversation exceeds a token threshold, call the model to summarize the oldest N turns into a compact summary, then replace those turns with the summary. A 20-turn conversation that would fill 8,000 tokens becomes a 300-token summary plus the recent 5 turns.

Summarization prompt pattern:

"Summarize the following conversation history in 3–5 bullet points,

preserving key decisions, facts, and user preferences."

Strategy 3: When to Use Full Long-Context Models

Some tasks genuinely require the full context. The following cases justify a long-context model over RAG:

Whole-codebase refactoring

The model needs to see all files simultaneously to reason about cross-file dependencies

Legal contract review

Clause references span the entire document — retrieving chunks risks missing critical cross-references

Long research paper synthesis

Arguments build across the paper; out-of-order retrieval breaks the reasoning chain

Full conversation replay

Re-analyzing an entire historical conversation for patterns or decisions

Prompt Caching: The Best Tool for Repeated Long Context

When the same long context is sent repeatedly — same system prompt, same document, same codebase — prompt caching reduces the cost of the repeated portion by 50–90% depending on provider.

Provider	Cache Discount	Cache TTL	Min Cache Size
Anthropic Claude	90%	5 min (extendable)	1,024 tokens
OpenAI GPT	50%	5–10 min	1,024 tokens
Google Gemini	75%	1 hour (default)	4,096 tokens

Structure your prompts to put the static context (system prompt, documents) at the beginning and variable content (user query) at the end. The cached prefix must be identical across requests — even a single token difference invalidates the cache.

Bottom Line

Default to RAG for knowledge-base applications — it's 10–100x cheaper than full context. Use progressive summarization for long conversations. Reserve full long-context models for tasks where cross-document relationships genuinely matter. And always enable prompt caching on the static prefix of your prompts — it's the single highest-leverage cost reduction for repeated context.

Cheapest LLM API in 2026: Full Price Comparison

We compared 26 LLM models across 8 providers to find the cheapest API for every use case — from bulk processing to complex reasoning.

8 min read

7 Ways to Reduce Your OpenAI API Cost by 80%

Practical techniques to dramatically cut your OpenAI API bill: prompt caching, model routing, batch API, and token optimization strategies.

6 min read

Prompt Caching: Save Up to 90% on LLM API Costs

Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.

5 min read