Token Optimization: 7 Techniques to Cut Your LLM API Token Count by 50%
Compress system prompts, switch to structured output, truncate context, and control output length. Practical token optimization techniques that reduce API costs by 40–60% without hurting quality.
Your LLM API bill is directly proportional to the number of tokens you send and receive. Most applications waste 30–60% of their tokens on padding, redundancy, and poor prompt structure. This guide covers 7 concrete techniques to cut your token count — and your API cost — without degrading output quality.
1. Compress Your System Prompt
System prompts are the single highest-leverage token optimization target — they're sent on every request. A bloated 2,000-token system prompt at 10,000 requests/day costs $60/day just for the prompt. Cutting it to 800 tokens saves 60%.
Rules of thumb: remove filler phrases ("always try to", "should always"), eliminate self-evident instructions ("be helpful", "be polite"), and collapse multi-sentence rules into one-liners. LLMs are trained to follow concise instructions just as well as verbose ones.
2. Use Structured Output to Eliminate Prose
If you're parsing the model's output anyway, ask for JSON directly. Prose responses are verbose by design — they include connectives, hedges, and formatting that consume output tokens without adding machine-readable value.
OpenAI's JSON mode, Anthropic's tool use, and Google's structured output all enforce JSON responses. For classification, extraction, and scoring tasks, this alone typically reduces output tokens by 60–80%.
3. Truncate Input at the Right Point
LLMs charge for every input token, but attention degrades on content far from the beginning or end of the context window ("lost in the middle" problem). Sending 50,000 tokens when 5,000 are relevant wastes 90% of your input budget.
4. Limit Output Length Explicitly
By default, models generate until they naturally stop — often producing longer outputs than needed. Explicit length constraints are the most direct way to reduce output token cost.
Combining max_tokens with an explicit instruction is more reliable than either alone — the instruction sets the model's intent, max_tokens is the hard guardrail.
5. Remove Redundant Context
A common pattern in RAG applications is inserting the same document metadata on every chunk: source URL, date, author, confidence score. If you're inserting 10 chunks with 50 tokens of metadata each, that's 500 tokens of overhead. Consolidate metadata into a single header, or move it to the system prompt if it's static.
6. Use Few-Shot Examples Sparingly
Few-shot examples are expensive: a single example with 200 tokens of input + output costs 200 tokens on every request. 5 examples = 1,000 tokens of overhead per call. Modern frontier models (GPT-4.1, Claude Sonnet 4.6) often match few-shot performance with well-written zero-shot instructions — test before assuming you need examples.
When you do need few-shot examples: use the shortest examples that demonstrate the pattern, cache them with prompt caching (90% discount on Claude, 50% on OpenAI), and periodically evaluate whether fewer examples achieve the same accuracy.
7. Choose the Right Tokenizer
Different providers use different tokenizers. The same text can produce different token counts depending on the model. GPT-4 family uses cl100k_base (tiktoken). Claude uses a different BPE tokenizer. Code, tables, and non-English text tokenize differently — sometimes 2–3x more tokens than English prose of the same length.
Use our token counter tool to measure exact token counts for your prompts before optimizing.
Quick Wins Summary
Bottom Line
Start with the low-effort wins: compress your system prompt, switch to structured output, and add explicit length constraints. These three changes alone typically reduce total token spend by 30–50% without any change to output quality. Then measure actual token usage with the token counter, identify your highest-volume prompt patterns, and apply the remaining techniques to those first.
See also: Prompt Caching Guide · Token Counter · 8 Ways to Reduce LLM API Costs →