cost optimizationtokenstutorial

Token Optimization: 7 Techniques to Cut Your LLM API Token Count by 50%

Compress system prompts, switch to structured output, truncate context, and control output length. Practical token optimization techniques that reduce API costs by 40–60% without hurting quality.

TTokenCost Editorial·LLM Cost Research·Updated 2026-05-277 min read

Your LLM API bill is directly proportional to the number of tokens you send and receive. Most applications waste 30–60% of their tokens on padding, redundancy, and poor prompt structure. This guide covers 7 concrete techniques to cut your token count — and your API cost — without degrading output quality.

Expected savings

Applying all 7 techniques typically reduces total token spend by 40–60%. At $2/1M input (GPT-4.1), a 50% token reduction on 10M tokens/month saves $10/month — with zero change in output quality.

1. Compress Your System Prompt

System prompts are the single highest-leverage token optimization target — they're sent on every request. A bloated 2,000-token system prompt at 10,000 requests/day costs $60/day just for the prompt. Cutting it to 800 tokens saves 60%.

BEFORE — 47 tokens

You are a helpful AI assistant that is always polite and professional. You should always try to provide accurate and helpful information to users. When you don't know the answer to something, you should say so clearly and not make up information.

AFTER — 18 tokens

You are a professional assistant. Be accurate. If unsure, say so.

Rules of thumb: remove filler phrases ("always try to", "should always"), eliminate self-evident instructions ("be helpful", "be polite"), and collapse multi-sentence rules into one-liners. LLMs are trained to follow concise instructions just as well as verbose ones.

2. Use Structured Output to Eliminate Prose

If you're parsing the model's output anyway, ask for JSON directly. Prose responses are verbose by design — they include connectives, hedges, and formatting that consume output tokens without adding machine-readable value.

BEFORE — prose output (~120 tokens)

"Based on the review, the sentiment appears to be positive. The customer seems satisfied with the product quality, though they mentioned some concerns about shipping time."

AFTER — JSON output (~30 tokens)

{"sentiment":"positive","issues":["shipping"]}

OpenAI's JSON mode, Anthropic's tool use, and Google's structured output all enforce JSON responses. For classification, extraction, and scoring tasks, this alone typically reduces output tokens by 60–80%.

3. Truncate Input at the Right Point

LLMs charge for every input token, but attention degrades on content far from the beginning or end of the context window ("lost in the middle" problem). Sending 50,000 tokens when 5,000 are relevant wastes 90% of your input budget.

Chunk documents

Split long documents into chunks, embed them, and retrieve only the top-k relevant chunks (RAG). Send 2–3k tokens instead of 50k.

Truncate conversation history

Keep only the last N turns or summarize old turns. A 20-turn conversation can often be compressed to a 200-token summary.

Filter retrieved context

Re-rank retrieved chunks by relevance score. Drop anything below a threshold before inserting into the prompt.

4. Limit Output Length Explicitly

By default, models generate until they naturally stop — often producing longer outputs than needed. Explicit length constraints are the most direct way to reduce output token cost.

max_tokens: 150 — hard cap on API response length

"Reply in 2 sentences max." — instruction in system prompt

"Output only the JSON object, no explanation." — eliminates preamble

"Be concise. Omit filler phrases." — reduces verbosity by 20–30%

Combining max_tokens with an explicit instruction is more reliable than either alone — the instruction sets the model's intent, max_tokens is the hard guardrail.

5. Remove Redundant Context

A common pattern in RAG applications is inserting the same document metadata on every chunk: source URL, date, author, confidence score. If you're inserting 10 chunks with 50 tokens of metadata each, that's 500 tokens of overhead. Consolidate metadata into a single header, or move it to the system prompt if it's static.

BEFORE — 50 tokens of metadata per chunk × 10 chunks = 500 tokens

[Source: docs.example.com/pricing | Updated: 2026-05-01 | Category: Pricing | Confidence: 0.92]
...chunk content...

AFTER — metadata once in system prompt = ~15 tokens overhead total

[1] ...chunk content...
[2] ...chunk content...

6. Use Few-Shot Examples Sparingly

Few-shot examples are expensive: a single example with 200 tokens of input + output costs 200 tokens on every request. 5 examples = 1,000 tokens of overhead per call. Modern frontier models (GPT-4.1, Claude Sonnet 4.6) often match few-shot performance with well-written zero-shot instructions — test before assuming you need examples.

When you do need few-shot examples: use the shortest examples that demonstrate the pattern, cache them with prompt caching (90% discount on Claude, 50% on OpenAI), and periodically evaluate whether fewer examples achieve the same accuracy.

7. Choose the Right Tokenizer

Different providers use different tokenizers. The same text can produce different token counts depending on the model. GPT-4 family uses cl100k_base (tiktoken). Claude uses a different BPE tokenizer. Code, tables, and non-English text tokenize differently — sometimes 2–3x more tokens than English prose of the same length.

Code with many special characters

Often tokenizes at 1–2 chars/token vs 3–4 for English prose — cost can be 2x higher than expected

JSON with repeated keys

Keys like {`"inputPricePer1M"`} tokenize separately on every occurrence — use short key names for high-volume structured data

Non-English languages

Chinese, Japanese, Korean, Arabic tokenize at ~2–4x more tokens per character than English

Whitespace and newlines

Excessive blank lines, indentation in code blocks — each counts as tokens

Use our token counter tool to measure exact token counts for your prompts before optimizing.

Quick Wins Summary

Technique	Typical Saving	Effort
Compress system prompt	10–30% input	Low
Structured output (JSON)	60–80% output	Low
Truncate / RAG chunking	50–90% input	Medium
Explicit max_tokens + instruction	20–50% output	Low
Remove redundant metadata	5–20% input	Low
Reduce few-shot examples	10–40% input	Medium
Tokenizer-aware key naming	5–15% input	High

Bottom Line

Start with the low-effort wins: compress your system prompt, switch to structured output, and add explicit length constraints. These three changes alone typically reduce total token spend by 30–50% without any change to output quality. Then measure actual token usage with the token counter, identify your highest-volume prompt patterns, and apply the remaining techniques to those first.

Cheapest LLM API in 2026: Full Price Comparison

We compared 26 LLM models across 8 providers to find the cheapest API for every use case — from bulk processing to complex reasoning.

8 min read

7 Ways to Reduce Your OpenAI API Cost by 80%

Practical techniques to dramatically cut your OpenAI API bill: prompt caching, model routing, batch API, and token optimization strategies.

6 min read

Prompt Caching: Save Up to 90% on LLM API Costs

Everything you need to know about prompt caching across Anthropic, OpenAI, and Google — how it works, when to use it, and how much you save.

5 min read