Here's a pattern we see constantly: a team builds an agent using Claude Opus or GPT-4 for everything, ships it, and then panics when the monthly API bill arrives. A single agent processing 500 tasks per day can easily cost $3,000-5,000/month if every step uses a frontier model.
It doesn't have to be this way. Here are the techniques we use to cut agent costs by 50-70% while maintaining output quality.
1. Model Routing: Use the Right Model for Each Step
This is the single biggest cost lever. Not every step in your agent needs a frontier model.
Classification and routing: Use Haiku or GPT-4o-mini. These tasks are simple and smaller models handle them perfectly.
Data extraction: Sonnet-class models are more than enough for pulling structured data from documents.
Complex reasoning: Reserve Opus/GPT-4 for the steps that actually require deep reasoning, multi-step planning, or nuanced judgment.
Final output generation: If the customer sees the output, use the best model. If it's an internal intermediate step, use a cheaper one.
In practice, only 10-20% of agent steps truly need a frontier model. The rest can use models that cost 5-10x less.
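The routing itself can be as simple as a lookup table from step type to model. A minimal sketch, assuming illustrative step categories and model names (adapt both to your own pipeline):

```python
# Map each agent step type to the cheapest model that handles it well.
# The step categories and model names here are illustrative assumptions.
STEP_MODEL = {
    "classify": "claude-3-haiku",       # routing, intent detection
    "extract": "claude-3-5-sonnet",     # structured data extraction
    "reason": "claude-3-opus",          # multi-step planning, judgment
    "customer_output": "claude-3-opus", # the customer sees this
    "internal_output": "claude-3-haiku" # intermediate step, nobody sees it
}

def pick_model(step_type: str) -> str:
    """Return the cheapest adequate model; default to the safest choice."""
    return STEP_MODEL.get(step_type, "claude-3-opus")
```

The key design choice is the default: unknown step types fall back to the frontier model, so a routing gap costs you money rather than quality.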
2. Prompt Caching
If your agent uses a large system prompt (and most do), you're paying for those tokens on every single call. Both Anthropic and OpenAI now support prompt caching:
Cache the system prompt. A 2,000-token system prompt called 1,000 times per day burns roughly 60M input tokens a month — on the order of $900 at Opus input pricing. With cache reads billed at 10% of the base input rate, that drops to roughly $90.
Cache few-shot examples. If you include examples in your prompt, cache them. They don't change between calls.
Structure prompts for caching. Put static content first, dynamic content last. Caching works on prefix matches.
3. Smart Fallback Chains
Instead of always calling the most expensive model, try the cheap one first:
Send the request to a fast, cheap model (Haiku / GPT-4o-mini)
Check the output against quality criteria (confidence score, format validation, key requirements)
If it passes, done. Cost: $0.001
If it fails, escalate to the frontier model. Cost: $0.03
For many workloads, the cheap model handles 70-80% of requests correctly. You only pay premium prices for the hard cases. Overall cost drops by 60%+ with no quality degradation on the final output.
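The escalation logic above fits in a few lines. A sketch, where `call_cheap` and `call_frontier` stand in for real API calls and the quality check (valid JSON with required keys and a confidence threshold) is one example of many:

```python
import json

def passes_quality_check(raw: str) -> bool:
    """Example gate: output must be valid JSON with the expected keys
    and a confidence above our threshold. Tune this per task."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (isinstance(data, dict)
            and {"label", "confidence"} <= data.keys()
            and data.get("confidence", 0) >= 0.8)

def run_step(task: str, call_cheap, call_frontier) -> str:
    cheap = call_cheap(task)       # Haiku / GPT-4o-mini: ~$0.001
    if passes_quality_check(cheap):
        return cheap               # 70-80% of requests end here
    return call_frontier(task)     # escalate only the hard cases: ~$0.03
```

The quality check is where all the leverage is: the stricter it is, the more you escalate (and pay); the looser it is, the more cheap-model mistakes reach the output.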
4. Reduce Token Usage
Tokens are money. Every unnecessary word in your prompt or output costs you:
Trim your system prompts. Most system prompts are 2-3x longer than they need to be. Cut the fluff, keep the instructions.
Use structured output. Ask for JSON instead of prose. It's shorter, cheaper, and easier to parse.
Limit output length. Set max_tokens appropriately. A classification task doesn't need 4,000 tokens of output.
Summarize conversation history. Instead of passing the full chat history, summarize older turns. Context windows are expensive to fill.
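For the last point, a minimal history-compaction sketch: keep the most recent turns verbatim and collapse everything older into one summary message. The `summarize` function is a placeholder — in practice it's itself a cheap-model call:

```python
def compact_history(history: list[dict], keep_last: int = 4,
                    summarize=lambda turns: "Summary of earlier turns."):
    """Replace all but the last `keep_last` turns with a single summary."""
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    summary = {"role": "user", "content": f"[Context] {summarize(older)}"}
    return [summary] + recent
```

Instead of re-sending a 50-turn history on every call, you send one summary message plus four recent turns — a large, recurring token saving on long-running agents.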
5. Batch Processing
If your agent handles tasks that aren't time-sensitive, batch them:
Anthropic's Batch API offers 50% cost reduction for non-real-time workloads
Process similar tasks together: group leads by type, documents by category. Shared context reduces per-task cost.
Off-peak scheduling: some providers offer better rates during low-demand periods
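A sketch of the grouping step plus packaging for a batch endpoint. The request shape mirrors Anthropic's Message Batches API (one `custom_id` plus `params` per request); the task fields are illustrative:

```python
from collections import defaultdict

def group_by_type(tasks: list[dict]) -> dict[str, list[dict]]:
    """Group tasks by type so similar work shares context and a model."""
    groups = defaultdict(list)
    for task in tasks:
        groups[task["type"]].append(task)
    return dict(groups)

def to_batch_requests(tasks: list[dict], model: str) -> list[dict]:
    """Package one group of tasks as batch-API request entries."""
    return [
        {"custom_id": task["id"],
         "params": {"model": model, "max_tokens": 512,
                    "messages": [{"role": "user",
                                  "content": task["prompt"]}]}}
        for task in tasks
    ]
```

Grouping first also lets you route each group: send the "classification" batch to Haiku and only the "analysis" batch to a bigger model, stacking the batch discount on top of model routing.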
6. Cache Repeated Lookups
Agents often look up the same information repeatedly: company data, product details, policy documents. Cache aggressively:
RAG results: If the same query retrieves the same documents, cache the retrieval results (not just the embedding)
API responses: External API calls (company enrichment, shipping status) often return the same data within short windows
LLM responses: For deterministic tasks (classification, extraction), cache the model's response for identical inputs
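For the deterministic case, a minimal TTL cache keyed on (model, prompt) is often enough. A sketch, with `call_model` standing in for the real API call:

```python
import time

_cache: dict[tuple, tuple[float, str]] = {}

def cached_call(model: str, prompt: str, call_model,
                ttl: float = 300.0) -> str:
    """Return a cached response for identical (model, prompt) pairs
    within the TTL window; otherwise call the model and cache it."""
    key = (model, prompt)
    now = time.monotonic()
    if key in _cache:
        ts, result = _cache[key]
        if now - ts < ttl:
            return result          # cache hit: zero API cost
    result = call_model(model, prompt)
    _cache[key] = (now, result)
    return result
```

Only do this for calls you run at temperature 0 on tasks with one right answer (classification, extraction); caching a creative-generation call just serves everyone the same output.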
Putting It All Together
Here's the typical cost reduction breakdown we see when applying all of these techniques. Note that the savings compound rather than add: each technique applies to whatever bill is left after the previous one.
Model routing: -40%
Prompt caching: -10%
Fallback chains: -15%
Token optimization: -10%
Response caching: -5%
Combined, these typically reduce costs by 50-70% compared to a naive implementation that uses the frontier model for everything.
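A quick sanity check on how the per-technique percentages above compound:

```python
# Each reduction applies to the bill left over after the previous one:
# routing, prompt caching, fallback chains, token optimization, response caching.
reductions = [0.40, 0.10, 0.15, 0.10, 0.05]

remaining = 1.0
for r in reductions:
    remaining *= (1 - r)

total_saving = 1 - remaining   # ~0.61, i.e. a ~61% reduction
```

That lands at roughly 61%, squarely inside the 50-70% range — and noticeably less than the 80% you'd get by naively summing the percentages.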
The cheapest API call is the one you don't make. The second cheapest is the one you make with the right model.
Start Here
If you're only going to do one thing, implement model routing. Audit every step in your agent and ask: "Does this really need Opus/GPT-4?" The answer is usually no. Swap in cheaper models for the simple steps and watch your bill drop immediately.