Five Techniques for Reducing AI Agent Token Spend
Token costs can spiral quickly when building AI agents—but with the right techniques, you can reduce spending by 40-60% without sacrificing performance or response quality.
1. Implement Prompt Caching for Repeated Context
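Claude's prompt caching stores a large, stable prompt prefix (system prompt, tool definitions, reference documents) so later requests can reuse it. Cache reads are billed at about 10% of the base input-token price, while the initial cache write costs about 25% more than a normal input token. Cache entries are short-lived (five minutes by default, refreshed on each hit), so the savings appear when calls arrive close together, not when they are spread thinly across a month. For example, a 10KB system prompt is roughly 2,500 tokens; sent 100 times within an active session, that is about 250K prefix tokens, nearly all of them billed at the 90% discount.
Enable caching by adding `cache_control: {"type": "ephemeral"}` to the content block that ends the prefix you want cached, typically the last block of your system prompt. This works best for agents with stable instructions, lengthy examples, or documentation that doesn't change per request. Prompts shorter than the model's minimum cacheable length (on the order of a thousand tokens or more) are processed without caching.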
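As a minimal sketch of the setup above, the snippet below builds request parameters with a cacheable system block and estimates the relative cost of resending the same prefix. The function names and the price multipliers are assumptions for illustration (the multipliers reflect Anthropic's published cache pricing as best understood here; verify them against the current pricing page). It only builds the parameter object, which you would then pass to `client.messages.create(...)`.

```typescript
// Assumed multipliers relative to the base input-token price (verify before use).
const CACHE_WRITE_MULTIPLIER = 1.25; // first request writes the cache
const CACHE_READ_MULTIPLIER = 0.1; // later requests read from it

interface CachedRequestParams {
  model: string;
  max_tokens: number;
  system: Array<{
    type: "text";
    text: string;
    cache_control?: { type: "ephemeral" };
  }>;
  messages: Array<{ role: "user" | "assistant"; content: string }>;
}

/** Builds Messages API params whose system prompt is marked as cacheable. */
function buildCachedRequest(
  systemPrompt: string,
  userMessage: string
): CachedRequestParams {
  return {
    model: "claude-3-5-sonnet-20241022",
    max_tokens: 512,
    system: [
      // cache_control marks the end of the prefix to be cached.
      { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userMessage }],
  };
}

/**
 * Relative input cost of sending the same prefix `calls` times, measured in
 * units of "one uncached send". Assumes the first call writes the cache and
 * every later call arrives while the cache is still warm.
 */
function relativePrefixCost(calls: number): { uncached: number; cached: number } {
  if (calls < 1) return { uncached: 0, cached: 0 };
  return {
    uncached: calls,
    cached: CACHE_WRITE_MULTIPLIER + (calls - 1) * CACHE_READ_MULTIPLIER,
  };
}
```

Under these assumptions, `relativePrefixCost(100)` gives a cached cost of about 11.15 prefix-sends against 100 uncached, so the prefix portion of the bill drops by roughly 89%. Calls that miss the cache window pay the write price again, which is why call spacing matters.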
2. Batch Non-Urgent Requests with the Batch API
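The Batch API processes requests asynchronously at a 50% discount. Results arrive within 24 hours, often much sooner, but there is no latency guarantee, so use it only where nobody is waiting on the answer. For async workflows such as report generation, data enrichment, and background analysis, batching is close to free savings.
Group requests into a single batch job instead of making individual API calls; for example, submit 100 enrichment tasks as one batch. Batching does not change how many tokens each request uses. You simply pay half the standard rate for the input and output tokens of every request in the batch.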
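A small helper can turn a list of prompts into batch entries, as sketched below. The helper names, the `task-N` ID scheme, and the 0.5 discount constant are assumptions for illustration. The resulting array has the shape passed as `requests` when creating a batch.

```typescript
// Assumed: batch requests are billed at half the standard rate.
const BATCH_DISCOUNT = 0.5;

interface BatchRequest {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: Array<{ role: "user"; content: string }>;
  };
}

/** Turns a list of prompts into batch entries with unique custom_ids. */
function toBatchRequests(
  prompts: string[],
  model: string = "claude-3-5-sonnet-20241022",
  maxTokens: number = 256
): BatchRequest[] {
  return prompts.map(
    (prompt, i): BatchRequest => ({
      // custom_id is how each result is matched back to its request.
      custom_id: `task-${i + 1}`,
      params: {
        model,
        max_tokens: maxTokens,
        messages: [{ role: "user", content: prompt }],
      },
    })
  );
}

/** Estimated batch cost given what the same requests would cost at the standard rate. */
function batchCostUsd(standardCostUsd: number): number {
  return standardCostUsd * BATCH_DISCOUNT;
}
```

Keeping the ID scheme deterministic makes it straightforward to join results back to the source records once the batch finishes.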
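```typescript
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// Each entry needs a custom_id that is unique within the batch,
// so results can be matched back to their requests.
const batchRequests = [
  {
    custom_id: "task-1",
    params: {
      model: "claude-3-5-sonnet-20241022",
      max_tokens: 256,
      messages: [{ role: "user" as const, content: "Analyze this data..." }]
    }
  }
];

// Submit the batch; results are fetched later, once processing has ended.
const batch = await client.beta.messages.batches.create({
  requests: batchRequests
});
```
3. Use Vision Token Optimization for Image Inputs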
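Image token cost in Claude depends on pixel dimensions, not file format. Anthropic's published estimate is tokens ≈ (width × height) / 750, and images whose long edge exceeds 1568px are downscaled before processing. By that estimate, a 1080p screenshot lands near the ceiling of roughly 1,600 tokens, while the same image resized to 768×432 costs about 440 tokens, a reduction of around 70% with little quality loss for most agent tasks.
For screenshots, invoices, or diagrams, downscale to a maximum width of about 768px before sending, and check that small text is still legible at that size. JPEG compression (quality 75-80) shrinks the upload but does not change the token count, which depends only on pixel dimensions.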
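The arithmetic can be sketched as below. The divisor of 750 pixels per token follows Anthropic's documented approximation and should be treated as an estimate, not a billing guarantee. The function names are hypothetical, and the actual resizing would be done with whatever image library you already use.

```typescript
// Approximation from Anthropic's vision documentation: tokens ≈ (w × h) / 750.
const PIXELS_PER_TOKEN = 750;

interface Size {
  width: number;
  height: number;
}

/** Scales dimensions down (never up) so the longer edge is at most maxEdge. */
function fitWithin(width: number, height: number, maxEdge: number): Size {
  const longEdge = Math.max(width, height);
  if (longEdge <= maxEdge) return { width, height };
  const scale = maxEdge / longEdge;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}

/**
 * Estimated token cost of an image at the given dimensions.
 * Applies to the image as sent; very large images are downscaled
 * by the API first, so their real cost is capped.
 */
function estimateImageTokens(width: number, height: number): number {
  return Math.ceil((width * height) / PIXELS_PER_TOKEN);
}
```

For a 1920×1080 screenshot, `fitWithin(1920, 1080, 768)` yields 768×432, and `estimateImageTokens(768, 432)` comes to 443 tokens. Running the estimate before upload makes it easy to log per-image cost and spot oversized inputs.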
4. Limit Output Length with Max Tokens Parameter
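Output tokens are billed at a higher rate than input tokens, so every generated token counts. Set `max_tokens` to the minimum your agent actually needs: if you're extracting structured data, cap output at 500 tokens instead of a generous 4096. Note that `max_tokens` is a hard cutoff, not a style instruction. It bounds worst-case cost, but a response that hits the cap is truncated (reported as `stop_reason: "max_tokens"`), so leave enough headroom for valid output.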
Pair this with stricter prompts: instead of "explain thoroughly," ask for "a 1-sentence summary and 3 bullet points."
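One way to apply this is a per-task cap table plus a truncation check, sketched below. The task names and cap values are illustrative assumptions (only the 500-token extraction cap comes from the text above); `stop_reason` is the field the Messages API uses to report why generation stopped.

```typescript
// Illustrative per-task output caps; tune them from observed output lengths.
const OUTPUT_CAPS = {
  classify: 16,
  extract: 500,
  summarize: 300,
} as const;

type TaskKind = keyof typeof OUTPUT_CAPS;

interface CappedRequest {
  model: string;
  max_tokens: number;
  messages: Array<{ role: "user"; content: string }>;
}

/** Builds request params with the output cap for the given task kind. */
function buildCappedRequest(task: TaskKind, prompt: string): CappedRequest {
  return {
    model: "claude-3-5-sonnet-20241022",
    max_tokens: OUTPUT_CAPS[task],
    messages: [{ role: "user", content: prompt }],
  };
}

/** True when the response was cut off by the max_tokens cap. */
function wasTruncated(response: { stop_reason: string | null }): boolean {
  return response.stop_reason === "max_tokens";
}
```

Logging how often `wasTruncated` returns true per task kind shows whether a cap is too tight: frequent truncation on structured output usually means broken JSON downstream, which costs more than the tokens saved.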
5. Route Requests by Complexity with Smart Model Selection
Claude 3.5 Haiku handles most lightweight tasks (classification, extraction, formatting) at roughly a quarter to a third of Sonnet's per-token price, depending on current pricing. Route simple queries to Haiku and reserve Sonnet for complex reasoning, research, or code generation.
Implement a router that estimates request complexity (for example from task type, token count, and instruction length) and selects the cheapest model that meets your quality thresholds. This hybrid approach can cut spend by 30-40% across your agent fleet, depending on how much of your traffic is lightweight.
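A router along these lines can be very small. The sketch below is one possible shape, not a tuned implementation: the task categories, the 2,000-token threshold, and the four-characters-per-token heuristic are all assumptions you would replace with values measured against your own quality checks.

```typescript
const HAIKU = "claude-3-5-haiku-20241022";
const SONNET = "claude-3-5-sonnet-20241022";

type TaskType =
  | "classification"
  | "extraction"
  | "formatting"
  | "reasoning"
  | "research"
  | "code";

interface RoutableRequest {
  taskType: TaskType;
  prompt: string;
}

// Which task types are assumed safe for the cheaper model.
const IS_LIGHTWEIGHT: Record<TaskType, boolean> = {
  classification: true,
  extraction: true,
  formatting: true,
  reasoning: false,
  research: false,
  code: false,
};

// Hypothetical threshold: longer prompts go to the larger model.
const MAX_LIGHTWEIGHT_PROMPT_TOKENS = 2000;

/** Crude token estimate (about 4 characters per token for English text). */
function roughTokenCount(text: string): number {
  return Math.ceil(text.length / 4);
}

/** Picks the cheapest model expected to handle the request. */
function selectModel(request: RoutableRequest): string {
  const isShort =
    roughTokenCount(request.prompt) <= MAX_LIGHTWEIGHT_PROMPT_TOKENS;
  return IS_LIGHTWEIGHT[request.taskType] && isShort ? HAIKU : SONNET;
}
```

Defaulting to the larger model whenever either check fails keeps quality regressions rare. A natural next step is to retry on the larger model when the small model's output fails validation.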
Open-Source Implementation: Pantheon
The Pantheon framework (github.com/lewisallena17/pantheon) provides production-ready Next.js + Supabase templates with token optimization built in. It includes prompt caching middleware, batch job orchestration, and per-agent cost tracking.
Deploy a multi-agent system with cost monitoring in minutes—Pantheon handles caching headers, request batching, and expense dashboards so you focus on agent logic, not infrastructure.
Combine prompt caching, batch processing, vision optimization, output limits, and smart model routing to cut token spend 40-60%—start with Pantheon or apply these techniques incrementally to your existing Claude agents.