Cutting Claude API Costs with Prompt Caching
Prompt caching can cut the input-token cost of repeated context by up to 90% by reusing work the API has already done. Here's how to implement it in your Next.js agent stack.
How Prompt Caching Saves Money
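Claude's prompt caching stores the processed form of a prompt prefix (tool definitions, system prompts, and context blocks) so that later requests can reuse it instead of paying the full input price. Cached reads cost 90% less than regular input tokens, so a 10k-token system prompt is billed like roughly 1,000 tokens on subsequent API calls instead of 10,000.
For AI agents that repeatedly process the same context (product documentation, user profiles, system instructions), this compounds with volume. At a base input price of $3 per million tokens, an agent running 100 requests daily with a 5k-token cached context sends about 15M context tokens a month: roughly $45 uncached versus about $4.50 if every request hits the cache. Savings scale linearly with request volume and context size, so high-traffic agents with large contexts benefit most.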
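To sanity-check a savings estimate against your own traffic, the arithmetic can be sketched as a small function. This is a rough model, not a billing calculator: the prices and multipliers in `ASSUMED_PRICING` are assumptions (a $3 per million token base input price, cache writes at 1.25x, cache reads at 0.1x), and `estimateMonthlySavings` is a hypothetical helper name. It also ignores output tokens and any uncached part of the prompt.

```typescript
// Assumed pricing; check the current price list for your model before relying on it.
interface CachePricing {
  baseInputPerMTok: number;     // USD per million regular input tokens
  cacheWriteMultiplier: number; // cache write price relative to base input
  cacheReadMultiplier: number;  // cache read price relative to base input
}

const ASSUMED_PRICING: CachePricing = {
  baseInputPerMTok: 3.0,
  cacheWriteMultiplier: 1.25,
  cacheReadMultiplier: 0.1,
};

// Estimates the monthly cost of the cached context only.
// hitRate is the fraction of requests that read the cache; the rest rewrite it.
function estimateMonthlySavings(
  contextTokens: number,
  requestsPerDay: number,
  hitRate: number,
  pricing: CachePricing = ASSUMED_PRICING,
  daysPerMonth = 30,
): { uncachedUsd: number; cachedUsd: number; savedUsd: number } {
  const requests = requestsPerDay * daysPerMonth;
  const perTokenUsd = pricing.baseInputPerMTok / 1_000_000;
  const uncachedUsd = requests * contextTokens * perTokenUsd;
  const hits = requests * hitRate;
  const misses = requests - hits;
  const cachedUsd =
    contextTokens *
    perTokenUsd *
    (hits * pricing.cacheReadMultiplier + misses * pricing.cacheWriteMultiplier);
  return { uncachedUsd, cachedUsd, savedUsd: uncachedUsd - cachedUsd };
}

// 5k-token context, 100 requests a day, 90% of requests hitting the cache.
const estimate = estimateMonthlySavings(5_000, 100, 0.9);
console.log(estimate);
```

Under these assumptions the example comes to about $45 a month uncached and just under $10 with caching. Plug in your own request volume, context size, and observed hit rate to see where caching starts to matter for your workload.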
Setting Up Cache in Next.js with Claude SDK
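Caching is configured in the request body, not in headers. Add a `cache_control` field to the content block where the reusable part of your prompt ends (a tool definition, a system prompt block, or a message block), and everything up to and including that block is cached as a prefix. A prompt must reach a minimum length before it can be cached: 1,024 tokens for Claude 3.5 Sonnet, and more for some Haiku models. The first request writes the cache at a premium of about 25% over the base input price; later requests that hit the cache pay about 90% less for those tokens.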
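Here's a minimal example:

```typescript
import Anthropic from '@anthropic-ai/sdk';

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// userQuery is the incoming user message, supplied by your route handler.
const response = await client.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: [
    {
      type: 'text',
      // In practice this text must meet the minimum cacheable length.
      text: 'You are a specialized API agent...',
      // Marks the end of the cacheable prefix.
      cache_control: { type: 'ephemeral' }
    }
  ],
  messages: [
    {
      role: 'user',
      // The dynamic query stays outside the cached prefix.
      content: [{ type: 'text', text: userQuery }]
    }
  ]
});
```

Cache Types: Ephemeral vs. Session-Based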
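The API offers a single cache type, `ephemeral`, with a default lifetime of 5 minutes. The timer resets each time the cached prefix is read, so steady traffic keeps the cache warm. That makes it a good fit for chatbots and real-time agents that send frequent requests with the same prefix.
There is no separate session-based cache, and no cache token to store or reattach: a request hits the cache whenever its prefix exactly matches one that is still live. For longer-lived contexts (product schemas, system instructions), either keep traffic frequent enough to refresh the 5-minute window or request a 1-hour lifetime on the `cache_control` block, which is billed at a higher cache-write price. Supabase is still useful here for logging `cache_creation_input_tokens` and `cache_read_input_tokens` from each response.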
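For longer-lived contexts, one option worth checking is a longer cache lifetime set directly on the block. The sketch below assumes the `ttl` field on `cache_control` accepts `'5m'` and `'1h'`, as described in Anthropic's prompt caching documentation; availability may depend on your model and API version, so verify it before use. `cachedTextBlock` is a hypothetical helper name, not part of the SDK.

```typescript
// Assumed lifetimes; confirm the supported values in the current API docs.
type CacheTtl = '5m' | '1h';

interface CachedTextBlock {
  type: 'text';
  text: string;
  cache_control: { type: 'ephemeral'; ttl?: CacheTtl };
}

// Builds a text block that ends a cacheable prefix.
// Omitting ttl leaves the API default (5 minutes) in effect.
function cachedTextBlock(text: string, ttl?: CacheTtl): CachedTextBlock {
  return {
    type: 'text',
    text,
    cache_control: ttl ? { type: 'ephemeral', ttl } : { type: 'ephemeral' },
  };
}

// A rarely changing product schema is a candidate for the longer lifetime.
const schemaBlock = cachedTextBlock('Product schema: ...', '1h');
console.log(JSON.stringify(schemaBlock.cache_control));
```

The resulting object can be passed as an entry in the `system` array of a request. Because longer lifetimes are billed at a higher write price, they only pay off when requests arrive too far apart to keep the default window alive.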
Measuring Cache Hit Rates
Monitor cache effectiveness through Claude's response metadata. Every response includes `usage.cache_creation_input_tokens` (cache miss) and `usage.cache_read_input_tokens` (cache hit). Log these to Supabase to track ROI.
As a rule of thumb, a healthy production agent targets a cache hit rate of 70% or higher. If you're below 40%, your context blocks probably aren't stable enough: move static instructions to the front of the prompt and consolidate them into a single cached block.
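The hit rate can be computed from logged usage records. This is a minimal sketch that assumes you have stored the `usage` object from each response; it defines hit rate as the token-weighted share of cacheable tokens that were read rather than written, which is one reasonable definition among several. `cacheHitRate` and `CacheUsage` are hypothetical names.

```typescript
// Subset of the usage fields returned with each response.
interface CacheUsage {
  cache_creation_input_tokens?: number | null; // tokens written to the cache (miss)
  cache_read_input_tokens?: number | null;     // tokens served from the cache (hit)
}

// Token-weighted hit rate: cached reads / (cached reads + cache writes).
// Returns 0 when no cacheable tokens were seen.
function cacheHitRate(usages: CacheUsage[]): number {
  let read = 0;
  let written = 0;
  for (const usage of usages) {
    read += usage.cache_read_input_tokens ?? 0;
    written += usage.cache_creation_input_tokens ?? 0;
  }
  const total = read + written;
  return total === 0 ? 0 : read / total;
}

// One cache write followed by three reads of the same 5k-token prefix.
const sample: CacheUsage[] = [
  { cache_creation_input_tokens: 5000, cache_read_input_tokens: 0 },
  { cache_creation_input_tokens: 0, cache_read_input_tokens: 5000 },
  { cache_creation_input_tokens: 0, cache_read_input_tokens: 5000 },
  { cache_creation_input_tokens: 0, cache_read_input_tokens: 5000 },
];
console.log(cacheHitRate(sample)); // 0.75
```

Computing the rate per prompt version or per route, rather than globally, makes it easier to see which context blocks are unstable.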
Common Pitfalls and Fixes
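Caching is prefix-based: any change at or before a cache breakpoint, whether in tool definitions, the system prompt, or earlier messages, invalidates everything cached from that point on. Even a whitespace change counts. Version your prompts or roll out edits behind feature flags so that changes are deliberate and the one-time cost of rewriting the cache is expected.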
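One way to make prompt changes visible is to log a fingerprint of the prompt text alongside its version, so an unexpected run of cache misses can be traced to an edit. The sketch below is an illustration under assumptions: `promptFingerprint`, `definePrompt`, and `cacheWillReset` are hypothetical helpers, and the hash is a simple non-cryptographic FNV-1a over UTF-16 code units, chosen only because it needs no dependencies.

```typescript
// FNV-1a (32-bit) over UTF-16 code units. Non-cryptographic; used here only
// to notice that a prompt's text has changed, not for security.
function promptFingerprint(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

interface VersionedPrompt {
  version: string;
  text: string;
  fingerprint: string;
}

// Pairs a human-chosen version label with a fingerprint of the exact text.
function definePrompt(version: string, text: string): VersionedPrompt {
  return { version, text, fingerprint: promptFingerprint(text) };
}

// A differing fingerprint means the cached prefix will no longer match.
function cacheWillReset(previous: VersionedPrompt, next: VersionedPrompt): boolean {
  return previous.fingerprint !== next.fingerprint;
}

const deployed = definePrompt('v1', 'You are a specialized API agent.');
const edited = definePrompt('v1', 'You are a specialized API agent. ');
// The trailing space is enough to change the text the cache is keyed on.
console.log(deployed.fingerprint, edited.fingerprint, cacheWillReset(deployed, edited));
```

Comparing fingerprints in CI or at deploy time flags edits that were not accompanied by a version bump, which is usually a sign of an accidental change.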
Don't cache user-specific data—it defeats the purpose. Cache only static system instructions, product documentation, and shared context. Dynamic user queries should sit outside cache blocks.
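The split between static and dynamic content can be enforced where the request is assembled. This is a minimal sketch: `buildPrompt` is a hypothetical helper, and the block shapes mirror the request body used earlier in this article. It places the cache breakpoint on the last static block, so every static section before it is part of the cached prefix, and leaves the user query unmarked.

```typescript
interface TextBlock {
  type: 'text';
  text: string;
  cache_control?: { type: 'ephemeral' };
}

interface UserMessage {
  role: 'user';
  content: TextBlock[];
}

// Static sections go into the system prompt; only the last one carries the
// breakpoint. The user query goes into messages with no cache_control.
function buildPrompt(
  staticSections: string[],
  userQuery: string,
): { system: TextBlock[]; messages: UserMessage[] } {
  const system = staticSections.map((text, index): TextBlock => {
    const block: TextBlock = { type: 'text', text };
    if (index === staticSections.length - 1) {
      block.cache_control = { type: 'ephemeral' };
    }
    return block;
  });
  const messages: UserMessage[] = [
    { role: 'user', content: [{ type: 'text', text: userQuery }] },
  ];
  return { system, messages };
}

const prompt = buildPrompt(
  ['System instructions...', 'Product documentation...'],
  'How do I reset my password?',
);
console.log(JSON.stringify(prompt, null, 2));
```

Keeping the static sections in a fixed order matters as much as keeping their text fixed, since reordering them changes the prefix and misses the cache.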
Open-Source Implementation
The Pantheon repository at github.com/lewisallena17/pantheon provides production-ready scaffolding for Claude agents with built-in prompt caching, Supabase integration, and cost tracking. It includes Next.js middleware for automatic cache header injection and a dashboard to monitor cache performance across your agent fleet.
Fork it and customize the system prompts for your use case—cache setup is already wired.