Context Compression for Long-Running AI Agents
Long-running AI agents hit a hard wall: every message adds context, token costs spiral, and response latency climbs. Strategic context compression keeps your Claude agents fast and cheap.
Why Context Grows Into a Problem
AI agents that run for hours, days, or across multiple conversations accumulate memory. Each new interaction includes all previous context: system prompts, conversation history, retrieved documents, tool outputs. With Claude, this means tokens multiply fast.
At $3 per 1M input tokens, a single request carrying 100k tokens of context costs $0.30 in input alone, and an agent resends that context on every turn. Scale to 10 concurrent agents running daily, and you're burning budget on redundant context. Worse: larger contexts take longer to process and can add latency on the order of 200-400ms, breaking the responsiveness users expect.
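To make the arithmetic concrete, here is a rough sketch of how input cost compounds when every turn resends the full history. The $3 price is the figure used above; the function names and the workload (50 turns, 2,000 new tokens per turn) are illustrative assumptions, not measurements.

```typescript
// Illustrative cost model; price and workload are assumptions for this sketch.
const PRICE_PER_MILLION_INPUT_TOKENS = 3; // USD, the figure used in this article

// Input cost of a single request carrying `tokens` tokens of context.
function inputCostUsd(tokens: number): number {
  return (tokens / 1_000_000) * PRICE_PER_MILLION_INPUT_TOKENS;
}

// Total input cost of a conversation where each turn resends the whole history.
// Turn k carries roughly k * tokensPerTurn input tokens.
function cumulativeInputCostUsd(turns: number, tokensPerTurn: number): number {
  let totalTokens = 0;
  for (let k = 1; k <= turns; k++) {
    totalTokens += k * tokensPerTurn;
  }
  return inputCostUsd(totalTokens);
}

console.log(inputCostUsd(100_000).toFixed(2)); // one request with 100k tokens of context
console.log(cumulativeInputCostUsd(50, 2_000).toFixed(2)); // 50 turns, 2k tokens added per turn
```

The second figure grows quadratically with the number of turns, which is why trimming history pays off more the longer an agent runs.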
Selective History Retention
Not all history matters equally. Recent exchanges inform the agent's current task; old interactions become noise. Implement a sliding window: keep only the last N turns, or retain messages within the last M hours.
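As a minimal in-memory sketch of that sliding window, the function below applies both limits at once: a maximum age and a maximum message count. The `Message` shape and the `slidingWindow` name are hypothetical, and the history is assumed to be sorted oldest-first.

```typescript
// Hypothetical message shape for this sketch.
interface Message {
  role: "user" | "assistant";
  content: string;
  createdAt: number; // Unix epoch milliseconds
}

// Keep messages newer than `maxAgeMs`, then keep at most the last `maxMessages` of those.
// Assumes `history` is sorted oldest-first.
function slidingWindow(
  history: Message[],
  maxMessages: number,
  maxAgeMs: number,
  now: number = Date.now(),
): Message[] {
  const cutoff = now - maxAgeMs;
  const recent = history.filter((m) => m.createdAt >= cutoff);
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  return maxMessages > 0 ? recent.slice(-maxMessages) : [];
}
```

Anything the window drops is a candidate for summarization rather than outright deletion.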
In Supabase, store conversations with timestamps and mark 'active' messages. Query only active messages when building the context window. For a customer support agent, you might keep the last 8 turns but summarize anything older than 24 hours into a single brief recap.
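const { data, error } = await supabase
  .from('messages')
  .select('role, content')
  .eq('agent_id', agentId)
  // Only messages from the last 24 hours
  .gt('created_at', new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString())
  // Newest first, so the limit keeps the 8 most recent messages
  .order('created_at', { ascending: false })
  .limit(8);
if (error) throw error;
// Restore chronological order before building the prompt
const activeMessages = data.reverse();
Summarization at Conversation Boundaries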
When an agent completes a task or workflow, summarize the entire exchange into a single 'session summary' message. This becomes the new context seed for future conversations with the same user.
Call Claude to generate a 200-300 token summary: goals achieved, decisions made, relevant facts. Replace the full history with one summary line. A data analysis agent might summarize: 'User analyzed Q3 sales data, identified a 12% decline in region 4, recommended inventory reduction.'
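A minimal sketch of that boundary step follows. The `Summarizer` type is a hypothetical stand-in: in practice it would wrap a Claude call with a prompt asking for a 200-300 token recap, but it is injected here so the compaction logic stays testable without network access. Storing the summary as a `user` message is also an assumption of this sketch.

```typescript
// Hypothetical message shape for this sketch.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical signature: a real implementation would call Claude here.
type Summarizer = (prompt: string) => Promise<string>;

// Collapse a finished session into a single seed message for the next conversation.
async function compactSession(
  history: Message[],
  summarize: Summarizer,
): Promise<Message[]> {
  if (history.length === 0) return [];
  const transcript = history.map((m) => `${m.role}: ${m.content}`).join("\n");
  const summary = await summarize(
    "Summarize goals achieved, decisions made, and relevant facts:\n" + transcript,
  );
  return [{ role: "user", content: `Summary of previous session: ${summary}` }];
}
```

Keeping the original transcript in storage (outside the prompt) is a reasonable safeguard, since a summary cannot be expanded back into the details it dropped.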
Semantic Deduplication
Agents often re-state the same facts or constraints. If the system prompt already covers a rule, don't repeat it in context. Use embedding-based similarity checks to detect near-duplicate information and remove lower-confidence versions.
Before adding a retrieved document or user clarification to context, compare its embedding against the embeddings of items already in context. If the cosine similarity exceeds a threshold such as 0.92, skip it. This is especially useful for agents that query knowledge bases repeatedly.
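The check can be sketched as a cosine-similarity gate in front of the context list. The embeddings are assumed to come from whatever embedding model you already use; `ContextItem` and `addIfNovel` are hypothetical names, and 0.92 is the threshold suggested above rather than a tuned value.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical context item: text plus a precomputed embedding.
interface ContextItem {
  text: string;
  embedding: number[];
}

// Append `candidate` only if nothing already in context is a near-duplicate.
// Returns true when the candidate was added.
function addIfNovel(
  context: ContextItem[],
  candidate: ContextItem,
  threshold = 0.92,
): boolean {
  const isDuplicate = context.some(
    (item) => cosineSimilarity(item.embedding, candidate.embedding) > threshold,
  );
  if (!isDuplicate) context.push(candidate);
  return !isDuplicate;
}
```

This linear scan is fine for a few hundred context items; a larger store would likely want a vector index instead.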
Tool Output Caching
Agents call tools (APIs, databases, searches) constantly. Tool outputs don't need to live forever in context. Cache results with a TTL: keep a database query result for 10 minutes, a web search for 1 hour. If the agent asks the same question within the TTL, return the cached result without bloating context.
In Next.js, use Redis or a simple Supabase table to store hash(tool_input) → output. Check the cache before calling the actual tool. This cuts both API calls and token usage.
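Here is a minimal sketch of that cache-before-call pattern. An in-memory `Map` stands in for Redis or a Supabase table, and the key is the tool name plus the JSON-serialized input rather than a real hash; both are simplifying assumptions, and `ToolCache` is a hypothetical name.

```typescript
interface CacheEntry {
  output: string;
  expiresAt: number; // epoch milliseconds
}

class ToolCache {
  private entries = new Map<string, CacheEntry>();
  private now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Assumption: JSON.stringify is a good enough key for small, plain inputs.
  // Key order matters, so { a: 1, b: 2 } and { b: 2, a: 1 } are cached separately.
  private key(tool: string, input: unknown): string {
    return `${tool}:${JSON.stringify(input)}`;
  }

  // Return a fresh cached output if one exists; otherwise run the tool and cache it.
  async call(
    tool: string,
    input: unknown,
    ttlMs: number,
    run: () => Promise<string>,
  ): Promise<string> {
    const key = this.key(tool, input);
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.output;
    const output = await run();
    this.entries.set(key, { output, expiresAt: this.now() + ttlMs });
    return output;
  }
}
```

A call might look like `cache.call("search", { q }, 60 * 60 * 1000, () => webSearch(q))`, where `webSearch` is whatever function actually performs the lookup.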
Progressive Context Building
Don't load full context upfront. Build messages incrementally: system prompt → last 2 turns → relevant docs → tool results. Stop adding context once you hit 80% of your target token budget. Claude works fine with less context if it's high-signal.
This requires tuning per use case, but the payoff is real: less latency, lower cost, and often better outputs because the agent isn't distracted by noise.
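One way to sketch the budgeted build: pass the candidate parts in priority order and stop at the first one that would push the total past 80% of the target. The four-characters-per-token estimate is a crude assumption, not Claude's tokenizer, and `ContextPart` and `buildContext` are hypothetical names.

```typescript
interface ContextPart {
  label: string;
  text: string;
}

// Crude estimate (about 4 characters per token); an assumption for this sketch only.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Add parts in priority order until the next one would exceed fillRatio * targetBudget.
function buildContext(
  partsByPriority: ContextPart[],
  targetBudget: number,
  fillRatio = 0.8,
): ContextPart[] {
  const limit = targetBudget * fillRatio;
  const selected: ContextPart[] = [];
  let used = 0;
  for (const part of partsByPriority) {
    const cost = estimateTokens(part.text);
    if (used + cost > limit) break; // stop at the first part that does not fit
    selected.push(part);
    used += cost;
  }
  return selected;
}
```

Breaking at the first overflow keeps the priority order strict; skipping the oversized part and continuing with smaller ones is an equally valid choice if lower-priority items are still useful on their own.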
Open-source implementation
Everything in this article runs in pantheon — a production-ready Next.js + Supabase + Claude starter. Clone it, deploy to Vercel, run PM2. The dashboard auto-commits every agent edit and reverts itself if TypeScript breaks.
Get the full starter kit
Start with selective history retention and summarization: for many agents, these two alone can cut token usage by roughly 30-50%. Use the open-source Pantheon implementation to integrate these patterns into your Next.js + Supabase stack today.