Building a Stuck-Task Watchdog for AI Agent Pools
When you're running a pool of AI agents processing tasks asynchronously, silent failures and stuck tasks will eventually cost you—a watchdog system that detects hangs and auto-recovers can save hours of debugging and lost revenue.
Why Stuck Tasks Happen in Agent Pools
AI agent tasks can hang for several reasons: Claude API timeouts, network interruptions, infinite loops in agent logic, or database locks that block state updates. In a pool of concurrent agents, one stuck task is invisible until your monitoring gaps it.
Without visibility, a task marked 'processing' never moves to 'completed' or 'failed'. Your queue stays dirty. New agents skip over it. Users never hear back. The cost compounds.
Detecting Task Hangs with Heartbeat Intervals
The simplest reliable pattern: record a `last_activity_at` timestamp every time an agent updates task state, and run a scheduled check every 2–5 minutes. If a task hasn't moved in longer than your timeout threshold (e.g., 10 minutes for long-running inference), flag it.
This requires minimal overhead—a single Postgres query checking `NOW() - last_activity_at > interval '10 minutes'` and `status = 'processing'`. No complex distributed tracing needed for indie-scale systems.
SELECT id, agent_id, created_at, last_activity_at FROM tasks WHERE status = 'processing' AND (NOW() - last_activity_at) > interval '10 minutes' ORDER BY last_activity_at ASC;Implementing Recovery with Exponential Backoff
Once you detect a stuck task, you have three safe moves: reset it to 'pending' for retry, escalate it to a higher-priority queue, or move it to a 'stuck' status with manual intervention required.
Track retry count in your schema. Increment it on recovery. After 3 retries, stop auto-recovery and send an alert. This prevents runaway loops while giving the task reasonable chances to succeed.
Building the Watchdog Service in Next.js
Implement the watchdog as a Next.js API route that runs on a cron schedule (using a service like EasyCron or Vercel Crons). The route queries Supabase for stuck tasks, updates their status, and fires webhook notifications to your agent pool coordinator.
Keep the watchdog idempotent: if it runs twice in 30 seconds, it should not double-recover the same task. Use database-level locks or add a `recovered_at` timestamp check to prevent duplicates.
Monitoring and Alerting Setup
Log every watchdog run and recovery attempt. Track metrics: tasks recovered per hour, retry success rate, and time-to-recovery. Spike in stuck tasks often signals a deeper issue (API rate limits, database performance).
Send Slack or email alerts when recovery count exceeds a threshold. This early warning system gives you time to fix root causes before they cascade.
Open-Source Implementation: Pantheon
The pantheon repository (github.com/lewisallena17/pantheon) provides a production-ready reference implementation of task pooling and watchdog systems for Claude-powered agents. It includes Supabase schema templates, Next.js API handlers, and retry logic you can fork and customize.
The repo demonstrates real error handling patterns and shows how to integrate task state machines with Claude's API lifecycle. Use it as a starting point rather than a black box.
Open-source implementation
Everything in this article runs in pantheon — a production-ready Next.js + Supabase + Claude starter. Clone it, deploy to Vercel, run PM2. The dashboard auto-commits every agent edit and reverts itself if TypeScript breaks.
◈ Tools mentioned
- Supabase — open-source Firebase alt
- Vercel — zero-config Next.js hosting
- Claude — AI assistant by Anthropic
- Gumroad — sell digital products
Some links may pay us a referral if you sign up. Never affects the price you pay.
Get the full starter kit
Build a heartbeat-based watchdog into your agent pool now—it takes a few hours and will prevent silent failures that cost you later. Start with the Pantheon reference implementation and customize for your stack.