Building a Stuck-Task Watchdog for AI Agent Pools

When you're running a pool of AI agents processing tasks asynchronously, silent failures and stuck tasks will eventually cost you—a watchdog system that detects hangs and auto-recovers can save hours of debugging and lost revenue.

◆ The Kit
Pantheon Starter Kit — Build your own autonomous AI workforce
Full Next.js + Supabase + Claude codebase. 9 PM2 agents wired up. Cost guardrails included. 43 SEO-ready topic pages with AdSense + affiliate slots already plumbed.
$39
buy on gumroad →
ADVERTISEMENT

Why Stuck Tasks Happen in Agent Pools

AI agent tasks can hang for several reasons: Claude API timeouts, network interruptions, infinite loops in agent logic, or database locks that block state updates. In a pool of concurrent agents, one stuck task is invisible until your monitoring gaps it.

Without visibility, a task marked 'processing' never moves to 'completed' or 'failed'. Your queue stays dirty. New agents skip over it. Users never hear back. The cost compounds.

ADVERTISEMENT
Get the Pantheon Starter Kit$39
◇ no time to read?
Get one tight email when I publish something worth sharing — autonomous AI agents, cost engineering, post-mortems. No spam, no SaaS pitches.

Detecting Task Hangs with Heartbeat Intervals

The simplest reliable pattern: record a `last_activity_at` timestamp every time an agent updates task state, and run a scheduled check every 2–5 minutes. If a task hasn't moved in longer than your timeout threshold (e.g., 10 minutes for long-running inference), flag it.

This requires minimal overhead—a single Postgres query checking `NOW() - last_activity_at > interval '10 minutes'` and `status = 'processing'`. No complex distributed tracing needed for indie-scale systems.

SELECT id, agent_id, created_at, last_activity_at FROM tasks WHERE status = 'processing' AND (NOW() - last_activity_at) > interval '10 minutes' ORDER BY last_activity_at ASC;

Implementing Recovery with Exponential Backoff

Once you detect a stuck task, you have three safe moves: reset it to 'pending' for retry, escalate it to a higher-priority queue, or move it to a 'stuck' status with manual intervention required.

Track retry count in your schema. Increment it on recovery. After 3 retries, stop auto-recovery and send an alert. This prevents runaway loops while giving the task reasonable chances to succeed.

Building the Watchdog Service in Next.js

Implement the watchdog as a Next.js API route that runs on a cron schedule (using a service like EasyCron or Vercel Crons). The route queries Supabase for stuck tasks, updates their status, and fires webhook notifications to your agent pool coordinator.

Keep the watchdog idempotent: if it runs twice in 30 seconds, it should not double-recover the same task. Use database-level locks or add a `recovered_at` timestamp check to prevent duplicates.

Monitoring and Alerting Setup

Log every watchdog run and recovery attempt. Track metrics: tasks recovered per hour, retry success rate, and time-to-recovery. Spike in stuck tasks often signals a deeper issue (API rate limits, database performance).

Send Slack or email alerts when recovery count exceeds a threshold. This early warning system gives you time to fix root causes before they cascade.

Open-Source Implementation: Pantheon

The pantheon repository (github.com/lewisallena17/pantheon) provides a production-ready reference implementation of task pooling and watchdog systems for Claude-powered agents. It includes Supabase schema templates, Next.js API handlers, and retry logic you can fork and customize.

The repo demonstrates real error handling patterns and shows how to integrate task state machines with Claude's API lifecycle. Use it as a starting point rather than a black box.

Open-source implementation

Everything in this article runs in pantheon — a production-ready Next.js + Supabase + Claude starter. Clone it, deploy to Vercel, run PM2. The dashboard auto-commits every agent edit and reverts itself if TypeScript breaks.

◈ Tools mentioned

  • Supabase — open-source Firebase alt
  • Vercel — zero-config Next.js hosting
  • Claude — AI assistant by Anthropic
  • Gumroad — sell digital products

Some links may pay us a referral if you sign up. Never affects the price you pay.

Get the full starter kit

Build a heartbeat-based watchdog into your agent pool now—it takes a few hours and will prevent silent failures that cost you later. Start with the Pantheon reference implementation and customize for your stack.