The Self-Improving AI Orchestrator Pattern

The Self-Improving AI Orchestrator Pattern lets your agent system evaluate its own outputs, log the results to a database, and automatically refine its prompts and routing logic, so output quality can improve from run to run without manual prompt tuning.

What the Pattern Actually Does

At its core, the pattern wraps every agent execution in an eval loop. After each task completes, a critic agent scores the output against a rubric, writes a structured feedback record to Supabase, and optionally triggers a prompt-rewrite agent that patches the system prompt for the next run.

This is different from simple retry logic. The orchestrator isn't just re-running failed tasks; it's accumulating a scored record of what worked, what didn't, and why. After a few dozen runs, that record is enough to drive prompt corrections automatically, so you are not combing through logs by hand.
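The loop described in this section can be sketched as a single wrapper function. This is a minimal illustration, not the Pantheon implementation: `runWithEval`, `LoopDeps`, and `Verdict` are hypothetical names, and the worker, critic, logger, and rewriter are passed in as plain functions so the control flow is visible without any SDK calls.

```typescript
// Hypothetical types for illustration; none of these names come from a real library.
type Verdict = { passed: boolean; score: number; reason: string };

interface LoopDeps {
  runWorker: (task: string, systemPrompt: string) => Promise<string>;
  runCritic: (task: string, output: string) => Promise<Verdict>;
  logFeedback: (record: { task: string; output: string; verdict: Verdict }) => Promise<void>;
  // Optional: returns a patched system prompt to use on the next run.
  rewritePrompt?: (systemPrompt: string, verdict: Verdict) => Promise<string>;
}

async function runWithEval(task: string, systemPrompt: string, deps: LoopDeps) {
  // 1. The worker executes the task.
  const output = await deps.runWorker(task, systemPrompt);
  // 2. The critic scores the output against the original intent.
  const verdict = await deps.runCritic(task, output);
  // 3. Every evaluation is logged, whether it passed or failed.
  await deps.logFeedback({ task, output, verdict });
  // 4. On failure, optionally ask the rewrite agent for a patched prompt.
  let nextPrompt = systemPrompt;
  if (!verdict.passed && deps.rewritePrompt) {
    nextPrompt = await deps.rewritePrompt(systemPrompt, verdict);
  }
  return { output, verdict, nextPrompt };
}
```

Injecting the four steps as functions keeps the wrapper testable with stubs; in a real system each one would wrap a Claude call or a database write.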

Core Architecture: Orchestrator, Worker, Critic

Split your agent graph into three roles. The Orchestrator decomposes the goal and routes subtasks. Workers execute discrete tasks using Claude tool-use calls. The Critic is a separate Claude call that receives the worker's output plus the original intent and returns a JSON score object with a pass/fail flag and a reason string.
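Because the Critic's verdict arrives as model-generated text, it is worth validating before the Orchestrator acts on it. The sketch below assumes the Critic returns a JSON object with `pass`, `score`, and `reason` fields; those field names and the `parseCriticVerdict` helper are illustrative assumptions, not a fixed contract.

```typescript
// Assumed shape of the Critic's JSON reply (field names are hypothetical).
interface CriticVerdict {
  pass: boolean;
  score: number;
  reason: string;
}

// Parses and validates the Critic's raw text; throws on anything malformed
// so a bad reply is never silently treated as a pass.
function parseCriticVerdict(raw: string): CriticVerdict {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    throw new Error("Critic did not return valid JSON");
  }
  if (typeof data !== "object" || data === null) {
    throw new Error("Critic reply is not a JSON object");
  }
  const v = data as Record<string, unknown>;
  if (typeof v.pass !== "boolean") throw new Error("Missing boolean 'pass'");
  if (typeof v.score !== "number" || !Number.isFinite(v.score)) {
    throw new Error("Missing numeric 'score'");
  }
  if (typeof v.reason !== "string" || v.reason.trim() === "") {
    throw new Error("Missing non-empty 'reason'");
  }
  return { pass: v.pass, score: v.score, reason: v.reason };
}
```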

Keep the Critic prompt stateless and deterministic. Give it a fixed rubric so scores are comparable across runs. The Orchestrator reads the score and decides whether to mark the task complete, retry with a modified prompt, or escalate to a human review queue.
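The Orchestrator's three-way decision can be written as a small pure function. This is a sketch under assumptions: the `Decision` type, the `decideNext` name, and the default of two retries are all hypothetical choices, not values taken from the Pantheon repo.

```typescript
type Decision =
  | { action: "complete" }
  | { action: "retry"; attempt: number }
  | { action: "escalate"; reason: string };

// `attempt` is the number of retries already made (0 on the first run).
// `maxRetries` is an assumed default; tune it per task type.
function decideNext(
  verdict: { passed: boolean; reason: string },
  attempt: number,
  maxRetries = 2
): Decision {
  if (verdict.passed) return { action: "complete" };
  if (attempt < maxRetries) return { action: "retry", attempt: attempt + 1 };
  // Retries exhausted: hand the task to the human review queue.
  return { action: "escalate", reason: verdict.reason };
}
```

Keeping this logic free of I/O means the routing policy can be unit-tested and changed without touching the agent calls.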

Storing Feedback in Supabase

Every critic evaluation gets written to an agent_runs table. Querying this table later lets you find which prompt variants produced the highest pass rates, which task types fail most often, and what time-of-day patterns exist in failures.

Here is the minimal table schema and a TypeScript insert you can drop into a Next.js API route:

-- Supabase migration
create table agent_runs (
  id          uuid primary key default gen_random_uuid(),
  created_at  timestamptz default now(),
  task_type   text not null,
  prompt_hash text not null,
  passed      boolean not null,
  score       numeric(4,2),
  reason      text,
  raw_output  jsonb
);

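// lib/agentRuns.ts (server-only helper for the Next.js App Router)
// Import it from a route handler such as app/api/agent/route.ts. Route files
// themselves may only export HTTP method handlers and route config.
import { createClient } from '@supabase/supabase-js';

// A service-role key bypasses row-level security, so keep this module out of client bundles.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

// Writes one critic evaluation to agent_runs; throws if the insert fails.
export async function logRun(run: {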
  taskType: string;
  promptHash: string;
  passed: boolean;
  score: number;
  reason: string;
  rawOutput: object;
}) {
  const { error } = await supabase.from('agent_runs').insert({
    task_type:   run.taskType,
    prompt_hash: run.promptHash,
    passed:      run.passed,
    score:       run.score,
    reason:      run.reason,
    raw_output:  run.rawOutput,
  });
  if (error) throw new Error(`Supabase insert failed: ${error.message}`);
}
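As an example of the queries mentioned above, the function below computes pass rates per prompt variant for one task type. It is a sketch that assumes the rows have already been fetched from agent_runs (for instance by selecting task_type, prompt_hash, and passed); `passRateByPrompt` and the two interfaces are hypothetical names.

```typescript
// Subset of agent_runs columns needed for this aggregation.
interface RunRow {
  task_type: string;
  prompt_hash: string;
  passed: boolean;
}

interface PromptStats {
  runs: number;
  passes: number;
  passRate: number; // passes / runs, between 0 and 1
}

// Groups runs of one task type by prompt_hash and computes each variant's pass rate.
function passRateByPrompt(rows: RunRow[], taskType: string): Map<string, PromptStats> {
  const stats = new Map<string, PromptStats>();
  for (const row of rows) {
    if (row.task_type !== taskType) continue;
    const s = stats.get(row.prompt_hash) ?? { runs: 0, passes: 0, passRate: 0 };
    s.runs += 1;
    if (row.passed) s.passes += 1;
    s.passRate = s.passes / s.runs;
    stats.set(row.prompt_hash, s);
  }
  return stats;
}
```

For large tables the same aggregation is better done in SQL with a GROUP BY on prompt_hash; the in-memory version is only meant to show the shape of the result.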

Prompt Mutation: Closing the Loop

Once you have 20+ runs for a given task type, you can run a nightly prompt-optimizer job. Feed Claude the current system prompt and the top 5 failed runs (reason + raw_output), then ask it to produce a revised prompt that addresses the failure pattern. Store the new prompt with an incremented version number and A/B test it against the old one.
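The input for that nightly job might be assembled as follows. This is a sketch with stated assumptions: "top 5" is interpreted here as the five lowest-scoring failures, the instruction wording is illustrative, and `buildOptimizerInput` is a hypothetical helper that returns null when there is not yet enough data.

```typescript
// Columns from agent_runs that the optimizer needs.
interface RunRecord {
  passed: boolean;
  score: number | null;
  reason: string | null;
  raw_output: unknown;
}

// Returns the text to send to the prompt-rewrite call, or null if there are
// fewer than `minRuns` runs or no failures to learn from.
function buildOptimizerInput(
  runs: RunRecord[],
  currentPrompt: string,
  minRuns = 20,
  sampleSize = 5
): string | null {
  if (runs.length < minRuns) return null;
  // Assumption: "top" failures means the lowest-scoring ones.
  const failures = runs
    .filter((r) => !r.passed)
    .sort((a, b) => (a.score ?? 0) - (b.score ?? 0))
    .slice(0, sampleSize);
  if (failures.length === 0) return null;
  const examples = failures
    .map(
      (r, i) =>
        `Failure ${i + 1}\nReason: ${r.reason ?? "(none)"}\nOutput: ${JSON.stringify(r.raw_output)}`
    )
    .join("\n\n");
  return [
    "Current system prompt:",
    currentPrompt,
    "",
    "Failed runs:",
    examples,
    "",
    "Rewrite the system prompt to address the failure pattern. Return only the revised prompt.",
  ].join("\n");
}
```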

Use the prompt_hash column to track which version produced which result. This gives you a reproducible improvement cycle: collect, analyze, mutate, measure.
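One way to produce the prompt_hash value is a SHA-256 digest of the prompt text, so identical prompts always map to the same key. The article does not specify a hashing scheme, so treat this as an assumption; `hashPrompt`, `nextVersion`, and the `PromptVersion` shape are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface PromptVersion {
  version: number;
  hash: string;
  text: string;
}

// Stable content hash: the same prompt text always yields the same 64-character hex string.
function hashPrompt(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Builds the record for a revised prompt with an incremented version number.
function nextVersion(current: PromptVersion, revisedText: string): PromptVersion {
  return {
    version: current.version + 1,
    hash: hashPrompt(revisedText),
    text: revisedText,
  };
}
```

Hashing the content rather than storing only a version number means a run can be tied to the exact prompt text it used, even if version records are later edited.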

Avoiding Common Failure Modes

The two biggest pitfalls are reward hacking and runaway mutation. If your critic rubric is loose, the rewrite agent will find prompt phrasings that score well without actually improving output quality. Write rubric criteria against observable, concrete properties of the output — not vibes.
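To make "observable, concrete properties" tangible, some rubric criteria can be written as plain predicates that run alongside the Critic call. The rubric below is invented for a hypothetical summary task; `Criterion`, `summaryRubric`, and `scoreOutput` are illustrative names, and the specific checks are assumptions rather than recommendations.

```typescript
interface Criterion {
  name: string;
  check: (output: string) => boolean;
}

// Hypothetical rubric for a short product-summary task.
const summaryRubric: Criterion[] = [
  { name: "non-empty", check: (o) => o.trim().length > 0 },
  {
    name: "at most 3 sentences",
    check: (o) => o.split(/[.!?]+\s*/).filter(Boolean).length <= 3,
  },
  { name: "no placeholder text", check: (o) => !/TODO|lorem ipsum/i.test(o) },
];

// Score is the fraction of criteria met; the run passes only if all are met.
function scoreOutput(output: string, rubric: Criterion[]) {
  const failed = rubric.filter((c) => !c.check(output)).map((c) => c.name);
  const score = rubric.length === 0 ? 0 : (rubric.length - failed.length) / rubric.length;
  return { passed: failed.length === 0, score, failed };
}
```

Criteria like these are hard for a rewrite agent to game with phrasing alone, because each one is checked against the output itself rather than against the Critic's impression of it.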

Cap mutation depth. Store a parent_prompt_id foreign key and refuse to apply a rewrite if the chain depth exceeds a threshold (5 is a safe default). This prevents the system from drifting so far from the original intent that outputs become unrecognizable.
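The depth cap can be enforced by walking the parent_prompt_id chain before a rewrite is applied. This sketch assumes prompt records are available in memory as a map keyed by id; `chainDepth`, `canApplyRewrite`, and the `PromptRecord` shape are hypothetical, and the default of 5 follows the threshold suggested above.

```typescript
interface PromptRecord {
  id: string;
  parent_prompt_id: string | null; // null for the original, hand-written prompt
}

// Counts how many rewrites separate a prompt from its original ancestor.
function chainDepth(id: string, byId: Map<string, PromptRecord>): number {
  let depth = 0;
  const seen = new Set<string>();
  let current = byId.get(id);
  while (current && current.parent_prompt_id !== null) {
    // Guard against corrupted lineage data that loops back on itself.
    if (seen.has(current.id)) throw new Error("Cycle in prompt lineage");
    seen.add(current.id);
    depth += 1;
    current = byId.get(current.parent_prompt_id);
  }
  return depth;
}

// A new rewrite would sit one level below its parent; refuse it if that exceeds the cap.
function canApplyRewrite(
  parentId: string,
  byId: Map<string, PromptRecord>,
  maxDepth = 5
): boolean {
  return chainDepth(parentId, byId) + 1 <= maxDepth;
}
```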

Open-Source Implementation

A working reference implementation of the Self-Improving AI Orchestrator Pattern is available in the Pantheon repo at github.com/lewisallena17/pantheon. It ships with the Supabase schema, a Next.js orchestrator API route, pre-built Critic and Worker prompt templates for Claude, and a prompt-version management utility.

Fork it, point it at your own Supabase project, add your Anthropic API key, and you have a running self-improving agent pipeline in under 30 minutes. The repo is MIT-licensed and accepts PRs for new task-type templates.
