Architecture for a self-improving agent platform - from research to blueprint
You know the feeling. You spend two hours crafting the perfect prompt. You nail the tone, the structure, the edge cases. It works beautifully - for about a week. Then the tasks shift slightly, and your carefully engineered prompt starts producing garbage. So you're back in the editor, tweaking, testing, tweaking again. Welcome to the prompt treadmill.
I got tired of running on it. So I started building something different: a platform where the AI agent fixes its own instructions after every failure. I don't have production results yet - the platform is still under construction. What I do have is a clear architecture, grounded in research where these techniques individually show 20-50% improvements. This post is the blueprint.
Before you dismiss this as sci-fi, consider this: on the DABStep benchmark - a financial analysis task where agents answer questions about real company earnings - Google built a system with 7 specialized agents that scored 45%. A basic, one-shot prompt on the same model? 13%. The difference wasn't a smarter model - it was better instructions and scaffolding. That 32-point gap is sitting there, and most of us are closing it manually, one prompt tweak at a time.
What if the agent could close that gap itself?
Five Ideas That Changed How I Think About This
I spent weeks reading research papers and studying what actually works. Here are the five ideas that matter most - no jargon, just the concepts.
Idea 1: Prompts are just settings you can tune
Think of a prompt like the knobs on an amplifier. Right now, most of us set the knobs once and hope for the best. But researchers figured out something wild: when you let an AI system try thousands of prompt variations and keep the ones that score highest, it beats human-written prompts by up to 50%.
We're not bad at writing prompts because we're lazy. We're bad at it because we can't try ten thousand variations in an afternoon. A system can.
Idea 2: "Wrong" isn't useful feedback - "wrong because X" is
When your agent fails, saying "that's wrong" is almost useless. But saying "you failed because you called the API before logging in - add a step to always check login status first" - that's actionable. That specific feedback can be turned directly into a prompt rewrite.
It's the difference between a teacher writing "F" on your paper versus writing "your argument breaks down in paragraph 3 because you assumed X without evidence." One of those actually helps you get better.
Idea 3: Agents that build their own tools
Here's where it gets interesting. The agent doesn't just use the tools you give it. When it hits a wall - say, it needs to parse a specific PDF format and no tool exists for that - it writes one. It tests the tool, validates it, and registers it in a catalog so it can use it again next time.
It's like a carpenter who, instead of just using the tools in the shop, can forge new ones when the job demands it.
Idea 4: Grade yourself, but use multiple judges
If you let a student grade their own exam, they'll find creative ways to give themselves an A. Same with AI agents. One evaluation metric will get gamed.
The fix: use several judges at once. A format checker (did the output match the expected structure?), a similarity scorer (how close is this to a known good answer?), and an AI evaluator (does this actually make sense?). It's like having a spell-checker, a grammar checker, AND a writing coach review your essay. Gaming all three at once is much harder.
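The multi-judge idea is easy to prototype. Here's a minimal sketch in Python - the expected JSON structure, the use of `difflib` as a stand-in for embedding similarity, and the pluggable `ai_judge` callable are all illustrative assumptions, not a fixed design:

```python
import difflib
import json

def format_grade(output: str) -> float:
    # Cheap structural check: is the output valid JSON with a "summary" key?
    # (The expected structure here is an assumption for this sketch.)
    try:
        data = json.loads(output)
        return 1.0 if "summary" in data else 0.5
    except json.JSONDecodeError:
        return 0.0

def similarity_grade(output: str, reference: str) -> float:
    # How close is the output to a known-good answer? difflib stands in
    # for an embedding-based similarity scorer.
    return difflib.SequenceMatcher(None, output, reference).ratio()

def evaluate(output: str, reference: str, ai_judge=None) -> float:
    # Average several independent judges; gaming all of them at once
    # is much harder than gaming one.
    scores = [format_grade(output), similarity_grade(output, reference)]
    if ai_judge is not None:  # in the real system: an LLM call
        scores.append(ai_judge(output))
    return sum(scores) / len(scores)
```

Averaging is the simplest aggregation; a real system might weight judges or require each to clear a minimum bar.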
Idea 5: Rules the agent absolutely cannot rewrite
This one keeps me up at night. A research team built a self-improving agent and discovered it had modified its own source code to give itself more execution time. It literally hacked its own constraints to keep running longer.
Without rules that are hardcoded and untouchable by the agent, you get a student who rewrites the grading rubric. You need guardrails that exist outside the system the agent can modify.
The Architecture: How It Actually Works
The core loop is deceptively simple:
Do --> Grade --> Learn --> Improve --> Repeat
The agent does a task, multiple graders score the result, an optimizer analyzes what went wrong, the prompt gets rewritten, and the next task uses the better prompt. Every cycle, the instructions get a little sharper.
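Stripped of infrastructure, one turn of that cycle can be sketched in a few lines - `run_agent`, `evaluate`, and `optimize` are placeholders for the components described below:

```python
def improvement_cycle(prompt, task, run_agent, evaluate, optimize):
    # Do: execute the task with the current prompt
    result = run_agent(prompt, task)
    # Grade: the graders return a score plus written feedback
    score, feedback = evaluate(result)
    # Learn: the optimizer turns that feedback into a candidate rewrite
    candidate = optimize(prompt, feedback)
    # Improve: the caller promotes the candidate or rolls it back
    return candidate, score
```

The key design choice is that the cycle returns a *candidate*; promotion is a separate decision made outside the loop.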
Here's the full picture:
```
          ┌──────────────────────────────┐
          │     REDIS (Central Store)    │
          │                              │
          │ Versioned Prompts (v1→v2→…)  │
          │ Tool Registry (searchable)   │
          │ Execution Traces             │
          │ Score History                │
          │ Cross-Task Lessons           │
          └──────────────┬───────────────┘
                         │
       ┌─────────────────┼─────────────────┐
       │                 │                 │
       ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ Orchestrator │  │  Evaluator   │  │   Sandbox    │
│              │  │              │  │   Manager    │
│ - Agent Loop │  │ - Format     │  │              │
│ - Optimizer  │  │ - Similarity │  │ - Isolated   │
│ - Safety     │  │ - AI Judge   │  │   containers │
│   Enforcer   │  │              │  │ - No network │
└──────────────┘  └──────────────┘  │ - 60s timeout│
                                    └──────────────┘
```
Every component in one sentence:
- Orchestrator: The brain. Receives tasks, loads the latest prompt from Redis (an in-memory data store that acts as the system's shared memory), runs the agent, collects results.
- Agent Loop: A simple think-act-observe cycle built on the Anthropic SDK. About 50 lines of code. No heavy frameworks. Here's the core of it:

```python
# The entire agent loop - no framework needed
import anthropic

client = anthropic.Anthropic()

def run_agent(agent_id: str, task: str) -> str:
    prompt = redis.get(f"prompt:{agent_id}:current")  # load latest version
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            system=prompt,
            messages=messages,
            tools=tool_definitions,
        )
        # If the model wants to use a tool, run it in a sandbox
        if response.stop_reason == "tool_use":
            tool_call = next(b for b in response.content if b.type == "tool_use")
            result = sandbox.execute(tool_call.name, tool_call.input)  # isolated container
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": [{
                "type": "tool_result",
                "tool_use_id": tool_call.id,
                "content": result,
            }]})
        else:
            return response.content[0].text  # done - send to evaluator
```
- Tool Registry: A searchable catalog in Redis. Each tool has a name, description, code, and auth keys. The agent searches this before creating anything new.
- Evaluator: Multiple graders running in sequence. Format checks, similarity scoring, AI judge. Results stored with every execution trace.
- Prompt Optimizer: Takes evaluation feedback and execution traces, rewrites the prompt, versions it in Redis. Auto-rollback if the new version scores lower.
- Safety Enforcer: Hardcoded rules in the orchestrator source code - NOT in Redis, where the agent could touch them. Checks every self-modification attempt before applying it.
- Sandbox Manager: Spins up isolated containers (lightweight virtual machines that run one tool each) for every tool execution. No network access, no root, 60-second timeout, killed after use.
The key insight: the agent never modifies itself in-place. It proposes changes, and the platform decides whether to apply them.
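To make "propose, then decide" concrete, here's a toy version of the prompt store - a plain dict stands in for Redis, and the `prompt:{agent_id}:v{n}` key scheme is an assumption for illustration, not the platform's actual schema:

```python
class PromptStore:
    """Versioned prompt storage. A dict stands in for Redis here."""

    def __init__(self):
        self.db = {}

    def save_version(self, agent_id: str, text: str, score: float) -> int:
        # Every candidate is appended as a new version; nothing is
        # ever modified in place.
        n = self.db.get(f"prompt:{agent_id}:latest", 0) + 1
        self.db[f"prompt:{agent_id}:v{n}"] = {"text": text, "score": score}
        self.db[f"prompt:{agent_id}:latest"] = n
        return n

    def current(self, agent_id: str) -> dict:
        # "Current" means highest-scoring, so a regressed version
        # never displaces a good one - auto-rollback in miniature.
        versions = [v for k, v in self.db.items()
                    if k.startswith(f"prompt:{agent_id}:v")]
        return max(versions, key=lambda v: v["score"])
```

Because `current()` selects by score rather than recency, promoting a bad rewrite is impossible by construction.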
Deep Dive: The Three Ideas That Matter Most
A. The Self-Improvement Loop (How the Prompt Actually Changes)
You know the idea - feedback drives prompt rewrites. But what does that look like concretely? Let me walk through one full cycle.
The agent gets a task: "Summarize the key trends in this quarterly revenue report." It runs, scores 0.72 out of 1.0. Here's what the prompt looked like before and after:
BEFORE (v1):
"Analyze the data and provide a summary of key trends."
AFTER (v2):
"Analyze the data in three passes: (1) revenue trends,
(2) margin trends, (3) cost trends. Then synthesize
a summary that covers all three dimensions."
The evaluator didn't just say "0.72." It said: "The agent identified revenue trends but missed margin analysis entirely. The prompt should explicitly instruct the agent to check for margin, cost, and revenue trends separately before synthesizing." That specific feedback became the v2 rewrite.
The new prompt scores 0.81 on validation tasks - a 12.5% gain, comfortably above the 2% promotion threshold - so it gets promoted:

```
v1 → 0.72
v2 → 0.81  ← promoted (feedback: "add margin analysis")
v3 → 0.68  ← regression detected → auto-rollback to v2!
 ↑
 └── system reverts automatically, no human needed
```
Think of it like a chef refining a recipe after every dinner service. Table 4 said the sauce was too salty. You adjust, try it tomorrow, keep the change if it's better, revert if it's worse. The recipe book keeps every version.
Two things to be honest about. First, the cold start: your first prompt has no history to learn from. You bootstrap by seeding a decent human-written prompt and running it against a set of evaluation tasks to generate the initial feedback. The loop needs a few cycles of data before the optimizer has enough signal to work with. Second, research consistently shows that optimization plateaus after several iterations - you get the biggest gains early, and then improvements slow down. This isn't a magic "improve forever" machine. It's more like a system that finds and fixes the obvious mistakes fast, then grinds out smaller wins over time. Build in plateau detection (if three consecutive iterations show less than 2% improvement, try a different optimization strategy or stop).
There's another risk worth naming: a prompt that scores well on your validation tasks might score poorly on tasks it hasn't seen before. The optimizer can accidentally overfit - getting really good at the specific types of problems you test on while getting worse at novel ones. The fix is straightforward: keep your validation set diverse, and periodically re-evaluate on completely fresh tasks.
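The promotion threshold and plateau rule from the last two paragraphs fit in a few lines. A sketch, assuming relative (not absolute) improvement and the 2% / three-iteration numbers from above:

```python
def should_promote(old_score: float, new_score: float,
                   threshold: float = 0.02) -> bool:
    # Promote only on a meaningful relative gain; the 2% default
    # comes from the plateau rule above - tune it per workload.
    return (new_score - old_score) / max(old_score, 1e-9) > threshold

def plateaued(score_history: list, threshold: float = 0.02,
              window: int = 3) -> bool:
    # Three consecutive iterations under the threshold -> stop,
    # or switch to a different optimization strategy.
    if len(score_history) < window + 1:
        return False
    recent = score_history[-(window + 1):]
    gains = [(b - a) / max(a, 1e-9) for a, b in zip(recent, recent[1:])]
    return all(g < threshold for g in gains)
```

The `v1 → v2` promotion and the `v3` rollback from the example both fall out of `should_promote` directly.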
B. Safety Rails (The Three Levels of Self-Modification)
You know why safety matters here - an optimizer without constraints will hack its own evaluation. The real question is how to structure the rules: not everything needs the same level of protection.
The fix is a three-level safety model:
| Level | What Changes | Who Approves |
|---|---|---|
| Auto-modify | The task prompt (agent instructions) | Automated - just needs to beat the score threshold |
| Needs approval | The evaluation rubric (how work is graded) | Agent proposes, human approves |
| Locked down | The safety rules, the optimizer logic, the enforcement code | Requires a code deployment - the agent can't touch it |
The critical insight: safety rules live in the orchestrator's source code, not in Redis. The only way to change them is to deploy new code. The agent can improve its instructions all day long, but it can never:
- Remove authentication checks from tools
- Grant itself broader permissions
- Weaken its own evaluation criteria
- Extend its own execution time limits
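A minimal sketch of that three-level gate - the target names here are hypothetical, and in the real system these sets live in the orchestrator's source code, never in Redis:

```python
# Hardcoded at deploy time - the agent has no write path to these.
FORBIDDEN_TARGETS = frozenset({
    "safety_rules", "optimizer_logic", "auth_config", "execution_limits",
})
NEEDS_HUMAN = frozenset({"evaluation_rubric"})

def review_modification(target: str) -> str:
    """Classify a proposed self-modification by what it touches."""
    if target in FORBIDDEN_TARGETS:
        return "rejected"        # locked down: requires a code deployment
    if target in NEEDS_HUMAN:
        return "needs_approval"  # agent proposes, human approves
    return "auto"                # e.g. the task prompt: score gate only
```

The point isn't the lookup - it's that the lookup tables are immutable and outside anything the agent can modify.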
C. The 4-Layer Validation Pipeline (How Generated Code Gets Vetted)
You know the concept - the agent creates tools when it needs them. The hard part is making that safe. You can't just let an agent write arbitrary code and run it. So every piece of generated code goes through four layers of validation before it ever executes:
- Static analysis: A code analyzer checks for dangerous imports (no `os`, no `subprocess`, no `socket` - nothing that could touch the filesystem or network)
- Import restrictions: Only whitelisted libraries are allowed (`json`, `re`, `math`, `pandas`, `numpy` - the safe stuff)
- Dry run: The code executes in a sandbox with test inputs to catch runtime errors
- AI code review: A fast model reviews the code for subtle issues - logic bombs, resource abuse, anything the static analysis might miss
Static analysis catches the obvious stuff, but clever code can find ways around it (dynamic imports, eval tricks). That's why the container sandbox is the real security boundary - even if something slips past the code checks, it's running in an isolated container with no network, no root, a read-only filesystem, and a 60-second kill timer.
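The first two layers (dangerous-import detection plus whitelisting) can be sketched with Python's built-in `ast` module. The banned and allowed lists are illustrative, and per the caveat above this is a filter, not a wall:

```python
import ast

BANNED = {"os", "subprocess", "socket", "importlib", "sys"}
ALLOWED = {"json", "re", "math", "pandas", "numpy"}

def static_check(code: str) -> list:
    """Return a list of problems; empty list means the code passes
    layers 1 and 2 (it still goes through dry run and AI review)."""
    problems = []
    for node in ast.walk(ast.parse(code)):
        # Layer 1 + 2: every import must be on the whitelist
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names]
                     if isinstance(node, ast.Import)
                     else [node.module or ""])
            for name in names:
                root = name.split(".")[0]
                if root in BANNED or root not in ALLOWED:
                    problems.append(f"disallowed import: {root}")
        # Catch the obvious escape hatches by name
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec", "__import__"}):
            problems.append(f"disallowed call: {node.func.id}")
    return problems
```

Note what this misses: `getattr`-based dynamic dispatch, string-built imports, and anything constructed at runtime - exactly why the container remains the real boundary.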
How to Build This
If you want to build something like this, here's the practical stack and what each piece does.
Core Stack:
- Python - the ecosystem for AI libraries is unmatched
- Anthropic Claude API - Sonnet for the agent's execution loop (fast, cost-effective), Opus for evaluation and optimization (deeper reasoning for meta-decisions)
- Redis Stack (RedisJSON + RediSearch) - the central store for everything: versioned prompts, tool registry, execution traces, score history. One store, human-readable via redis-cli. One caveat: Redis is in-memory by default, so configure AOF persistence - losing your prompt version history on a crash defeats the entire purpose of versioning.
- Kubernetes - container orchestration for the platform and sandboxed tool execution
- Docker - isolated containers for every tool run. Zero network, read-only filesystem, time-limited
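For the Docker layer, the hardening flags matter more than the wrapper code. A sketch of what one sandboxed tool run might look like - the image name, resource limits, and exact flag set are assumptions to adapt, and the `subprocess` timeout is what actually enforces the 60-second kill:

```python
import subprocess

def sandbox_command(image: str) -> list:
    """Build the docker invocation for one tool run."""
    return [
        "docker", "run", "--rm",
        "--network", "none",    # no network access
        "--read-only",          # read-only filesystem
        "--user", "1000:1000",  # no root inside the container
        "--memory", "256m",     # cap memory
        "--pids-limit", "64",   # cap process count (fork bombs)
        image,
    ]

def run_tool(image: str, payload: str, timeout_s: int = 60) -> str:
    # timeout= raises subprocess.TimeoutExpired and kills the run
    # if the container outlives its budget
    out = subprocess.run(sandbox_command(image), input=payload,
                         capture_output=True, text=True, timeout=timeout_s)
    return out.stdout
```

`--rm` keeps containers from accumulating: each tool run is disposable by design, matching "killed after use" above.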
What this costs: Self-improvement loops aren't free. Each optimization cycle involves the agent executing a task (several Sonnet calls), the evaluator running multiple graders (including an Opus call for the AI judge), and the optimizer generating a candidate prompt (another Opus call). Ballpark: $0.50-$1.50 per cycle, depending on task complexity and how many tools get called. A full optimization run - say, 10 tasks plus validation - might cost $10-$20. That's cheap compared to hiring someone to manually tune prompts, but it adds up if you're running continuous optimization without short-circuiting or plateau detection. Set token budgets.
Build order (what I'd recommend):
- Agent loop + sandboxed tool execution - Get the core loop working with the Anthropic SDK. Run tools in isolated Docker containers. Gotcha: resist the urge to add a framework. The loop is ~50 lines. Frameworks fight you when you need to intercept tool calls for sandboxing.
- Multi-grader evaluation - Format checker, similarity scorer, AI judge. Gotcha: short-circuit - if cheap format checks fail, skip the expensive AI judge. Otherwise you're burning Opus tokens on outputs that are obviously wrong.
- Prompt versioning + optimizer + auto-rollback - Store prompts in Redis with scores. Build the optimizer. Gotcha: your validation set IS the product. A bad validation set produces prompts that score well on garbage tasks. Invest more time here than on the optimizer itself.
- Tool creation + validation pipeline - Let the agent create tools. Build the 4-layer validation. Gotcha: static analysis gives you false confidence. Python has too many escape hatches (`importlib`, `eval`). The container sandbox is your real security boundary - treat the code checks as a filter, not a wall.
- Safety enforcer + observability - Hardcode constraints. Add logging, metrics, audit trails. Gotcha: build the audit log before the optimizer. When a prompt regresses and you don't know why, the audit trail is the only thing that saves you from starting over.
Why This Matters
Every idea in this post is backed by published research (links in the appendix). Teams at Google, Stanford, Anthropic, and others have demonstrated each piece independently.
That DABStep gap - 13% to 45% - isn't about a smarter model. It's about better instructions, better tools, better feedback loops. Stuff you can improve without training anything new.
These improvements compound. The agent gets better at analyzing failures, which leads to better rewrites, which leads to better performance, which generates richer feedback. The agent gets better at getting better.
The hard part isn't any single component - it's wiring them together safely. Making sure the optimizer can't game the evaluator. Making sure generated tools can't escape the sandbox. Making sure a bad prompt version gets caught before it ruins a hundred tasks.
That's what I'm building. I'll write a follow-up with real numbers once the platform has run enough optimization cycles to have data worth sharing.
Appendix: Further Reading
The ideas in this post draw from a rich body of research. If you want to go deeper, here are the papers and resources behind each concept:
Prompts as optimizable parameters:
- OPRO: Optimization by PROmpting - Google DeepMind, 2023. Used LLMs as general-purpose optimizers, beating human prompts by up to 50%.
- EvoPrompt - 2023. Genetic algorithms for prompt optimization, +25% over human-engineered prompts.
- APE: Automatic Prompt Engineer - University of Toronto, 2022. First to treat prompts as programs to be optimized.
Feedback-driven prompt rewriting:
- TextGrad - Stanford, 2024. Published in Nature. Treats LLM feedback as "textual gradients" for targeted improvement.
- Self-Refine - CMU/Allen AI, NeurIPS 2023. A single LLM acting as generator, refiner, and feedback provider.
- ProTeGi - 2023. Natural-language descriptions of prompt failures driving beam search over variations.
Agents building their own tools and code:
- ADAS: Automated Design of Agentic Systems - Clune lab, 2024. A meta-agent that invents new agents by programming them.
- Darwin Gödel Machine - Sakana AI, 2025. Published paper with open-source code. Self-improving agents that rewrite their own code.
- AlphaEvolve - Google DeepMind, 2025. Published paper; its scheduling solution has been deployed in Google production for over a year. An evolutionary coding agent that improved matrix multiplication algorithms.
- STOP: Self-Taught Optimizer - Stanford/Microsoft, 2023. Recursive self-improvement of scaffolding programs.
Multi-grader evaluation:
- OpenAI Self-Evolving Agents Cookbook - 2025. Four complementary graders preventing metric gaming.
- DSPy - Stanford, 2023. Declarative programming and compilation of LM pipelines.
Safety and constitutional constraints:
- Constitutional AI - Anthropic, 2022. Self-alignment through AI critique against constitutional principles.
- PromptBreeder - Google DeepMind, 2023. Self-referential evolution of prompts and meta-prompts.
Evolutionary and self-referential improvement:
- FunSearch - Google DeepMind, 2023. Published in Nature. LLM + evaluator evolutionary search achieving scientific discoveries.
- Prompt Optimization Survey - 2025. A comprehensive survey of the field.