AgentRL Overview

AgentRL — a self-evolving orchestration layer for Cersei. Agents that trace their own failures, propose fixes in sandboxes, and register the winners as reusable, programmable tools.

AgentRL

AgentRL turns Cersei from a single-shot coding agent into a self-evolving one. Give it a task. If the agent solves it, you're done. If it fails, AgentRL traces why, asks a planner for directed fixes, runs those fixes as throwaway sub-agents in isolated sandboxes, promotes the one that actually passes verification, and registers it as a reusable tool. The next time a similar problem shows up, the agent finds that tool and skips straight to the solution.

Flagship feature. AgentRL is the orchestration layer that ties together the rest of Cersei — the agent runtime, the sandbox system (cersei-vms), memory, embeddings, and a small programmable language (AgentTemplate). It ships as two crates: cersei-agentrl (the loop) and cersei-agentlang (the DSL).

The idea in one loop

            ┌────────────────────────────────────────────────────────────┐
  task ───▶ │  registry.search(task)  ──▶  GeneralAgent (+ cached tools) │
            └────────────────────────────────────────────────────────────┘
                                   │
                    success?  ─────┴───── no ───▶  ExecutionGraph.failure_trace()
                       │                                   │  (directionality)
                      yes                                  ▼
                       │                          PlannerAgent → N proposals
                       ▼                                   │
                  ✅ done                    fan out into isolated sandboxes
                                                           │
                                          verify each ─── winner ──▶ promote
                                                           │
                                          register winner as a reusable Tool
                                          + record the problem in memory
                                                           │
                                                   ✅ done (ByNewTool)
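
The control flow in the diagram can be sketched in plain Rust. This is a simplified, synchronous model and not the cersei API: `Registry`, `solve`, the `Unsolved` variant, and the three function parameters are hypothetical stand-ins. Proposals are tried one after another instead of being fanned out in parallel, and `search` matches on exact task text rather than similarity.

```rust
// Hypothetical model of the AgentRL loop. Names mirror the diagram only.

#[derive(Debug, PartialEq)]
enum Solved {
    Directly,
    ByCachedTool(usize),
    ByNewTool(usize),
    Unsolved,
}

/// Stand-in registry: one entry per solved problem, index = tool id.
struct Registry {
    problems: Vec<String>,
}

impl Registry {
    fn search(&self, task: &str) -> Option<usize> {
        self.problems.iter().position(|p| p == task)
    }

    fn register(&mut self, task: &str) -> usize {
        self.problems.push(task.to_string());
        self.problems.len() - 1
    }
}

fn solve(
    task: &str,
    registry: &mut Registry,
    general_agent: impl Fn(&str, Option<usize>) -> Result<(), String>,
    planner: impl Fn(&str) -> Vec<String>,
    passes_in_sandbox: impl Fn(&str) -> bool,
) -> Solved {
    // Lookup before build: hand any cached tool to the general agent.
    let cached = registry.search(task);
    let trace = match general_agent(task, cached) {
        Ok(()) => {
            return match cached {
                Some(id) => Solved::ByCachedTool(id),
                None => Solved::Directly,
            }
        }
        Err(trace) => trace, // the failure trace gives the planner its direction
    };

    // Planner proposals are verified one by one; the first to pass wins.
    match planner(&trace).into_iter().find(|p| passes_in_sandbox(p.as_str())) {
        Some(_winner) => Solved::ByNewTool(registry.register(task)),
        None => Solved::Unsolved,
    }
}

// Toy collaborators: the agent only succeeds when a cached tool exists.
fn demo_agent(_task: &str, cached: Option<usize>) -> Result<(), String> {
    cached.map(|_| ()).ok_or_else(|| "verifier rejected gcd.py".to_string())
}

fn demo_planner(trace: &str) -> Vec<String> {
    vec![format!("retry unchanged ({trace})"), format!("fix: {trace}")]
}

fn demo_sandbox(proposal: &str) -> bool {
    proposal.starts_with("fix:")
}

fn main() {
    let mut registry = Registry { problems: Vec::new() };
    let first = solve("write gcd.py", &mut registry, demo_agent, demo_planner, demo_sandbox);
    let second = solve("write gcd.py", &mut registry, demo_agent, demo_planner, demo_sandbox);
    println!("{first:?} then {second:?}");
}
```

With these toy collaborators the first call takes the recovery path and registers a tool, and the second call is answered from the registry.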

Every box is a real, swappable component:

  • ExecutionGraph — a DAG of turns and tool calls, built passively from the agent's event stream by a Reporter. On failure it distills a FailureTrace that tells the planner exactly what to fix.
  • GeneralAgent / PlannerAgent — not new types; ordinary Agents configured with different prompts and toolsets.
  • Sandboxes — proposals run in isolated working directories (or cersei-vms sandboxes) so parallel attempts never trample each other.
  • ToolRegistry — a local, persisted, searchable database of agent-built tools. Lookup-before-build; register-on-win.
  • Verifier — an independent check (cargo test, a script, anything) that decides "did this actually work?" The agent can't game it.
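
To make the "independent check" idea concrete, here is a minimal sketch of a command-based verifier built only on the standard library. The `Check` trait and `ShellCheck` type are hypothetical names chosen for illustration; the real `Verifier` trait in cersei may have a different shape (it is likely async). The sketch assumes a POSIX `sh` is on the path.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical trait for illustration; not the cersei `Verifier` trait.
trait Check {
    fn passes(&self, workdir: &Path) -> bool;
}

/// Runs a shell command in `workdir`; exit status 0 means "it worked".
struct ShellCheck {
    command: String,
}

impl Check for ShellCheck {
    fn passes(&self, workdir: &Path) -> bool {
        Command::new("sh")
            .arg("-c")
            .arg(&self.command)
            .current_dir(workdir)
            .status()
            .map(|status| status.success())
            .unwrap_or(false) // a command that cannot start counts as a failure
    }
}

fn main() {
    let check = ShellCheck { command: "test 12 -eq 12".to_string() };
    println!("verified = {}", check.passes(Path::new(".")));
}
```

Because the verdict comes from the process exit status rather than from anything the agent reports, the agent has no channel through which to claim success.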

Why it matters

A normal agent re-derives the same fix every time it hits a recurring problem class. AgentRL remembers solutions as executable tools, so it gets cheaper and more capable over time:

Self-improving

Solved problems become DynamicTools in a registry. Similar future tasks are solved by recall, not re-derivation: no planner, no sandboxes, and no LLM spend beyond the general agent's own run.
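
As a rough illustration of lookup-before-build, the sketch below scores stored problem descriptions against a new task and returns the best match above a threshold. It assumes token-overlap (Jaccard) similarity as a dependency-free stand-in for the embedding search the real registry presumably uses. `ToolEntry`, `search`, and the threshold value are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical registry row: a tool id plus the problem it solved.
struct ToolEntry {
    id: usize,
    problem: String,
}

/// Lowercased alphanumeric words of `text`.
fn tokens(text: &str) -> HashSet<String> {
    text.to_lowercase()
        .split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(str::to_string)
        .collect()
}

/// Jaccard similarity: shared words divided by all distinct words.
fn jaccard(a: &HashSet<String>, b: &HashSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

/// Returns the id of the most similar entry scoring at least `threshold`.
fn search(entries: &[ToolEntry], task: &str, threshold: f64) -> Option<usize> {
    let query = tokens(task);
    entries
        .iter()
        .map(|e| (e.id, jaccard(&query, &tokens(&e.problem))))
        .filter(|(_, score)| *score >= threshold)
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(id, _)| id)
}

fn main() {
    let entries = vec![
        ToolEntry { id: 7, problem: "compute gcd of two integers in python".into() },
        ToolEntry { id: 8, problem: "parse a csv file".into() },
    ];
    match search(&entries, "python script to compute the gcd of two ints", 0.3) {
        Some(id) => println!("recall tool {id}"),
        None => println!("no match: fall back to the planner"),
    }
}
```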

Safe by construction

Recovery attempts run in isolated sandboxes. Each attempt is untrusted, parallel, and disposable; only the verified winner is promoted to your real working directory.
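
The promote-only-the-winner rule can be sketched with plain directories. This is an illustration under simplifying assumptions, not how cersei-vms works: each proposal is reduced to the contents of a single file, attempts run sequentially rather than in parallel, and `promote_winner` is a hypothetical helper.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Try each proposal in its own scratch directory and copy only the first
/// verified result into `workdir`. Returns the winning proposal's index.
fn promote_winner(
    workdir: &Path,
    scratch_root: &Path,
    file_name: &str,
    proposals: &[&str], // simplification: each proposal is one file's contents
    verify: impl Fn(&Path) -> bool,
) -> io::Result<Option<usize>> {
    for (index, contents) in proposals.iter().enumerate() {
        let sandbox = scratch_root.join(format!("proposal-{index}"));
        fs::create_dir_all(&sandbox)?;
        fs::write(sandbox.join(file_name), contents.as_bytes())?;

        let verified = verify(&sandbox);
        if verified {
            // Only a verified result ever touches the real working directory.
            fs::create_dir_all(workdir)?;
            fs::copy(sandbox.join(file_name), workdir.join(file_name))?;
        }
        fs::remove_dir_all(&sandbox)?; // sandboxes are disposable
        if verified {
            return Ok(Some(index));
        }
    }
    Ok(None)
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join(format!("agentrl-sketch-{}", std::process::id()));
    let workdir = root.join("work");
    let is_12 = |dir: &Path| {
        fs::read_to_string(dir.join("answer.txt")).map_or(false, |s| s.trim() == "12")
    };
    let winner = promote_winner(&workdir, &root.join("scratch"), "answer.txt", &["4", "12"], is_12)?;
    println!("winner = {winner:?}");
    fs::remove_dir_all(&root)
}
```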

Directed recovery

Failures aren't retried blindly. The ExecutionGraph extracts a scrubbed, ordered failure trace that gives each proposal concrete directionality.
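
A minimal sketch of what a scrubbed, ordered trace could look like follows. It models the event stream as a flat list rather than a DAG, and it assumes "scrubbed" means redacting secret-looking `NAME=value` pairs. `Event`, `scrub`, and `failure_trace` are hypothetical names, not the ExecutionGraph API.

```rust
/// Hypothetical record of one tool call observed during a run.
struct Event<'a> {
    turn: u32,
    tool: &'a str,
    ok: bool,
    output: &'a str,
}

/// Redact values of variables whose names end in KEY or TOKEN.
fn scrub(line: &str) -> String {
    line.split_whitespace()
        .map(|word| match word.split_once('=') {
            Some((name, _)) if name.ends_with("KEY") || name.ends_with("TOKEN") => {
                format!("{name}=<redacted>")
            }
            _ => word.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Keep only failed calls, in turn order, with scrubbed output.
fn failure_trace(events: &[Event]) -> Vec<String> {
    let mut failed: Vec<&Event> = events.iter().filter(|e| !e.ok).collect();
    failed.sort_by_key(|e| e.turn);
    failed
        .iter()
        .map(|e| format!("turn {}: {} failed: {}", e.turn, e.tool, scrub(e.output)))
        .collect()
}

fn main() {
    let events = [
        Event { turn: 3, tool: "bash", ok: false, output: "exit 1 GEMINI_API_KEY=abc123" },
        Event { turn: 1, tool: "write", ok: true, output: "wrote gcd.py" },
        Event { turn: 2, tool: "edit", ok: false, output: "no such file gcd.py" },
    ];
    for line in failure_trace(&events) {
        println!("{line}");
    }
}
```

A planner prompt built from these lines sees what broke and in which order, without the successful steps or any credentials.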

Programmable

Sub-agents can be authored in AgentTemplate — a tiny functional language (io.read().write(), agent.send(), agent.tools.register()) that LLMs emit and the runtime executes safely.

Programmable end to end

AgentRL is built to be driven, not just configured. Two seams make it fully programmable:

  1. The AgentRlRunner trait is the mechanism — how a general agent runs, how proposals are generated, how each runs in a sandbox. Use the batteries-included CerseiRunner (real Agent + Provider), or implement the trait yourself.
  2. The AgentTemplate language lets the model write its own tools. Cersei can't easily rewrite its own runtime — but an LLM can write a short template program that the runtime executes on top of the existing tools, then register it for reuse.

60-second example

use cersei::prelude::*;
use cersei::agentrl::{CerseiRunner, CommandVerifier, Orchestrator, ToolRegistry, Verifier};
use cersei::provider::Gemini;
use std::sync::Arc;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
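    // A persisted, searchable database of agent-built tools.
    // Requires the `dirs` crate; fail with an error instead of panicking if no home directory is found.
    let home = dirs::home_dir()
        .ok_or_else(|| anyhow::anyhow!("could not determine the home directory"))?;
    let registry = ToolRegistry::open(home.join(".cersei/agentrl"))?;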

    // An INDEPENDENT verifier — the agent cannot cheat this check.
    let verifier: Arc<dyn Verifier> =
        Arc::new(CommandVerifier::new("python3 gcd.py 48 36 | grep -qx 12"));

    // A fresh provider per agent (sub-agents get their own client).
    let key = std::env::var("GEMINI_API_KEY")?;
    let provider_factory = Arc::new(move || Box::new(Gemini::new(key.clone())) as Box<dyn Provider>);

    // The production runner: real agents + provider + sandboxed proposals.
    let runner = Arc::new(
        CerseiRunner::new(provider_factory, "./work", registry.clone(), verifier)
            .with_model("gemini-3.1-pro-preview") // required for non-Anthropic providers
            .with_max_turns(16),
    );

    let orchestrator = Orchestrator::new(runner, registry);
    let outcome = orchestrator
        .solve("Create gcd.py: takes two ints, prints their GCD via the Euclidean algorithm.")
        .await?;

    println!("solved={} how={:?}", outcome.solved, outcome.how);
    Ok(())
}

Model selection. cersei-agent defaults to an Anthropic model when none is set. For Gemini, OpenAI, or any other provider you must call .with_model(...) on the runner (or .model(...) on the builder), or requests will fail.

What runs when (observed)

From the live end-to-end tests against Gemini:

| Scenario | Path | What happened |
| --- | --- | --- |
| Capable agent, solvable task | Solved::Directly | GeneralAgent wrote gcd.py in one run (recovering from 2 tool errors mid-run). The verifier passed. No planner, no sandboxes. |
| First attempt fails | Solved::ByNewTool(id) | GeneralAgent failed the verifier → trace → planner proposed 2 fixes → both ran in isolated sandboxes → the winner was promoted and registered as a reusable tool. |
| Similar task, second time | Solved::ByCachedTool(id) | registry.search surfaced the previously-built tool; the agent solved it via recall. The planner never ran. |
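
A caller might branch on the three paths like this. The enum below is a local mirror written for this sketch so that it compiles on its own: the variant names come from the table, but the `String` id type and the `describe` helper are assumptions, and the real `Solved` type may carry more variants or a different id type.

```rust
// Local stand-in for the outcome type; not imported from cersei.
#[derive(Debug)]
enum Solved {
    Directly,
    ByNewTool(String),
    ByCachedTool(String),
}

/// One-line summary of how a task was solved.
fn describe(how: &Solved) -> String {
    match how {
        Solved::Directly => "first attempt passed the verifier".to_string(),
        Solved::ByNewTool(id) => format!("recovered; new tool {id} registered"),
        Solved::ByCachedTool(id) => format!("solved by recall of tool {id}"),
    }
}

fn main() {
    let paths = [
        Solved::Directly,
        Solved::ByNewTool("gcd-1".to_string()),
        Solved::ByCachedTool("gcd-1".to_string()),
    ];
    for how in &paths {
        println!("{how:?}: {}", describe(how));
    }
}
```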

Install

AgentRL is behind a feature flag (it pulls in the embeddings + sandbox crates):

[dependencies]
cersei = { version = "0.1", features = ["agentrl"] }
tokio = { version = "1", features = ["full"] }
anyhow = "1"
dirs = "5" # used by the 60-second example to locate the home directory

This enables cersei::agentrl (the loop) and cersei::agentlang (the DSL), and turns on the vms sandbox feature.
