
Terrarium: Multi-turn data engine for LLM agents in living environments
We're open-sourcing Terrarium, a multi-turn data engine for evaluating and optimizing LLM agents in living environments. Think of it as a terrarium for agents — you build a contained world that lives and breathes, place agents inside, and observe how they behave.
The gap
The way we evaluate agents has evolved through three phases.
Phase 1: Static QA. The first generation of LLM evaluation was straightforward — single-turn question answering with ground-truth matching. No environment, no tools. Frameworks like lm-evaluation-harness and lmms-eval defined this era and remain the standard for measuring base model capabilities.
Phase 2: Single-turn agents. As LLMs gained tool use and code execution, a new class of benchmarks emerged. Frameworks like Harbor provision static sandboxes — a container, a database, a set of files — and let agents execute multi-step tasks in a single turn. Verification happens once at the end via a test script. This works well for coding tasks and simple tool use, but the environment is set up once and doesn't change independently of the agent.
The gap. Real-world agents don't operate in a single moment. A personal assistant checks email, finds a meeting conflict, reschedules the calendar, and drafts a reply — this happens over time, across multiple turns, with the environment changing between each turn. A new email arrives mid-task. A database record gets updated by another service. A file appears in a directory the agent is supposed to monitor. Single-turn frameworks compress all of this into one shot — the agent gets one turn and the environment never changes on its own. They have limited support for environments that mutate between turns, multi-turn agent interactions, or task logic with loops and branches. As agents move beyond coding into personal assistance, workflow automation, and proactive monitoring, these limitations become blocking.
Phase 3: Multi-turn agents in living environments. This is where Terrarium sits. Tasks are programs that build a living world, drive agents through it over multiple turns, and verify outcomes at any stage. The environment evolves between turns. Control flow adapts based on what the agent actually does. And agents can be proactive — not just responding to prompts, but monitoring for changes and acting on their own initiative.
Single-turn multi-step vs. multi-turn multi-step. In single-turn multi-step, the agent receives one instruction and loops through multiple tool calls to complete it — one turn, many steps. In multi-turn multi-step, the task has multiple stages: between each stage, the environment can change on its own, new context can arrive, and the next instruction may depend on what happened before. Each turn is itself a multi-step interaction. Existing frameworks handle the former. Terrarium is built for the latter.
| | Phase 1: Static QA | Phase 2: Single-turn Agents | Phase 3: Multi-turn Agents |
|---|---|---|---|
| Interaction | Single-turn, single-step | Single-turn, multi-step | Multi-turn, multi-step |
| Environment | None | Static sandboxes | Composable, mutates between turns |
| Control flow | — | Linear | Loops & branches |
| Verification | Ground-truth matching | Final test script | Programmatic checkers at any stage |
| Proactive agents | — | — | Supported |
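The structural difference between the last two columns can be made concrete with a toy driver loop. This is a deliberately simplified sketch, not Terrarium's actual API: the point is that in the multi-turn case the *task program*, not the agent, gets to mutate the world between stages.

```python
# Toy sketch (not Terrarium's real API): single-turn multi-step vs.
# multi-turn multi-step drivers.

def single_turn(agent, env, instruction):
    # One instruction; the agent loops over tool calls internally.
    # The environment only changes through the agent's own actions.
    return agent(env, instruction)

def multi_turn(agent, env, stages):
    # Several stages; between them the task program mutates the world.
    outcomes = []
    for instruction, mutate_world in stages:
        outcomes.append(agent(env, instruction))
        if mutate_world:
            mutate_world(env)  # e.g. a new email arrives mid-task
    return outcomes
```

With a stub agent that just snapshots the inbox, the second stage sees an email that did not exist during the first — exactly the situation a single-turn run can never produce.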
Core features
- Living environments — task programs mutate the environment between agent turns. New emails arrive, database records change, files appear — creating dynamic scenarios that static benchmarks can't express.
- Composable capabilities — mix and match capabilities (`email`, `calendar`, `postgres`, `notion`, ...) in a single task. The framework handles provisioning, networking, and teardown.
- Pure Python task DSL — no YAML schemas or configuration languages. Tasks are plain Python functions with loops, branches, and stage-level checks.
- Multi-turn formulation — tasks unfold over time rather than in one shot. The agent acts, time passes, the environment changes, and the agent acts again — adapting to a world that has moved on since its last turn.
- Proactive agent support — heartbeat and webhook patterns for agents that monitor, anticipate, and act on their own initiative.
Why Terrarium
You write task programs in pure Python. Each program builds a living environment, drives an agent through it over multiple turns, and verifies the outcomes. Here's what that looks like in practice.
Living environments
Existing frameworks set up an environment once, and it doesn't change independently of the agent — no surprises, no mid-task changes. But real workflows don't work that way. Your inbox keeps receiving emails while you're writing a reply. A database gets updated by another service while you're querying it. Terrarium lets the task program mutate the world between agent turns — new emails arrive, database records change, files appear:
```python
@entry(capabilities=["email", "workspace"])
def dynamic_workflow(env, agent):
    env.email.send(from_addr="[email protected]", to="[email protected]",
                   subject="Q3 Report", body="Please prepare the Q3 report.")
    agent.act("Check your email and start working on what's requested.")

    # Between turns, new context arrives
    env.workspace.fs.write_file("/root/data/q3_numbers.csv", updated_csv)
    env.email.send(from_addr="[email protected]", to="[email protected]",
                   subject="Re: Q3 Report", body="Just uploaded the latest numbers. Use those instead.")
    agent.act("Check for any updates before you finalize.")
    # Agent should notice the new email and the new file, and adapt accordingly.
```
Composable capabilities
Real agent tasks span multiple services — email, calendars, databases, file systems, cloud APIs. Most frameworks either hard-wire a fixed environment or leave orchestration entirely to the user. In Terrarium, you compose an environment from capabilities — like assembling a terrarium from soil, water, and plants. Declare what you need, and the framework provisions, networks, and tears down everything automatically:
```python
@entry(capabilities=["email", "calendar", "notion", "postgres"])
def composable_capabilities(env, agent):
    env.email.send(...)          # GreenMail container
    env.calendar.add_event(...)  # Radicale CalDAV container
    env.notion.create_page(...)  # Notion API
    env.postgres.execute(...)    # PostgreSQL container
```
Six capabilities are built-in — four sandbox-backed (workspace, email, postgres, calendar) and two API-based (notion, google_sheets). They're treated uniformly through the same interface.
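One plausible shape for that uniform interface is a small protocol with provision/teardown hooks — the names here (`Capability`, `provision`, `build_env`) are illustrative assumptions, not Terrarium's actual classes:

```python
# Hypothetical sketch of a uniform capability interface; Terrarium's
# real class and method names may differ.
from typing import Protocol

class Capability(Protocol):
    name: str
    def provision(self) -> None: ...   # start a container / open an API session
    def teardown(self) -> None: ...    # stop the container / close the session

class FakePostgres:
    name = "postgres"
    def __init__(self):
        self.running = False
    def provision(self):
        self.running = True    # stand-in for launching a PostgreSQL container
    def teardown(self):
        self.running = False

def build_env(capabilities: list[Capability]) -> dict[str, Capability]:
    # Provision everything up front; tasks then access env["postgres"], etc.
    for cap in capabilities:
        cap.provision()
    return {cap.name: cap for cap in capabilities}
```

The value of the uniform shape is that sandbox-backed and API-based capabilities can be provisioned, addressed, and torn down by the same orchestration code.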
Control flow
Most existing frameworks define tasks through configuration files or rigid schemas. This works for linear workflows but breaks down when task logic needs to branch based on agent behavior or loop until a condition is met. Terrarium tasks are plain Python functions — loops, conditional branches, intermediate checks, all native:
```python
@entry(capabilities=["email", "notion", "calendar"])
def branch_and_loop(env, agent):
    agent.act("Read my lecture notes and write a study guide in Notion.")

    # Loop until the guide is detailed enough
    original_length = len(get_text(env, page_id))
    for _ in range(5):
        if len(get_text(env, page_id)) >= original_length * 5:
            break
        agent.act("The guide is too brief. Expand it with more details.")

    # Branch based on what the agent actually wrote
    if "bellman equation" in get_text(env, page_id).lower():
        env.email.send(from_addr="[email protected]", to="[email protected]",
                       subject="Rescheduled", body=RESCHEDULE_EMAIL)
        agent.act("Check your email for updates.")
    else:
        agent.act("Email the professor requesting an extension.")
```
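The "intermediate checks" above can also be made non-fatal, so a failed check is recorded mid-task while later stages still run. This is an assumed pattern, not Terrarium's API — `check` and the stand-in `page_text` are hypothetical:

```python
# Sketch of stage-level checking (assumed pattern, not Terrarium's API):
# a checker is just a predicate over environment state, runnable at any stage.

def check(condition: bool, message: str, failures: list[str]) -> bool:
    # Record the failure instead of raising, so later stages still execute
    # and the task can report every failed check at the end.
    if not condition:
        failures.append(message)
    return condition

failures: list[str] = []
page_text = "covers the bellman equation in depth"  # stand-in for get_text(...)
check(len(page_text) > 10, "guide too short", failures)
check("bellman" in page_text, "missing Bellman equation", failures)
```

Because checks are plain function calls, they compose with the same loops and branches as the rest of the task program.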
Proactive agents
Most existing benchmarks test reactive agents — give a prompt, get a response. But a growing class of agents need to monitor, anticipate, and act on their own initiative. Terrarium supports proactive patterns like heartbeats and webhooks out of the box:
```python
# Heartbeat pattern: agent receives periodic signals and must decide what to do
@entry(capabilities=["workspace"])
def proactive_heartbeat(env, agent):
    agent.act("You'll receive periodic heartbeat signals. Check /root/results/ for new files each time.")
    agent.act("[09:30] Heartbeat: check for changes.")  # nothing new — agent should do nothing
    env.workspace.fs.upload(...)                        # file appears
    agent.act("[10:00] Heartbeat: check for changes.")  # agent should notice and act

# Webhook pattern: agent receives event notifications and decides how to respond
@entry(capabilities=["email"])
def proactive_webhook(env, agent):
    agent.act("You'll receive email webhook notifications. Decide whether each needs a reply.")
    env.email.send(from_addr="[email protected]", to="[email protected]",
                   subject="Urgent: client meeting", body="...")
    agent.act('{"event": "new_email", "from": "[email protected]", "subject": "Urgent: client meeting"}')
    # Agent should read the email and reply.
    env.email.send(from_addr="[email protected]", to="[email protected]",
                   subject="Weekly digest", body="...")
    agent.act('{"event": "new_email", "from": "[email protected]", "subject": "Weekly digest"}')
    # Agent should ignore this one.
```
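In the webhook pattern, each turn's input is a JSON event string. A minimal agent-side triage for such events might look like the following — this is a hypothetical sketch of consumer logic, not part of Terrarium, and the `URGENT_MARKERS` heuristic is an assumption:

```python
import json

# Hypothetical triage for webhook turns: parse the JSON event delivered
# via agent.act(...) and decide whether it warrants a reply.
URGENT_MARKERS = ("urgent", "asap", "action required")  # assumed heuristic

def needs_reply(event_json: str) -> bool:
    event = json.loads(event_json)
    if event.get("event") != "new_email":
        return False
    subject = event.get("subject", "").lower()
    return any(marker in subject for marker in URGENT_MARKERS)
```

Applied to the two events above, this would flag the client meeting for a reply and let the weekly digest pass.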
What's next
Terrarium is at an early stage and there's a lot we want to build:
- More environment capabilities — browser automation, Slack, MySQL, and more
- More agent adapters — Anthropic SDK, OpenAI, LangChain, local models
- More sandbox providers beyond Docker
- More benchmark integrations
- CLI enhancements, async execution, and documentation site
Check out the GitHub repo to get started, and join our Discord if you want to chat about agent tasks, contribute, or share ideas.