ClawMark: A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents

ClawMark is a benchmark for coworker agents — those built to work alongside a human across multiple working days, multiple services, and raw multimodal evidence. It ships with 100 tasks across 13 professional scenarios, a reproducible harness, and fully rule-based scoring (no LLM-as-judge). It is contributed by Evolvent together with 40+ researchers from NUS, HKU, MIT, UW, and UC Berkeley. We evaluated six models across all 100 tasks, three times each. The current leaderboard top is 55.0, and every model shows visible headroom on every scenario.

Motivation

Most agent benchmarks follow a pattern closer to exam questions: a single prompt, a fixed environment, one shot. This format has advanced the field of agent evaluation considerably, but a structural gap remains between what these benchmarks measure and what an OpenClaw-style coworker agent is expected to do in practice. A real coworker agent needs to sustain progress on the same task across multiple days, operate in an environment that colleagues continuously modify, process photos, audio, and PDFs directly, and coordinate across multiple tools. We identify three structural limitations in current benchmarks:

Limitation I — Time is flattened. A typical benchmark collapses a task into a single turn — the model only observes the environment at t=0. In practice, a task spans hours or days, and during that interval the environment evolves independently of the agent: colleagues send emails, other systems update records, calendar events shift, new files appear in shared folders. Removing the time dimension also removes the requirement that a coworker agent must handle changes it did not cause.

Limitation II — Even multi-turn benchmarks freeze the environment. A few benchmarks support multi-turn interactions, but the environment state initialized at turn 1 remains unchanged at turn N — all intermediate changes are caused by the agent itself. ClawMark updates live backend state at every turn boundary; a competent coworker must continuously perceive the latest environment state rather than merely respond to the most recent message.

Limitation III — Inputs are text-centric. Some existing benchmarks have added images, but real office artifacts also include phone audio, scanned PDFs, whiteboard photos, short videos, and mixed-format spreadsheets. Flattening everything into captions obscures a significant portion of the actual work.

The following table provides a structural comparison between ClawMark and several representative benchmarks:

Benchmark	# Tasks	# Scenarios	Multimodal	Multi-Day	Verification	Environment
WebArena	812	5	None	No	Rule-based	Static
OSWorld	369	9	Partial	No	Rule-based	Static
Terminal-Bench	89	~6	None	No	LLM-as-judge	Static
MCPMark	127	5	None	No	Rule-based	Static
Claw-Eval

Three columns map directly to the three limitations above: Multi-Day addresses Limitation I (time is flattened), Environment addresses Limitation II (environment is frozen), and Multimodal addresses Limitation III (inputs are text-centric). Existing benchmarks cover at most one of these three dimensions; ClawMark is the only benchmark that simultaneously satisfies Multi-Day = Yes, Environment = Dynamic, and Multimodal = Full. ClawMark also uses purely rule-based scoring (the Verification column), avoiding the reproducibility issues introduced by LLM-as-judge.

What a ClawMark task looks like

Every task is composed of four elements:

Multi-day timeline. Each task spans 1 to 3 in-universe working days, delivered as 1–3 turns. Between turns the clock advances and the agent receives the new day's instructions.
Cross-service environment. Tasks run against mock backends for filesystem, email (GreenMail), Notion, Google Sheets, and Calendar (CalDAV). A typical task involves 3–5 of these services.
Multimodal raw evidence. Inputs include video, audio, PDF, image, CSV, and XLSX. Models read the raw artifacts directly — no pre-transcribed text versions are provided.
Dynamic environment. Between turns the environment changes — new emails arrive, records are modified, and new files appear in the input directory. Checkers query live backend state at evaluation time, not static snapshots.

Scoring is deterministic. Each task ships with a set of Python checker functions that inspect the real post-turn state of the environment. No LLM judge is used in the evaluation pipeline. Given the same post-turn state, the checkers always produce the same score, so the scoring step itself is fully reproducible.
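A minimal sketch of what rule-based scoring of this kind could look like, assuming each checker is a Python function that takes the post-turn environment state and returns a boolean, and that each checker carries a weight (as the w1 / w1.5 / w2 labels in the cases below suggest). The `Checker` type, the layout of the state dictionary, and the checker names are hypothetical; the actual harness API is not shown in this post.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]  # hypothetical snapshot of backend state after a turn


@dataclass
class Checker:
    name: str
    weight: float
    check: Callable[[State], bool]


def weighted_pass_rate(checkers: List[Checker], state: State) -> float:
    """Return the weight of passed checkers divided by total weight (0.0 to 1.0)."""
    total = sum(c.weight for c in checkers)
    if total == 0:
        return 0.0
    passed = sum(c.weight for c in checkers if c.check(state))
    return passed / total


# Hypothetical checkers; names and state fields are illustrative only.
checkers = [
    Checker("decision_file_exists", 1.0,
            lambda s: "claim_decision.json" in s["files"]),
    Checker("rate_updated", 2.0,
            lambda s: s["decision"].get("no_invoice_rate") == 0.50),
]

state = {"files": ["claim_decision.json"], "decision": {"no_invoice_rate": 0.25}}
print(weighted_pass_rate(checkers, state))  # weight 1.0 of 3.0 passes
```

Because each checker is a pure function of the state it is given, rerunning the scoring on the same state yields the same number.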

Case 1 — insurance_task5

Enterprise Property Insurance Claim

Enterprise fire claim assessment for Xiao Ma: fire origin, invoice gaps, safety violations. Final output: claim_decision.json with ¥694,000 payout.

Role: Claims adjuster · 5 stages over 5 days · 15 checkers (2 red-line) · Multi-modal: photos, CCTV, PDF, CSV, Excel

Services: File System · Email · Notion · Google Sheets · Calendar

Mon 5/13 · Case Intake

PROP-CLM-0513 fire claim materials have arrived. Do the initial review today, final decision by Friday. Need fire dept final report before decision.

Email: Hengtong Technology submits claim (¥1.2M) with fire scene photos, CCTV, inventory list, policy PDF

Notion: CRM created with customer record (Hengtong, policy PROP-2024-008912)

Sheets: Claim rate table created (standard rates)

Calendar: Fire inspection records show Zone B escape routes blocked (Dec 2023), rectification deadline overdue (Jan 2024)

Checkers (3)

✓ claim_decision.json exists with all required fields (w1)

✓ Identified V-pattern at Zone B southwest corner from photos (w2)

✓ Flagged suspicious CCTV timeline (w2)

Tue 5/14 · Prelim Fire Report

Hengtong GM Liu: "The fire loss is enormous. Can you advance a partial claim payment? We really need the funds urgently."

Email: Fire dept preliminary report: origin at Zone B southwest, arson NOT ruled out

Notion: CRM Financial Notes appended: revenue -42% YoY, 3 overdue payables (fraud motive risk)

Checkers (2)

✓ Discovered CRM financial distress note → financial_distress_noted=true (w1.5)

✕ RED LINE: Did not approve/reject claim before fire dept final report (w2)

Wed 5/15 · Temp Data & Access

Director Zhou: "How is the fire case progressing? Give me an interim investigation opinion."

Email: Fire dept sends warehouse temperature sensor CSV log (14:00–16:00)

Sheets: Rate table overwritten: no-invoice items now valued at 50% of list price

Notion: CRM flag changed: pending → suspicious_circumstances

Notion: CRM Investigation Notes: CEO Liu Zhiqiang entered Zone B at 15:08 (fire at 15:20)

Checkers (4)

✓ Analyzed CSV → temperature_anomaly_found=true (w1.5)

✓ Discovered 50% rate change → no_invoice_rate=0.50 (w2)

✓ Reflected elevated investigation status (w1.5)

✓ Identified CEO Liu Zhiqiang in access log (w1.5)

Thu 5/16 · Final Cause

You have 1 new email.

Email: Fire dept final report: cause = electrical fault, arson RULED OUT

Calendar: New event: fire safety violation confirmed; recommend insurer apply policy fire safety clauses

Checkers (2)

✓ fire_cause_official = 'electrical_fault' (w1.5)

✓ Linked calendar violation → fire_safety_violation_found=true (w2)

Fri 5/17 · Decision Day

Director Zhou: "Issue the final claim conclusion today, write to workspace/claim_decision.json."

Notion: CRM: invoices cover only ¥680K of ¥1.2M; remaining ¥520K has no invoice support

Sheets: New row added: Fire Safety Violation = 20% deduction (Policy Article 15)

Notion: CRM: salvage value assessed at ¥48,000, must deduct per Policy Article 16

Checkers (4)

✓ with_invoice=680K, without=520K, verified_total=940K (w2)

✓ Final payout = ¥694,000 (w2)

✓ salvage_value = ¥48,000 (w2)

✕ RED LINE: Did not disclose arson suspicion to Hengtong (w2)
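
The verified total that the Friday checker expects can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 50% rate introduced on Wednesday applies to the full ¥520K of items without invoices, which is consistent with the 940K figure above. The function name is hypothetical. The final payout additionally depends on the Article 15 and Article 16 deductions and on policy terms not spelled out in this summary, so it is not derived here.

```python
def verified_total(with_invoice: float, without_invoice: float,
                   no_invoice_rate: float) -> float:
    """Invoiced amount counts in full; uninvoiced amount counts at the reduced rate."""
    return with_invoice + without_invoice * no_invoice_rate


# Figures from the case above, in yuan.
total = verified_total(with_invoice=680_000,
                       without_invoice=520_000,
                       no_invoice_rate=0.50)
print(int(total))  # 940000
```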


Case 2 — journalist_task1

Breaking-News Flash & Fact-Checking

Breaking fire flash writing and fact-checking for editor Liu Ying, reconciling contradictory sources. Final output: fact-checked CMS article + evening_summary.md.

Role: Editing assistant · 3 stages over 5 hours · 15 checkers · Multi-modal: WAV and MP3 audio, MP4 video, PDF, photos, Excel

Services: File System · Email · Notion · Google Sheets

14:50 · First Flash

Breaking story! Huachuang Tech Park is on fire. Sort out confirmed facts, mark contradictions. Create CMS entry and fill fact-check sheet.

Email: Reporter Xiao Chen: "I uploaded materials to input/"

File System: 2 interview WAVs, 1 witness video MP4, official bulletin PDF, scene photos, historical incidents Excel

Notion: CMS database (news_db) created; agent must create first draft

Sheets: Fact-check sheet (factcheck_001) with 8 pre-seeded rows: fire time, alarm time, arrival, extinguished, location, floor, casualties, cause

Checkers (4)

✓ Created at least one breaking-news draft in CMS (w1)

✓ Discovered 14:20 vs 14:35 conflict → filled conflict column (w2)

✓ Discovered "five-or-six" vs 2 conflict → filled conflict column (w2)

✓ Produced conflict_report.csv with time + casualty rows, valid resolution types (w1.5)

16:00 · Press Briefing

14:20 or 14:35: what exactly is the timing? Someone in the video shouts 3rd floor caught fire first. Can we write that? Xiao Chen got the press-briefing recording.

Email: Xiao Chen: "Press briefing recording uploaded to materials folder"

File System: press_briefing_audio.mp3 uploaded to input/

Sheets: New row appended to factcheck_001: "Evacuation Count" (empty)

Checkers (5)

✓ Extracted fire time 14:28 from audio, sourced to briefing (w2)

✓ Distinguished alarm-received time = 14:35 (w1.5)

✓ Discovered silent "Evacuation Count" row → filled 200 (w2)

✓ Extracted cause "electrical fault" from audio (w1.5)

✓ Updated CMS draft with briefing info (14:28, 3rd floor, electrical) (w1)

19:00 · Evening Summary

I need an evening-summary version for the 19:30 night meeting. Also check the mailbox; there are a few new emails.

Email: Anonymous tipster: "Huachuang park had been penalized before" + safety_violation_notice.pdf

Email: Xiao Chen forwards forwarded_scene.jpg: "Someone says this is from the scene. Can we use it?"

File System: safety_violation_notice.pdf and forwarded_scene.jpg uploaded

File System: health_commission_bulletin.pdf silently uploaded to input/ (no email notification)

Checkers (6)

✓ Rejected forwarded_scene.jpg: not referenced in CMS (w2)

✓ Discovered health bulletin → casualties=2, smoke inhalation, discharged (w2)

✓ Produced evening_summary.md with ≥3 of 4 required sections (w1.5)

✓ Tipster identity (proton.me) never appears in any output (w2)

✓ All 8 fact-check rows have non-empty final_value (w1)

✓ CMS final draft includes health-commission wording + casualty conclusion (w1)
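
A checker such as "tipster identity never appears in any output" lends itself to a plain string scan. The sketch below assumes the agent's outputs have already been collected into a dictionary of artifact name to text; the function name, the dictionary layout, and the sample texts are hypothetical and not the benchmark's own code.

```python
from typing import Dict


def identity_leaked(outputs: Dict[str, str], forbidden: str) -> bool:
    """Return True if the forbidden string appears in any output, ignoring case."""
    needle = forbidden.lower()
    return any(needle in text.lower() for text in outputs.values())


# Hypothetical outputs keyed by artifact name.
outputs = {
    "cms_draft": "Fire broke out at 14:28 on the 3rd floor; cause: electrical fault.",
    "evening_summary.md": "An anonymous source reported a prior safety penalty.",
}
print(identity_leaked(outputs, "proton.me"))  # False
```

A real checker would pass only when this function returns False for every artifact the agent wrote.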


Scale and contributors

100 tasks contributed by Evolvent together with 40+ PhD students and professors from NUS, HKU, MIT, UW, and UC Berkeley.
13 professional scenarios: clinical assistant, content operation, e-commerce, EDA, executive assistant, HR, insurance, investment analyst, journalist, legal assistant, project management, real estate, research assistant.
91 distinct in-task roles across the 100 tasks. A single scenario often covers significantly different positions — the clinical assistant scenario alone includes a pharmacist assistant, a surgical scheduler, a charge nurse, and a chronic-disease clinic assistant, each with its own rubric.
Tasks span from everyday coworker scenarios to specialized professional ones, including law, finance, and electronic design automation — areas that most existing agent benchmarks have not yet covered.

Results

We evaluated six models across all 100 tasks, with each task executed 3 independent times per model (1,800 total task executions). The reported metric is avg@3: the three per-run score values are first averaged for each task, then averaged across all 100 tasks. score is the weighted pass rate over each task's Python checkers; the tables below present it on a 0–100 scale with 1 decimal place.
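
The avg@3 aggregation described above can be sketched as follows, assuming per-run scores are stored as weighted pass rates in [0, 1] keyed by task. The function name, the data layout, and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def avg_at_k(runs_per_task: Dict[str, List[float]]) -> float:
    """Average per-run scores within each task, then average across tasks.

    Returns the result on a 0-100 scale, rounded to 1 decimal place.
    """
    per_task = [mean(scores) for scores in runs_per_task.values()]
    return round(100 * mean(per_task), 1)


# Hypothetical per-run scores for two tasks, three runs each.
runs = {
    "task_a": [0.50, 0.60, 0.70],
    "task_b": [0.20, 0.30, 0.40],
}
print(avg_at_k(runs))  # 45.0
```

Averaging within each task first means every task contributes equally to the final number.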

ClawMark results overview

The left panel is the overall leaderboard. The right panel shows how the 100 tasks distribute across the 13 ClawMark scenarios.

Per-scenario performance

Model	Clinical Assistant	Content Operation	E-commerce	EDA	Executive Assistant	HR	Insurance	Investment Analyst	Journalist	Legal Assistant	Project Management	Real Estate	Research Assistant
GPT-5.4	73.1	54.6	49.1	78.3	50.4	56.6	78.8	48.5	45.9	35.3	37.2	78.3	58.1
Claude 4.6 Sonnet	55.4	53.6	48.6	50.7

Cost estimate (fair alignment under a no-cache assumption)

The three metric columns below are totals for one full benchmark run (all 100 tasks). total cost is estimated by multiplying each model's input / output token usage by OpenRouter's public pricing, assuming no prompt cache so that the six models can be compared under the same accounting.

Model	avg@3	total input tokens	total output tokens	total cost
GPT-5.4	55.0	90.6M	1.7M	$252.41
Claude 4.6 Sonnet	54.9	303.0M	2.5M	$946.19
Qwen 3.6 Plus	49.8	289.1M	3.6M	$100.95
Gemini 3.1 Pro Preview	39.3	162.4M	0.7M	$333.52
MiniMax M2.7	34.4	169.9M	1.8M	$53.15
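
The no-cache cost estimate is a linear function of token counts. A minimal sketch, using the GPT-5.4 token totals from the table; the two per-million-token prices are placeholder assumptions for illustration, not figures quoted in this post, and the function name is hypothetical. Since the table rounds token totals, a recomputation like this will not match the reported cost to the cent.

```python
def estimate_cost(input_tokens_m: float, output_tokens_m: float,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated cost in USD.

    Token counts are in millions; prices are in USD per million tokens.
    No prompt-cache discount is applied.
    """
    return input_tokens_m * input_price_per_m + output_tokens_m * output_price_per_m


# Token totals from the GPT-5.4 row; both prices are assumed placeholders.
cost = estimate_cost(90.6, 1.7, input_price_per_m=2.50, output_price_per_m=15.00)
print(f"${cost:.2f}")
```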

Findings

The overall ceiling is low. GPT-5.4 (55.0) and Claude 4.6 Sonnet (54.9) are effectively tied at the top, with the highest individual score being Claude 4.6 Sonnet's 80.1 on insurance. No model exceeds 56 on the overall avg@3.

Efficiency varies significantly. GPT-5.4 and Claude 4.6 Sonnet are effectively tied on avg@3 but diverge considerably on cost: Claude 4.6 Sonnet consumes 3.3× the input tokens (303M vs 91M) to achieve a comparable score, and a single benchmark run costs about 3.7× as much ($946 vs $252). On a score-per-million-input-tokens basis, GPT-5.4 is approximately 3.4× as efficient as Claude 4.6 Sonnet, and it is the only model that falls clearly in both the upper-score and upper-efficiency quadrants.
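
The ratios in this paragraph can be rechecked from the rounded values in the cost table. The sketch below uses only those table values; the variable and function names are illustrative.

```python
# (avg@3, total input tokens in millions, total cost in USD) from the cost table.
rows = {
    "GPT-5.4": (55.0, 90.6, 252.41),
    "Claude 4.6 Sonnet": (54.9, 303.0, 946.19),
}


def score_per_m_input(model: str) -> float:
    """avg@3 points earned per million input tokens."""
    score, input_m, _cost = rows[model]
    return score / input_m


input_ratio = rows["Claude 4.6 Sonnet"][1] / rows["GPT-5.4"][1]
cost_ratio = rows["Claude 4.6 Sonnet"][2] / rows["GPT-5.4"][2]
efficiency_ratio = (score_per_m_input("GPT-5.4")
                    / score_per_m_input("Claude 4.6 Sonnet"))

print(f"input tokens: {input_ratio:.2f}x")
print(f"cost: {cost_ratio:.2f}x")
print(f"score per M input tokens: {efficiency_ratio:.2f}x")
```

With the rounded table values, the three ratios come out at roughly 3.3, 3.7, and 3.4 respectively.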

Multi-stage evaluation reveals differentiation that single-stage evaluation cannot. For each model, we extracted per-stage avg@3 scores from the 73 three-stage tasks at Stage 0 (day 1), Stage 1 (day 2), and Stage 2 (day 3). At Stage 0, GPT-5.4 (55.3) leads Claude 4.6 Sonnet (46.1) by 9.2 percentage points. As the environment evolves between stages (new emails arrive, colleagues update spreadsheets, calendar events shift), the gap narrows to 5.3 pp at Stage 1, and by Stage 2 Claude 4.6 Sonnet (58.6) overtakes GPT-5.4 (52.5) by 6.1 pp. The two models' overall scores are nearly identical (55.0 vs 54.9), yet this parity masks very different scoring structures: GPT-5.4's early lead is offset by Claude 4.6 Sonnet's lead at the final stage. The variation across stages reflects how models behave along the time dimension. As the environment continuously injects new information, models must accumulate context, perceive changes, and respond accordingly, and different models follow markedly different adaptation paths. This is precisely the capability dimension that the multi-stage, evolving-environment design is built to capture.

Performance trajectory across stages

Case study: GPT-5.4's evidence chain on `content_operation_task7`

This is a DevSummit event operations task with inputs spanning a voice memo, walkthrough video, PDF quotes, floor plans, and an Excel budget. GPT-5.4 independently discovered corroborating contradictions across multiple modalities and autonomously chose the correct investigation path. Below is the key trajectory from its highest-scoring run (80.0):

Step	Modality	Tool	Discovery
1	🎙️ Voice memo	`whisper`	Patricia: "cross-check the venue capacity claims, I have heard they sometimes exaggerate" → sets investigation direction
2	🎥 Walkthrough video	`ffmpeg` → `vision`	Frame-by-frame extraction; fire marshal notice on wall reads 180 persons, contradicting the marketed 300
3	📄 PDF quote	`PyMuPDF`	Page 3, clause 7: 200-person minimum spend; actual cost $9,000 vs quoted $6,750

The causal chain from step 1 → 2 is particularly notable: the model first captured an investigation lead from audio ("capacity may be inflated"), then used tools to convert video into image frames and searched them with that specific question in mind — this cross-modal reasoning chain from audio to visual evidence was unique to GPT-5.4; no other model completed it. This also underscores the importance of designing ClawMark as a multimodal benchmark.
