ClawMark: A Living-World Benchmark for Multi-Day, Multimodal Coworker Agents
**ClawMark** is a benchmark for coworker agents: agents built to work alongside a human across multiple working days and multiple services, drawing on raw multimodal evidence. It ships with 100 tasks across 13 professional scenarios, a reproducible harness, and **fully rule-based scoring** (no LLM-as-judge). It is contributed by **Evolvent** together with 40+ researchers from **NUS, HKU, MIT, UW, and UC Berkeley**. We evaluated six models on all 100 tasks, running each task three times per model. The top leaderboard score is currently **55.0**, and every model leaves visible headroom in every scenario.
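To make "rule-based scoring" concrete, here is a minimal sketch of what a deterministic scorer could look like. It is not ClawMark's actual harness: the `Rule` type, the `score_run` and `score_task` helpers, the state fields, and the choice to average over repeated runs are all assumptions made for illustration.

```python
from statistics import mean
from typing import Callable

# Hypothetical: a rule is a deterministic predicate over the final
# state of the task environment. No model is consulted when scoring.
Rule = Callable[[dict], bool]


def score_run(state: dict, rules: list[Rule]) -> float:
    """Return the percentage of rules that the final state satisfies."""
    passed = sum(1 for rule in rules if rule(state))
    return 100.0 * passed / len(rules)


def score_task(run_states: list[dict], rules: list[Rule]) -> float:
    """Average the per-run score over repeated runs (an assumed aggregation)."""
    return mean(score_run(state, rules) for state in run_states)


# Made-up rules and states for an invented invoicing task.
rules: list[Rule] = [
    lambda s: s.get("invoice_total") == 1250,
    lambda s: "finance@example.com" in s.get("emails_sent", []),
]
runs = [
    {"invoice_total": 1250, "emails_sent": ["finance@example.com"]},
    {"invoice_total": 1250, "emails_sent": []},
    {"invoice_total": 900, "emails_sent": []},
]

task_score = score_task(runs, rules)

# Size of the evaluation described above: 6 models x 100 tasks x 3 runs.
total_runs = 6 * 100 * 3
```

Because every rule is a plain predicate, rerunning the scorer on the same final state always gives the same result, which is the property that distinguishes this style of scoring from an LLM-as-judge setup.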