Agentic benchmarks on messy, real-world biological data
Each problem includes a snapshot of real experimental data taken immediately prior to a target analysis step, a description of the task through a high-level scientific lens, and a deterministic grader (e.g., Jaccard similarity of sets) that verifiably evaluates recovery of the key biological result. The benchmark is designed to test durable biological reasoning rather than method-specific implementation details, and to require empirical interaction with the data.
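As an illustration of set-based deterministic grading, a Jaccard-similarity check might look like the sketch below. The gene names and threshold are hypothetical, not taken from the benchmark; the actual grader implementation is not shown here.

```python
def jaccard_similarity(predicted: set, reference: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    if not predicted and not reference:
        return 1.0  # two empty sets are treated as identical
    return len(predicted & reference) / len(predicted | reference)

# Hypothetical example: grading a predicted marker-gene set
reference = {"CD3D", "CD3E", "CD8A", "GZMB"}
predicted = {"CD3D", "CD3E", "CD8A", "NKG7"}
print(jaccard_similarity(predicted, reference))  # 3 shared / 5 total -> 0.6
```

Because the score depends only on set membership, any analysis pipeline that recovers the same biological result passes, regardless of the tools used to get there.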
Leaderboard
[Bar chart: pass rate per model, colored by provider (Anthropic, OpenAI, Google, xAI); values match the table below.]
| Model (Harness) | Pass Rate | Single Cell Accuracy | Spatial Accuracy | Avg. Cost | Duration |
|---|---|---|---|---|---|
| GPT-5.5 | 57.6% | Coming soon | 57.6% | $1.121 | 587s |
| GPT-5.4 | 57.4% | Coming soon | 57.4% | $0.577 | 1129s |
| GPT-5.5 (Codex) | 53.7% | Coming soon | 53.7% | $3.162 | 382s |
| Claude Opus 4.6 | 52.8% | 52.8% | 52.8% | $0.600 | 303s |
| Claude Opus 4.7 | 52.4% | Coming soon | 52.4% | $0.000 | 627s |
| Gemini 3.1 Pro Preview | 51.6% | Coming soon | 51.6% | $0.936 | 1062s |
| Claude Opus 4.7 (Claude Code) | 51.4% | Coming soon | 51.4% | $0.802 | 533s |
| Grok-4.20 Beta (Reasoning) | 45.9% | Coming soon | 45.9% | $0.168 | 343s |
| Claude Sonnet 4.6 | 44.2% | Coming soon | 44.2% | $0.273 | 405s |
| Grok-4.1 Fast (Reasoning) | 34.0% | Coming soon | 34.0% | $0.016 | 357s |
| Grok-4 | 32.9% | 33.9% | 31.9% | $0.100 | 203s |
| Gemini 2.5 Pro | 29.1% | 29.2% | 28.9% | $0.300 | 300s |
All 12 models shown · 95% CI · 3 runs
Evaluation Examples
Five of the 553 evaluation examples, spanning task categories and platforms.