benchmarks.bio

Agentic benchmarks on messy, real-world biological data

Each problem includes a snapshot of real experimental data taken immediately prior to a target analysis step, a description of the task through a high-level scientific lens, and a deterministic grader (e.g., Jaccard similarity of sets) that verifiably evaluates recovery of the key biological result. The benchmark is designed to test durable biological reasoning rather than method-specific implementation details, and to require empirical interaction with the data.
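
As a concrete illustration of the grading style (a minimal sketch: the function names, gene-set framing, and 0.8 pass threshold are illustrative assumptions, not the benchmark's actual interface), a set-recovery task could be scored like this:

    def jaccard(predicted: set[str], reference: set[str]) -> float:
        """Jaccard similarity between a predicted and a reference set."""
        if not predicted and not reference:
            return 1.0
        return len(predicted & reference) / len(predicted | reference)

    def grade(predicted_genes: set[str], reference_genes: set[str],
              threshold: float = 0.8) -> bool:
        """Deterministic pass/fail: does the recovered gene set match the
        reference result at or above the similarity threshold?"""
        return jaccard(predicted_genes, reference_genes) >= threshold

    # Example: a run that recovers 3 of 4 reference marker genes, plus one extra
    print(grade({"CD3D", "CD8A", "GZMB", "NKG7"}, {"CD3D", "CD8A", "GZMB", "IL7R"}))

Because the grader is deterministic and operates on the final biological result rather than on intermediate code, any analysis route that recovers the correct answer passes.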

Leaderboard

[Bar chart: AI model performance on genomics benchmarks. Pass rate (0-100%) per model, colored by provider (Anthropic, OpenAI, Google, xAI); full results appear in the table below.]
Model (Harness)                 Pass Rate  Single Cell Accuracy  Spatial Accuracy  Avg. Cost  Duration
GPT-5.5                         57.6%      Coming soon           57.6%             $1.121     587s
GPT-5.4                         57.4%      Coming soon           57.4%             $0.577     1129s
GPT-5.5 (Codex)                 53.7%      Coming soon           53.7%             $3.162     382s
Claude Opus 4.6                 52.8%      52.8%                 52.8%             $0.600     303s
Claude Opus 4.7                 52.4%      Coming soon           52.4%             $0.000     627s
Gemini 3.1 Pro Preview          51.6%      Coming soon           51.6%             $0.936     1062s
Claude Opus 4.7 (Claude Code)   51.4%      Coming soon           51.4%             $0.802     533s
Grok-4.20 Beta (Reasoning)      45.9%      Coming soon           45.9%             $0.168     343s
Claude Sonnet 4.6               44.2%      Coming soon           44.2%             $0.273     405s
Grok-4.1 Fast (Reasoning)       34.0%      Coming soon           34.0%             $0.016     357s
Grok-4                          32.9%      33.9%                 31.9%             $0.100     203s
Gemini 2.5 Pro                  29.1%      29.2%                 28.9%             $0.300     300s
Showing 12 of 12 models · 95% CI · 3 runs
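
The page does not state how the 95% confidence intervals are derived from the 3 runs; a minimal sketch, assuming a two-sided Student t-interval over the per-run pass rates (an assumption, not the site's documented method):

    import statistics
    from math import sqrt

    def ci95_three_runs(run_pass_rates: list[float]) -> tuple[float, float]:
        """95% confidence interval for the mean pass rate over exactly 3 runs,
        using the two-sided Student t critical value for 2 degrees of freedom."""
        assert len(run_pass_rates) == 3, "critical value below assumes n = 3"
        mean = statistics.mean(run_pass_rates)
        sem = statistics.stdev(run_pass_rates) / sqrt(3)  # standard error of the mean
        t_crit = 4.303                                    # t at 0.975, df = 2
        return mean - t_crit * sem, mean + t_crit * sem

    # Hypothetical per-run pass rates for one model
    print(ci95_three_runs([0.56, 0.58, 0.59]))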

Evaluation Examples

Several examples spanning task categories and platforms. Expand any row to inspect model runs.

Showing 5 of 553 examples
