Agentic benchmarks on messy, real-world biological data
Each problem includes a snapshot of real experimental data taken immediately prior to a target analysis step, a description of the task through a high-level scientific lens, and a deterministic grader (e.g., Jaccard similarity of sets) that verifiably evaluates recovery of the key biological result. The benchmark is designed to test durable biological reasoning rather than method-specific implementation details, and to require empirical interaction with the data.
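As an illustration of set-based deterministic grading, a Jaccard-similarity check might look like the sketch below. The gene names and threshold are hypothetical, not taken from the benchmark; the actual grader implementation is not shown here.

```python
def jaccard_similarity(predicted: set, reference: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets."""
    if not predicted and not reference:
        return 1.0  # two empty sets are treated as identical
    return len(predicted & reference) / len(predicted | reference)

# Hypothetical example: grading a predicted marker-gene set
reference = {"CD3D", "CD3E", "CD8A", "GZMB"}
predicted = {"CD3D", "CD3E", "CD8A", "NKG7"}
print(jaccard_similarity(predicted, reference))  # 3 shared / 5 total -> 0.6
```

Because the score depends only on set membership, any analysis pipeline that recovers the same biological result passes, regardless of the tools used to get there.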
Leaderboard
[Bar chart: pass rate per model, colored by provider (Anthropic, OpenAI, Google, xAI); values match the table below.]
| Model (Harness) | Pass Rate | Single Cell Accuracy | Spatial Accuracy | Avg. Cost | Duration |
|---|---|---|---|---|---|
| GPT-5.5 | 57.6% | Coming soon | 57.6% | $1.121 | 587s |
| GPT-5.4 | 57.4% | Coming soon | 57.4% | $0.577 | 1129s |
| GPT-5.5 (Codex) | 53.7% | Coming soon | 53.7% | $3.162 | 382s |
| Claude Opus 4.6 | 52.8% | 52.8% | 52.8% | $0.600 | 303s |
| Claude Opus 4.7 | 52.4% | Coming soon | 52.4% | $0.000 | 627s |
| Gemini 3.1 Pro Preview | 51.6% | Coming soon | 51.6% | $0.936 | 1062s |
| Claude Opus 4.7 (Claude Code) | 51.4% | Coming soon | 51.4% | $0.802 | 533s |
| Grok-4.20 Beta (Reasoning) | 45.9% | Coming soon | 45.9% | $0.168 | 343s |
| Claude Sonnet 4.6 | 44.2% | Coming soon | 44.2% | $0.273 | 405s |
| Grok-4.1 Fast (Reasoning) | 34.0% | Coming soon | 34.0% | $0.016 | 357s |
| Grok-4 | 32.9% | 33.9% | 31.9% | $0.100 | 203s |
| Gemini 2.5 Pro | 29.1% | 29.2% | 28.9% | $0.300 | 300s |
All 12 models shown · 95% CI · 3 runs
Evaluation Examples
Five of the 553 evaluation examples, spanning task categories and platforms.