LEAPBench: How efficiently do LLMs learn?

Two models can reach the same answer,
but one takes far more tries to get there

By tracking learning trajectories, we find that the most efficient optimizer is frequently not the one with the highest final score.

Chart legend: Winner changes (28), Winner holds (27)
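To make the endpoint-versus-trajectory distinction concrete, here is a minimal sketch. It assumes path efficiency can be approximated by the mean of the best-so-far curve. That is a stand-in chosen for illustration, not necessarily the metric LEAPBench uses. The optimizer names and scores are made up.

```python
from typing import Callable, Dict, List


def best_so_far(scores: List[float]) -> List[float]:
    """Running maximum: the best score found up to each iteration."""
    curve, best = [], float("-inf")
    for s in scores:
        best = max(best, s)
        curve.append(best)
    return curve


def endpoint_score(scores: List[float]) -> float:
    """Best value found by the end of the budget."""
    return best_so_far(scores)[-1]


def trajectory_score(scores: List[float]) -> float:
    """Mean of the best-so-far curve (assumed proxy for path efficiency)."""
    curve = best_so_far(scores)
    return sum(curve) / len(curve)


def winner(runs: Dict[str, List[float]], metric: Callable[[List[float]], float]) -> str:
    """Name of the run with the highest value under the given metric."""
    return max(runs, key=lambda name: metric(runs[name]))


# Hypothetical scores for two optimizers on one task (higher is better).
runs = {
    "fast_learner": [0.2, 0.7, 0.8, 0.8, 0.85, 0.85],
    "late_bloomer": [0.1, 0.1, 0.2, 0.3, 0.4, 0.9],
}

endpoint_winner = winner(runs, endpoint_score)
trajectory_winner = winner(runs, trajectory_score)
winner_changes = endpoint_winner != trajectory_winner
print(endpoint_winner, trajectory_winner, winner_changes)
```

In this toy case the late bloomer has the higher final score, but the fast learner spends most of the budget near its best value, so scoring the whole path changes the winner.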

What models are the most efficient?

The table below compares each model's search efficiency against a standard classical baseline, GP-UCB. We report how often each model outperforms this baseline, how close its overall path stays to parity with it, and how often scoring the full trajectory rather than the final result changes its ranking. Learning curves are tracked over a 30-iteration budget.

Model	Beats baseline?	How efficient is the path?	Best final result	Most efficient path	Rank shifts

Claude Opus 4.8 is excluded from the primary model rankings because it was evaluated on a subset of 35 biology tasks. On this subset, it outperformed the baseline on 60.0% of the tasks.

Does scientific context help?

We tested the common assumption that providing scientific context (such as specific protein names or units) helps LLMs optimize better. By stripping away this terminology in a 'domain-agnostic' control group, we isolated how these domain priors can affect in-context learning.

Chart legend: Data-only prompt performs better, Domain-labeled prompt performs better
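As a rough illustration of what a domain-agnostic control might look like, the sketch below renames inputs to x1, x2, ... and the objective to y while leaving the numeric values untouched. The function names, parameter names, and prompt layout are all hypothetical. The benchmark's actual prompt construction is not described here.

```python
from typing import Dict, List, Tuple

Observation = Tuple[Dict[str, float], float]


def strip_domain_labels(observations: List[Observation]):
    """Rename inputs to x1, x2, ... keeping values intact (hypothetical helper)."""
    names = list(observations[0][0])  # order of the first observation fixes the labels
    mapping = {name: f"x{i}" for i, name in enumerate(names, start=1)}
    stripped = [
        ({mapping[k]: v for k, v in inputs.items()}, y)
        for inputs, y in observations
    ]
    return stripped, mapping


def render_prompt(observations: List[Observation], objective_name: str) -> str:
    """Format past observations as plain text lines for an in-context prompt."""
    lines = []
    for inputs, y in observations:
        args = ", ".join(f"{k}={v}" for k, v in inputs.items())
        lines.append(f"{args} -> {objective_name}={y}")
    return "\n".join(lines)


# Made-up observations with domain-specific names and units.
labeled = [
    ({"temperature_C": 37.0, "pH": 7.4}, 0.62),
    ({"temperature_C": 42.0, "pH": 6.8}, 0.71),
]

data_only, mapping = strip_domain_labels(labeled)
labeled_prompt = render_prompt(labeled, "enzyme_activity")
data_only_prompt = render_prompt(data_only, "y")
print(labeled_prompt)
print(data_only_prompt)
```

Both prompts carry identical numbers, so any performance gap between them can be attributed to the labels rather than to the data.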

Explore the task set

Scientific optimization tasks differ widely in how models approach them. Some tasks yield the same winner regardless of the metric used, while others reveal sharp disagreements between final scores and path efficiency.

Task	Winner changes?	Beats baseline?	Does context help?	Endpoint winner	Trajectory winner
