Pareto Research

LEAPBench:
How efficiently do LLMs learn?

LEAPBench evaluates eight frontier LLMs across 55 optimization tasks to identify the best learners.

Marilyn Zhang Tianfeng Chen Fabián Barzuna Kate M. Lubrano Ankita Rathod Mark E. Whiting

Two models can reach the same answer,
but one takes far more tries to get there

By tracking learning trajectories, we find that the most efficient optimizer is frequently not the one with the highest final score.
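To illustrate how the two views can disagree, here is a minimal sketch. It assumes a maximization task and uses the mean of the best-so-far curve as a stand-in for path efficiency; the benchmark's actual trajectory metric is not stated here, and the model names and scores are hypothetical.

```python
from itertools import accumulate

def best_so_far(scores):
    """Running maximum of the scores observed at each iteration."""
    return list(accumulate(scores, max))

def endpoint_score(scores):
    """Best value found by the end of the budget."""
    return max(scores)

def trajectory_score(scores):
    """Mean of the best-so-far curve (an assumed proxy for path efficiency)."""
    curve = best_so_far(scores)
    return sum(curve) / len(curve)

# Hypothetical trajectories for two models on one maximization task.
runs = {
    "model_a": [0.1, 0.2, 0.3, 0.5, 0.9, 1.0],   # slow start, best final value
    "model_b": [0.7, 0.8, 0.9, 0.9, 0.9, 0.95],  # fast start, slightly lower final value
}

endpoint_winner = max(runs, key=lambda m: endpoint_score(runs[m]))
trajectory_winner = max(runs, key=lambda m: trajectory_score(runs[m]))
winner_changes = endpoint_winner != trajectory_winner
```

In this toy case the endpoint winner is model_a and the trajectory winner is model_b, so the task would count as one where the winner changes.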

Winner changes: 28 tasks. Winner holds: 27 tasks.

What models are the most efficient?

The table below compares each model's search efficiency against a standard classical baseline, GP-UCB. For each model, we report how often it outperforms this baseline, how close its overall trajectory stays to parity with it, and how often scoring the full trajectory changes its ranking. Learning curves cover the 30-iteration budget.
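The baseline comparison described above could be computed along these lines. This is a sketch under stated assumptions: each task is summarized by a single positive score per optimizer, "beats baseline" means a strictly higher score, and parity is measured as the mean model-to-baseline ratio. The function names, task names, and numbers are hypothetical and are not taken from the benchmark.

```python
def beats_baseline_rate(model_scores, baseline_scores):
    """Fraction of shared tasks where the model scores strictly higher than the baseline."""
    tasks = model_scores.keys() & baseline_scores.keys()
    wins = sum(model_scores[t] > baseline_scores[t] for t in tasks)
    return wins / len(tasks)

def mean_parity_ratio(model_scores, baseline_scores):
    """Mean of model/baseline score ratios; 1.0 means parity (assumes positive baseline scores)."""
    tasks = model_scores.keys() & baseline_scores.keys()
    ratios = [model_scores[t] / baseline_scores[t] for t in tasks]
    return sum(ratios) / len(ratios)

# Hypothetical per-task scores for one LLM and the GP-UCB baseline.
llm = {"task_1": 0.8, "task_2": 0.5, "task_3": 0.9, "task_4": 0.6}
gp_ucb = {"task_1": 0.4, "task_2": 0.5, "task_3": 0.6, "task_4": 0.8}

win_rate = beats_baseline_rate(llm, gp_ucb)
parity = mean_parity_ratio(llm, gp_ucb)
```

A ratio above 1.0 would indicate that the model's scores sit above the baseline on average, even if it wins on only some tasks.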

By-model comparison

Table columns: Model; Beats baseline?; How efficient is the path?; Best final result; Most efficient path; Rank shifts

Claude Opus 4.8 is excluded from the primary model rankings because it was evaluated on a subset of 35 biology tasks. On this subset, it outperformed the baseline on 60.0% of the tasks.

Does scientific context help?

We tested the common assumption that providing scientific context (such as specific protein names or units) helps LLMs optimize better. By stripping away this terminology in a 'domain-agnostic' control group, we isolated how these domain priors can affect in-context learning.
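As a rough illustration of the two prompt conditions, the sketch below builds a domain-labeled prompt and a data-only prompt from the same variable ranges. The variable names, units, objective, and prompt wording are all hypothetical; the benchmark's actual prompts are not shown here.

```python
def domain_labeled_prompt(variables, objective):
    """Prompt that keeps scientific names and units."""
    lines = [
        f"- {v['name']} ({v['unit']}): range {v['low']} to {v['high']}"
        for v in variables
    ]
    return f"Maximize {objective}.\n" + "\n".join(lines)

def data_only_prompt(variables):
    """Prompt with names and units replaced by generic labels x1, x2, ..."""
    lines = [
        f"- x{i}: range {v['low']} to {v['high']}"
        for i, v in enumerate(variables, start=1)
    ]
    return "Maximize y.\n" + "\n".join(lines)

# Hypothetical task definition.
variables = [
    {"name": "temperature", "unit": "degrees C", "low": 20, "high": 40},
    {"name": "pH", "unit": "pH units", "low": 5, "high": 9},
]

labeled = domain_labeled_prompt(variables, "protein yield")
agnostic = data_only_prompt(variables)
```

Because both prompts expose the same numeric ranges, any performance gap between them can be attributed to the labels rather than to the data.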

[Figure legend: data-only prompt performs better; domain-labeled prompt performs better]

Explore the task set

Scientific optimization tasks differ widely in how models approach them. Some tasks yield the same winner regardless of the metric used, while others reveal sharp disagreements between final scores and path efficiency.

Table columns: Task; Winner changes?; Beats baseline?; Does context help?; Endpoint winner; Trajectory winner