The Art of Scaling Test-Time Compute for Large Language Models

ArXiv Preprint

Aradhye Agarwal
Microsoft Research, IIT Delhi, India
Ayan Sengupta
IIT Delhi, India
Tanmoy Chakraborty
IIT Delhi, India
Test-Time Scaling, Large Language Models, Reasoning, Efficient Inference, Machine Learning

Abstract

Test-time scaling (TTS), the dynamic allocation of compute during inference, is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion generated tokens, eight open-source LLMs (7B to 235B parameters), and four reasoning datasets. We observe consistent trends: no single TTS strategy universally dominates; reasoning models exhibit distinct trace-quality patterns that form short-horizon and long-horizon categories; and optimal performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy given problem difficulty, model type, and compute budget.

Teaser figure: A taste of how different inference strategies trade compute for accuracy across model families. Shaded regions show which technique (shortest-trace, beam search, or majority voting) wins at each compute budget.

TL;DR

How much “thinking time” should a language model get? Our study explores test‑time scaling—allocating extra compute during inference—and discovers that there’s no single best strategy. The optimal approach hinges on your model, the task and how much compute you can spare.

Why this research?

Early work on TTS took a scattershot approach: some papers claimed that letting models deliberate longer improves accuracy, while others observed that lengthy chains of thought reinforce errors and actually harm performance. These contradictions stem from limited experiments and a narrow focus on specific model families. The authors set out to settle the debate with a systematic, cross-model comparison. Their study spans open-source models (7B–235B parameters) and two complementary benchmarks: AIME 2024–2025 for symbolic mathematics and GPQA Diamond for conceptual science questions. By analysing both reasoning and non-reasoning models, and by grouping them into short- or long-horizon categories based on their training algorithms (e.g., GRPO vs. GSPO), the authors provide a principled framework for choosing the right inference strategy at runtime.

Main insights

  • No free lunch in TTS: No single method (beam search, self‑consistency, first‑finish search, etc.) universally outperforms the others. Optimality changes with compute budget and model type.
  • Short‑ vs. long‑horizon models: Models trained with GRPO or similar reinforcement learning methods tend to prefer short, concise reasoning chains; longer traces often degrade their accuracy. Conversely, models trained via GSPO can sustain deeper reasoning and benefit from longer chains on harder tasks.
  • Beam search isn’t your friend: Increasing the beam width rarely improves accuracy on reasoning tasks and sometimes hurts it, exemplifying “inverse scaling.”
  • First‑finish search shines for concise thinkers: For short‑horizon models, sampling many traces in parallel and stopping as soon as the first k finish yields the best cost–accuracy trade‑off (a minimal sketch follows this list).
  • Majority voting is robust: When compute is abundant, simply sampling multiple reasoning traces and picking the most frequent answer (majority voting) consistently outperforms length‑based filtering on both short‑ and long‑horizon models.
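
To make the first-finish and majority-voting strategies concrete, here is a minimal Python sketch under stated assumptions: `sample_trace` is a hypothetical callable standing in for one LLM sampling call (it runs a full reasoning trace and returns the final answer string), and the n/k defaults are illustrative rather than the paper's settings.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

def first_finish_search(sample_trace, n=16, k=5):
    """First-finish search (FFS) sketch: launch n reasoning traces in
    parallel and majority-vote among the first k that finish.

    `sample_trace` is a hypothetical stand-in for one LLM sampling
    call; it runs a full reasoning trace and returns an answer string.
    """
    pool = ThreadPoolExecutor(max_workers=n)
    futures = [pool.submit(sample_trace) for _ in range(n)]
    answers = []
    for future in as_completed(futures):
        answers.append(future.result())
        if len(answers) == k:
            break
    # Drop the slower traces: cancel anything still queued and return
    # without waiting for traces already in flight (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    # Majority vote among the k earliest-finishing traces.
    return Counter(answers).most_common(1)[0][0]
```

With k = n this reduces to plain majority voting (self-consistency); shrinking k cuts latency and token cost by discarding the slowest, typically longest, traces, which is where short-horizon models benefit most.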

Keywords

  • Test‑Time Scaling (TTS): Dynamically increasing compute at inference to improve reasoning.
  • Short‑Horizon vs. Long‑Horizon: Categories of models depending on whether shorter or longer reasoning traces lead to higher accuracy.
  • First‑Finish Search (FFS): A strategy that samples N traces and majority‑votes among the first k to complete.
  • Majority Voting: Aggregating answers from multiple traces and choosing the most common one; effective at high compute budgets.
  • Beam Search: Maintaining several high‑probability reasoning paths; found to have diminishing returns on complex reasoning (toy sketch below).
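
To ground the beam-search entry, below is a toy sketch, not the paper's implementation. The `expand` hook is a hypothetical stand-in for the model's per-step proposals: given a partial trace it returns candidate continuations, each with a log-probability and a completion flag. `beam_width` is the knob whose widening the study finds yields diminishing, and sometimes negative, returns.

```python
def beam_search(expand, initial_state, beam_width=4, max_steps=64):
    """Toy beam search over partial reasoning traces.

    `expand(state)` is a hypothetical hook returning a list of
    (next_state, log_prob, is_finished) continuations for a partial
    trace; only the `beam_width` highest-scoring candidates survive
    each step.
    """
    # Each beam entry: (cumulative log-probability, state, finished?)
    beam = [(0.0, initial_state, False)]
    for _ in range(max_steps):
        candidates = []
        for score, state, finished in beam:
            if finished:
                candidates.append((score, state, True))  # keep completed traces
                continue
            for next_state, logp, done in expand(state):
                candidates.append((score + logp, next_state, done))
        if not candidates:  # model proposed nothing; keep the previous beam
            break
        # Prune: keep only the beam_width most probable traces.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if all(finished for _, _, finished in beam):
            break
    # Return the state of the highest-scoring trace found.
    return max(beam, key=lambda c: c[0])[1]
```

Nothing here forces a wider beam to explore qualitatively different reasoning; it mostly adds high-likelihood neighbours of the same path, which is one plausible reading of the inverse-scaling behaviour reported above.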