SPIRAL: Learning to Search and Aggregate

2026 · cs.AI · arXiv 2606.23595

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Language model reasoning can be substantially improved at test time via scaffolds that scale inference compute across different primitives -- sequential reasoning within a trace, independently sampled parallel traces, and aggregation of multiple reasoning traces into a final response. During post-training, however, language models are optimized only for sequential reasoning within a single trace. We introduce Sequential-Parallel-Aggregative Reinforcement Learning (SPIRAL), a framework in which a language model is trained to use all three primitives as part of a unified inference compute pipeline. Concretely, the language model first samples a set of independent traces in parallel, each produced through sequential chain-of-thought reasoning, and then generates a final aggregation trace conditioned on those traces; all components are optimized end-to-end against the reward of the final aggregated response. To train this system, SPIRAL uses set reinforcement learning to teach models to produce a set of traces that are collectively useful for an aggregator, and standard reinforcement learning to teach models to aggregate the set into improved final responses. Our experiments on reasoning tasks show that SPIRAL scales effectively with inference compute, achieving up to 11$\times$ better scaling efficiency and 15% higher performance than GRPO when all three compute primitives are scaled.
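
The sample-then-aggregate pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `rollout`, `toy_sampler`, `majority_aggregator`, and `final_reward` are hypothetical names, each trace is reduced to its final answer string, and majority voting stands in for the learned aggregation trace. The sketch shows only the data flow (k independent traces, one aggregation step, one scalar reward on the aggregated response) and omits the set RL and standard RL updates entirely.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

# Assumption: a "trace" is reduced to its final answer string for illustration.
Trace = str


def rollout(
    prompt: str,
    k: int,
    sample_trace: Callable[[str, random.Random], Trace],
    aggregate: Callable[[str, List[Trace]], str],
    rng: random.Random,
) -> Tuple[List[Trace], str]:
    """Sample k independent traces, then aggregate them into one response."""
    # Parallel primitive: each trace is drawn independently of the others.
    traces = [sample_trace(prompt, rng) for _ in range(k)]
    # Aggregative primitive: the final response is conditioned on all traces.
    final = aggregate(prompt, traces)
    return traces, final


def toy_sampler(prompt: str, rng: random.Random) -> Trace:
    """Hypothetical stand-in for a model: answers "4" with probability 0.6."""
    return "4" if rng.random() < 0.6 else "5"


def majority_aggregator(prompt: str, traces: List[Trace]) -> str:
    """Hypothetical stand-in for a learned aggregator: return the most common answer."""
    return Counter(traces).most_common(1)[0][0]


def final_reward(final: str, target: str) -> float:
    """Scalar reward computed only on the aggregated response."""
    return 1.0 if final == target else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    traces, final = rollout("2 + 2 = ?", 5, toy_sampler, majority_aggregator, rng)
    print(traces, final, final_reward(final, "4"))
```

In this toy setup the reward depends on the trace set only through the aggregated answer, which mirrors the abstract's statement that all components are optimized against the reward of the final aggregated response.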

representative citing papers

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

QuasiMoTTo uses quasi-Monte Carlo to produce correlated yet marginally correct samples from language models, matching i.i.d. pass@k with 25-47% fewer samples on reasoning benchmarks and 50% fewer RL training steps.

citing papers explorer

QuasiMoTTo: Quasi-Monte Carlo Test-Time Scaling cs.LG · 2026-07-01 · unverdicted · none · ref 12 · internal anchor
