arxiv: 2504.16084 · v3 · submitted 2025-04-22 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links

TTRL: Test-Time Reinforcement Learning

Biqing Qi, Bowen Zhou, Ermo Hua, Ganqu Cui, Haozhan Li, Kaiyan Zhang, Lifan Yuan, Li Sheng, Ning Ding, Shang Qu, Xinwei Long, Xuekai Zhu, Youbang Sun, Yuchen Zhang, Yuxin Zuo, Zhiyuan Ma

Pith reviewed 2026-05-15 00:42 UTC · model grok-4.3

keywords: test-time reinforcement learning · unlabeled data · LLM reasoning · majority voting · self-improvement · AIME benchmark · test-time scaling

The pith

TTRL lets LLMs improve reasoning on unlabeled test data by treating majority voting as an RL reward.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Test-Time Reinforcement Learning (TTRL) to apply RL to large language models using only unlabeled data on reasoning tasks. It finds that majority voting from test-time scaling supplies an effective reward signal that drives training without ground-truth labels. Experiments show this yields large gains, such as a 211 percent rise in pass@1 for Qwen-2.5-Math-7B on AIME 2024, and lets the model exceed the maj@n ceiling of its starting point. The method draws on the model's own pre-trained priors for self-evolution. A reader would care because the result points to a practical route for improving deployed models on new problems without collecting labeled examples.
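
For scale, reading the 211 percent figure as a relative increase means the final pass@1 is a little more than three times the starting value. The symbols p_0 and p_1 (pass@1 before and after TTRL) are introduced here for illustration and do not come from the paper:

```latex
\[
\frac{p_1 - p_0}{p_0} \approx 2.11
\quad\Longrightarrow\quad
p_1 \approx 3.11\, p_0 .
\]
```

The absolute scores are not quoted in this summary, so only the ratio can be read off.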

Core claim

TTRL trains LLMs via reinforcement learning on unlabeled test inputs by using majority voting outcomes as the reward, enabling consistent performance gains that surpass the initial model's maj@n upper limit and approach the results of training with ground-truth labels.

What carries the argument

Majority voting as a reward estimator inside an RL loop applied directly to unlabeled test-time inputs.
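
A minimal sketch of that reward estimator, assuming each sampled response has already been reduced to a final-answer string. The function name `majority_vote_rewards` and the binary 1.0/0.0 reward are illustrative assumptions, not the paper's released code:

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for one prompt's sampled final answers.

    Hypothetical helper: the most frequent answer becomes the pseudo-label,
    and each sample is rewarded 1.0 if it matches that label, else 0.0.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders equal counts by first occurrence, so ties go to
    # the answer that was sampled first.
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


label, rewards = majority_vote_rewards(["42", "17", "42", "42", "9"])
print(label, rewards)
```

No ground-truth answer appears anywhere in the function, which is the point of the method and also the source of the objections raised further down.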

Load-bearing premise

Majority voting produces a reliable enough reward signal to guide useful RL updates without any ground-truth labels.

What would settle it

Applying TTRL to a new reasoning benchmark and measuring no improvement or a drop in pass@1 accuracy compared with the baseline model would refute the central claim.

Original abstract

This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n, and approach the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TTRL turns majority voting into an RL reward for unlabeled test data and reports big reasoning lifts, but the signal may just reinforce common mistakes when the base model is weak.

TTRL takes the standard test-time trick of majority voting over multiple samples and feeds it directly into RL updates on the same unlabeled test set. The headline numbers are a claimed 211% pass@1 jump for Qwen-2.5-Math-7B on AIME 2024 and the ability to beat the original model's maj@n ceiling while approaching fully supervised performance. The setup is straightforward: sample, vote for a pseudo-label, run RL, repeat on the test distribution. That framing is the main novelty, and the experiments show the pattern holds across several math and reasoning tasks. The paper does a decent job documenting consistent gains without needing ground-truth labels, which is the practical hook. The citation pattern sits cleanly on existing test-time scaling and RL-for-LLMs work without obvious omissions.

The soft spot is the one the stress test flags. When base accuracy is low, as it is on AIME for a 7B model, the majority vote is frequently wrong, so the reward can push the policy toward the dominant error rather than the correct answer. The abstract gives no correlation between the maj@n reward and actual correctness, no ablation with random or constant rewards, and no check for overfitting from repeated sampling on the identical test split. Without those, it is hard to separate real self-improvement from sampling artifacts. RL hyperparameters and the exact algorithm are also thin.

This is for groups already running test-time scaling experiments who want a simple way to squeeze more out of unlabeled data. A reader focused on practical inference tricks would get value from the numbers, but would need the full controls before treating the gains as settled. It deserves peer review because the idea is easy to test and the reported effect size is large enough to matter if it survives the obvious checks.

Referee Report

3 major / 2 minor

Summary. The paper introduces Test-Time Reinforcement Learning (TTRL), a method for applying RL to LLMs on unlabeled data for reasoning tasks by deriving rewards from majority voting (maj@n). It claims that this enables self-evolution, yielding a ~211% pass@1 gain on AIME 2024 for Qwen-2.5-Math-7B and allowing the model to surpass the initial model's maj@n upper bound while approaching supervised performance.

Significance. If the central claims hold after verification, the work would be significant for demonstrating label-free RL self-improvement at test time in reasoning domains, extending test-time scaling techniques into a training loop and reducing dependence on ground-truth labels.

major comments (3)

[Abstract] Abstract: The 211% pass@1 improvement on AIME 2024 is reported without any description of the RL algorithm (e.g., PPO, GRPO), learning rate, number of gradient steps, or value of n in maj@n, preventing verification of the result or reproduction.
[Experiments] Experiments section: No ablation replaces the maj@n reward with a random or constant baseline reward; without this control, the reported gains cannot be distinguished from artifacts of repeated sampling on the fixed unlabeled test set.
[Method] Method: The manuscript provides no correlation analysis between maj@n-derived rewards and ground-truth correctness on the AIME 2024 split; given the low base pass@1 of Qwen-2.5-Math-7B, this leaves open the possibility that the reward reinforces dominant errors rather than correct solutions.
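
The checks requested in the second and third major comments are cheap to run once ground-truth answers are available for evaluation. A minimal sketch, assuming final answers are already extracted as strings; the names `reward_label_agreement` and `random_rewards` are hypothetical and not taken from the paper or its repository:

```python
import random
from collections import Counter


def reward_label_agreement(answers, truth):
    """Fraction of samples whose majority-vote reward equals the reward
    they would have received from the ground-truth answer."""
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    matches = sum((a == pseudo_label) == (a == truth) for a in answers)
    return matches / len(answers)


def random_rewards(n, seed=0):
    """Control condition: 0/1 rewards that ignore the answers entirely."""
    rng = random.Random(seed)
    return [float(rng.random() < 0.5) for _ in range(n)]


# Toy case: the majority answer "17" is wrong; one sample has the true "42".
answers = ["17", "17", "17", "42", "9", "3"]
agreement = reward_label_agreement(answers, truth="42")
control = random_rewards(len(answers))
print(agreement, control)
```

In the toy case the majority answer is wrong, yet two of the six rewards still agree with the ground-truth reward, because samples that are both wrong and outside the majority receive 0 either way. Whether that happens often enough on AIME to account for the reported gains is the empirical question the referee raises.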

minor comments (2)

[Abstract] Abstract: The notation 'maj@n' is used without an explicit definition or reference to the number of samples n employed in the experiments.
[Abstract] The GitHub link is provided but the manuscript does not indicate whether the released code includes the exact hyperparameters and random seeds used for the AIME 2024 results.
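
On the first minor comment, the two metrics can be written down compactly under their usual reading: pass@1 as the mean accuracy of a single sample, and maj@n as the accuracy of the most frequent answer among n samples. This is a sketch of those conventional definitions, not the paper's evaluation code, and the function names are assumptions:

```python
from collections import Counter


def pass_at_1(samples_per_problem, truths):
    """Per-sample accuracy within each problem, averaged over problems."""
    per_problem = [
        sum(a == t for a in answers) / len(answers)
        for answers, t in zip(samples_per_problem, truths)
    ]
    return sum(per_problem) / len(per_problem)


def maj_at_n(samples_per_problem, truths):
    """Fraction of problems whose most frequent sampled answer is correct."""
    hits = [
        Counter(answers).most_common(1)[0][0] == t
        for answers, t in zip(samples_per_problem, truths)
    ]
    return sum(hits) / len(hits)


# Two toy problems with n = 4 samples each.
samples = [["42", "42", "17", "42"], ["5", "6", "6", "7"]]
truths = ["42", "6"]
p1 = pass_at_1(samples, truths)
mj = maj_at_n(samples, truths)
print(p1, mj)
```

In this toy data maj@n is higher than pass@1, since voting recovers the right answer on both problems even though individual samples are often wrong. That gap is why the initial model's maj@n is a natural reference point for the paper's claim.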

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and outline the corresponding revisions.

Referee: [Abstract] Abstract: The 211% pass@1 improvement on AIME 2024 is reported without any description of the RL algorithm (e.g., PPO, GRPO), learning rate, number of gradient steps, or value of n in maj@n, preventing verification of the result or reproduction.

Authors: We agree that the abstract should include these implementation details to support verification. The full manuscript describes the use of GRPO as the RL algorithm along with the associated hyperparameters in Section 3. In the revised version we will explicitly state the RL algorithm, learning rate, number of gradient steps, and the value of n in maj@n directly in the abstract.
revision: yes

Referee: [Experiments] Experiments section: No ablation replaces the maj@n reward with a random or constant baseline reward; without this control, the reported gains cannot be distinguished from artifacts of repeated sampling on the fixed unlabeled test set.

Authors: This is a fair observation. While the reported gains include surpassing the initial model's maj@n upper bound (which would not occur from sampling alone), we acknowledge that an explicit random-reward control would further isolate the contribution of the learned policy. We will add this ablation experiment to the revised Experiments section.
revision: yes

Referee: [Method] Method: The manuscript provides no correlation analysis between maj@n-derived rewards and ground-truth correctness on the AIME 2024 split; given the low base pass@1 of Qwen-2.5-Math-7B, this leaves open the possibility that the reward reinforces dominant errors rather than correct solutions.

Authors: We recognize the importance of this analysis. The observation that TTRL exceeds the initial maj@n performance already suggests the model is not merely reinforcing majority errors, yet we will strengthen the manuscript by adding an explicit correlation study between maj@n rewards and ground-truth correctness on the AIME 2024 split in the revised Method or Experiments section.
revision: yes
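
The rebuttal names GRPO as the RL algorithm. As generally described, GRPO scores each sample relative to the other samples drawn for the same prompt, so a binary vote-based reward turns into a signed advantage. A rough sketch of that normalization step only; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made here, not details confirmed by the paper:

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale one prompt's rewards by the group's own statistics."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an arbitrary choice
    return [(r - mean) / (std + eps) for r in rewards]


# Mixed votes: samples agreeing with the majority get positive advantage.
adv = group_relative_advantages([1.0, 0.0, 1.0, 1.0, 0.0])
# Unanimous votes: every advantage is zero, so the prompt gives no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(adv, flat)
```

Under this scheme a prompt on which all samples agree contributes no gradient signal, whether the shared answer is right or wrong.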

Circularity Check

0 steps flagged

No significant circularity: reward signal is external majority vote, not self-referential fit

The TTRL derivation uses majority voting over n samples drawn from the current policy as an external reward signal for RL updates on unlabeled data. This procedure is computed independently of the policy gradient steps and is not defined in terms of the final performance metric; the reported gains (including surpassing initial maj@n) are presented as empirical outcomes of optimization rather than algebraic identities. No load-bearing self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the core chain. The method remains self-contained against external benchmarks such as ground-truth pass@1 on held-out sets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that majority voting supplies a usable reward for RL without ground truth, plus the background assumption that pre-trained model priors are sufficient to drive useful self-evolution.

axioms (1)

domain assumption: Majority voting among model outputs provides an effective reward signal for RL training on unlabeled reasoning data.
Explicitly stated in the abstract as the key enabler of the method.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.LawOfExistence defect_zero_iff_one unclear
TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on the AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, TTRL has demonstrated performance to consistently surpass the upper limit of the initial model maj@n.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI 2026-03 conditional novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Bounded Ratio Reinforcement Learning
cs.LG 2026-04 conditional novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.
MemDLM: Memory-Enhanced DLM Training
cs.CL 2026-03 unverdicted novelty 7.0

MemDLM embeds a simulated denoising trajectory into DLM training via bi-level optimization, creating a parametric memory that improves convergence and long-context performance even when the memory is dropped at test time.
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
cs.LG 2025-05 conditional novelty 7.0

A model trained only by proposing and solving its own verifiable code tasks achieves state-of-the-art results on math and coding benchmarks without external data.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
cs.LG 2026-05 unverdicted novelty 6.0

MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
Gradient Extrapolation-Based Policy Optimization
cs.LG 2026-05 unverdicted novelty 6.0

GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and ...
Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning
cs.LG 2026-04 unverdicted novelty 6.0

DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.
Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents
cs.CL 2026-04 unverdicted novelty 6.0

ProactAgent learns a proactive retrieval policy via reinforcement learning on paired task continuations, improving lifelong agent performance and cutting retrieval overhead on SciWorld, AlfWorld, and StuLife.
TEMPO: Scaling Test-time Training for Large Reasoning Models
cs.LG 2026-04 unverdicted novelty 6.0

TEMPO scales test-time training for large reasoning models by interleaving policy refinement on unlabeled data with critic recalibration on labeled data via an EM formulation, yielding large gains on AIME tasks.
Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data
cs.LG 2026-04 unverdicted novelty 6.0

A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.
Characterizing Model-Native Skills
cs.AI 2026-04 conditional novelty 6.0

Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
cs.AI 2026-04 unverdicted novelty 6.0

MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
ZeroCoder: Can LLMs Improve Code Generation Without Ground-Truth Supervision?
cs.SE 2026-04 unverdicted novelty 6.0

ZeroCoder co-evolves coder and tester LLMs via self-generated code-test execution feedback to improve code generation up to 21.6% without ground-truth supervision.
Can LLMs Learn to Reason Robustly under Noisy Supervision?
cs.LG 2026-04 conditional novelty 6.0

Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...
GrandCode: Achieving Grandmaster Level in Competitive Programming via Agentic Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 6.0

GrandCode is the first AI system to consistently beat all human participants and place first in live Codeforces competitive programming contests.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
cs.LG 2026-05 unverdicted novelty 5.0

PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
Triviality Corrected Endogenous Reward
cs.CL 2026-04 unverdicted novelty 5.0

TCER corrects triviality bias in endogenous rewards for text generation by rewarding relative information gain modulated by probability correction, yielding consistent unsupervised improvements on writing benchmarks a...
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.