GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
Pith reviewed 2026-06-28 07:43 UTC · model grok-4.3
The pith
By modeling global feature distributions from labeled data, GeoMin decodes rollout discrepancies to reliably assess self-rewards on unlabeled data for efficient semi-supervised RLVR.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoMin models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data.
What carries the argument
Geometric distribution modeling of global features from labeled data to identify structural discrepancies between correct and incorrect rollouts and build a prior for self-reward reliability assessment.
If this is right
- Outperforms the strongest baselines by 4.1 percent on standard RLVR benchmarks.
- Surpasses fully supervised models while using only 10 percent of the annotations.
- Overcomes the data-efficiency limit caused by coarse performance heuristics that waste most unlabeled instances.
- Allows more unlabeled data to contribute to training once self-reward signals are scored with the learned prior.
Where Pith is reading between the lines
- The same distribution-based prior might improve semi-supervised training in other generation tasks where reward signals are noisy, such as code synthesis.
- Feature-space geometry could serve as a general signal for distinguishing high-quality from low-quality model outputs without additional labels.
- If the prior remains stable across training iterations, the method might support repeated rounds of self-improvement on unlabeled data.
Load-bearing premise
That the global feature distributions learned from the labeled data capture the structural differences that distinguish correct rollouts from incorrect ones in a way that predicts self-reward reliability on new unlabeled examples.
What would settle it
An experiment in which the geometric prior's scores for self-reward reliability show no correlation with actual rollout correctness on held-out data, or in which performance gains vanish when the distribution modeling step is removed.
Figures
read the original abstract
Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GeoMin for semi-supervised RLVR, which fits geometric distributions on a small labeled set to decode structural discrepancies between correct and incorrect rollouts and thereby construct a prior for assessing self-reward reliability on unlabeled data. It reports empirical gains of +4.1% over baselines and superiority to fully supervised training using only 10% of the annotations.
Significance. If the geometric modeling reliably isolates rollout-quality signals rather than label artifacts or noise, and if the reported gains are reproducible under standard controls, the approach would offer a concrete route to lowering annotation costs in LLM reasoning while mitigating collapse risks in unsupervised RLVR.
major comments (3)
- [Abstract] Abstract: the central empirical claims (+4.1% improvement and outperformance of full supervision at 10% labels) are stated without any description of tasks, baselines, statistical tests, variance estimates, or controls, rendering the numbers impossible to evaluate against the modeling claim.
- [Method] Method section (geometric distribution modeling): the assertion that global feature distributions fitted on the labeled set decode the structural discrepancy between correct and incorrect rollouts is presented without equations for the distribution family, feature extractor architecture, or any quantitative diagnostics (e.g., separation metrics, KL divergence between correct/incorrect classes, or ablation on distribution fidelity).
- [Experiments] Experiments: no evidence is supplied that the fitted prior is used in a non-circular manner when scoring self-reward signals on the unlabeled set, leaving open the possibility that performance gains arise from label leakage or heuristic reuse rather than the proposed geometric prior.
minor comments (2)
- Notation for the geometric distribution parameters and the self-reward reliability score should be introduced with explicit definitions and a small illustrative example.
- [Abstract] The abstract would benefit from a one-sentence statement of the datasets or reasoning benchmarks used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to improve clarity in the abstract, formalization in the method, and transparency in the experimental protocol. We address each point below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central empirical claims (+4.1% improvement and outperformance of full supervision at 10% labels) are stated without any description of tasks, baselines, statistical tests, variance estimates, or controls, rendering the numbers impossible to evaluate against the modeling claim.
Authors: We agree that the abstract would benefit from additional context. In the revised manuscript we expand the abstract to name the evaluation tasks (GSM8K and MATH), list the primary baselines, and note that results are reported as means with standard deviations across three random seeds. revision: yes
-
Referee: [Method] Method section (geometric distribution modeling): the assertion that global feature distributions fitted on the labeled set decode the structural discrepancy between correct and incorrect rollouts is presented without equations for the distribution family, feature extractor architecture, or any quantitative diagnostics (e.g., separation metrics, KL divergence between correct/incorrect classes, or ablation on distribution fidelity).
Authors: The original method section describes the high-level idea but omits the requested formal details. We have added the explicit geometric PMF, the architecture of the rollout embedding extractor, KL-divergence values between the fitted correct and incorrect distributions, and an ablation confirming distribution fidelity. revision: yes
-
Referee: [Experiments] Experiments: no evidence is supplied that the fitted prior is used in a non-circular manner when scoring self-reward signals on the unlabeled set, leaving open the possibility that performance gains arise from label leakage or heuristic reuse rather than the proposed geometric prior.
Authors: The prior is constructed exclusively from the labeled set and applied to unlabeled rollouts without using their ground-truth labels. We have inserted a data-flow diagram and an ablation that isolates the contribution of the geometric prior versus simple heuristics, showing that the reported gains are attributable to the prior. revision: partial
Circularity Check
No circularity: standard semi-supervised modeling with independent empirical claims
full rationale
The abstract describes fitting global feature distributions on a small labeled set to derive a prior for assessing self-reward reliability on unlabeled data. This is a conventional semi-supervised construction that does not reduce any claimed prediction or result to the input fit by definition, nor does it rely on self-citation chains or imported uniqueness theorems. No equations or derivation steps are shown that equate outputs to inputs by construction. The +4.1% performance claim remains an external empirical assertion rather than a tautological renaming or forced statistical outcome.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Openai o1 system card , author=. arXiv preprint arXiv:2412.16720 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Nature , volume=
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , author=. Nature , volume=. 2025 , publisher=
2025
-
[3]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms , author=. arXiv preprint arXiv:2506.14245 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
A Survey of Reinforcement Learning for Large Reasoning Models
A survey of reinforcement learning for large reasoning models , author=. arXiv preprint arXiv:2509.08827 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
arXiv preprint arXiv:2506.18254 , year=
RLPR: Extrapolating RLVR to General Domains without Verifiers , author=. arXiv preprint arXiv:2506.18254 , year=
-
[8]
Cross- ing the reward bridge: Expanding RL with verifiable rewards across diverse domains
Crossing the reward bridge: Expanding rl with verifiable rewards across diverse domains , author=. arXiv preprint arXiv:2503.23829 , year=
-
[9]
How far can unsupervised rlvr scale llm training? arXiv preprint arXiv:2603.08660, 2026
How Far Can Unsupervised RLVR Scale LLM Training? , author=. arXiv preprint arXiv:2603.08660 , year=
-
[10]
Advances in Neural Information Processing Systems , volume=
Absolute zero: Reinforced self-play reasoning with zero data , author=. Advances in Neural Information Processing Systems , volume=
-
[11]
TTRL: Test-Time Reinforcement Learning
Ttrl: Test-time reinforcement learning , author=. arXiv preprint arXiv:2504.16084 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning
The unreasonable effectiveness of entropy minimization in llm reasoning , author=. arXiv preprint arXiv:2505.15134 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Learning to Reason without External Rewards
Learning to reason without external rewards , author=. arXiv preprint arXiv:2505.19590 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning , author=. arXiv preprint arXiv:2506.08745 , year=
-
[15]
arXiv preprint arXiv:2508.00410 , year=
Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models , author=. arXiv preprint arXiv:2508.00410 , year=
-
[16]
Can large reasoning models self-train?arXiv preprint arXiv:2505.21444,
Can Large Reasoning Models Self-Train? , author=. arXiv preprint arXiv:2505.21444 , year=
-
[17]
arXiv preprint arXiv:2506.17219 , year=
No Free Lunch: Rethinking Internal Feedback for LLM Reasoning , author=. arXiv preprint arXiv:2506.17219 , year=
-
[18]
arXiv preprint arXiv:2512.13106 , year=
TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning , author=. arXiv preprint arXiv:2512.13106 , year=
-
[19]
arXiv preprint arXiv:2601.08393 , year=
Controlled llm training on spectral sphere , author=. arXiv preprint arXiv:2601.08393 , year=
-
[20]
Advances in Neural Information Processing Systems , volume=
Nemotron-flash: Towards latency-optimal hybrid small language models , author=. Advances in Neural Information Processing Systems , volume=
-
[21]
Advances in neural information processing systems , volume=
Root mean square layer normalization , author=. Advances in neural information processing systems , volume=
-
[22]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Deepseekmath: Pushing the limits of mathematical reasoning in open language models , author=. arXiv preprint arXiv:2402.03300 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
Probabilistic contrastive learning for long-tailed visual recognition , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , year=
-
[24]
International Conference on Machine Learning , pages=
On variational bounds of mutual information , author=. International Conference on Machine Learning , pages=. 2019 , organization=
2019
-
[25]
Computational Statistics , volume=
A short note on parameter approximation for von Mises-Fisher distributions: and a fast implementation of I s (x) , author=. Computational Statistics , volume=. 2012 , publisher=
2012
-
[26]
Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning , author=. arXiv preprint arXiv:2504.11456 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Proceedings of the Twentieth European Conference on Computer Systems , pages=
Hybridflow: A flexible and efficient rlhf framework , author=. Proceedings of the Twentieth European Conference on Computer Systems , pages=
-
[28]
Learning to Reason under Off-Policy Guidance
Learning to reason under off-policy guidance , author=. arXiv preprint arXiv:2504.14945 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Hugging Face repository , volume=
Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions , author=. Hugging Face repository , volume=
-
[30]
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
-
[31]
Advances in neural information processing systems , volume=
Solving quantitative reasoning problems with language models , author=. Advances in neural information processing systems , volume=
-
[32]
Measuring Mathematical Problem Solving With the MATH Dataset
Measuring mathematical problem solving with the math dataset , author=. arXiv preprint arXiv:2103.03874 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Think you have solved question answering? try arc, the ai2 reasoning challenge , author=. arXiv preprint arXiv:1803.05457 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[34]
First Conference on Language Modeling , year=
Gpqa: A graduate-level google-proof q&a benchmark , author=. First Conference on Language Modeling , year=
-
[35]
Advances in Neural Information Processing Systems , volume=
Mmlu-pro: A more robust and challenging multi-task language understanding benchmark , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
arXiv preprint arXiv:2505.22660 , year=
Maximizing confidence alone improves reasoning , author=. arXiv preprint arXiv:2505.22660 , year=
-
[37]
Confidence is all you need: Few-shot rl fine-tuning of language models , author=. arXiv preprint arXiv:2506.06395 , year=
-
[38]
arXiv preprint arXiv:2507.21931 , year=
Post-training large language models via reinforcement learning from self-feedback , author=. arXiv preprint arXiv:2507.21931 , year=
-
[39]
arXiv preprint arXiv:2508.11356 , year=
Ettrl: Balancing exploration and exploitation in llm test-time reinforcement learning via entropy mechanism , author=. arXiv preprint arXiv:2508.11356 , year=
-
[40]
Advances in Neural Information Processing Systems , volume=
Serl: Self-play reinforcement learning for large language models with limited data , author=. Advances in Neural Information Processing Systems , volume=
-
[41]
arXiv preprint arXiv:2508.12338 , year=
Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback , author=. arXiv preprint arXiv:2508.12338 , year=
-
[42]
Advances in neural information processing systems , volume=
Right question is already half the answer: Fully unsupervised llm reasoning incentivization , author=. Advances in neural information processing systems , volume=
-
[43]
Advances in Neural Information Processing Systems , volume=
Consistent paths lead to truth: Self-rewarding reinforcement learning for llm reasoning , author=. Advances in Neural Information Processing Systems , volume=
-
[44]
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO: Scaling Test-time Training for Large Reasoning Models , author=. arXiv preprint arXiv:2604.19295 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Advances in neural information processing systems , volume=
Learning with noisy labels , author=. Advances in neural information processing systems , volume=
-
[46]
Exgrpo: Learning to reason from experience , author=. arXiv preprint arXiv:2510.02245 , year=
-
[47]
Rate or Fate? RLV ^
Rad, Ali and Filom, Khashayar and Keivan, Darioush and Esfahani, Peyman Mohajerin and Kamalinejad, Ehsan , journal=. Rate or Fate? RLV ^
-
[48]
Spurious Rewards: Rethinking Training Signals in RLVR
Spurious rewards: Rethinking training signals in rlvr , author=. arXiv preprint arXiv:2506.10947 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[49]
arXiv preprint arXiv:2603.16140 , year=
Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards , author=. arXiv preprint arXiv:2603.16140 , year=
-
[50]
Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Reinforcement learning with verifiable yet noisy rewards under imperfect verifiers , author=. arXiv preprint arXiv:2510.00915 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[51]
arXiv preprint arXiv:2505.22653 , year=
The climb carves wisdom deeper than the summit: On the noisy rewards in learning to reason , author=. arXiv preprint arXiv:2505.22653 , year=
-
[52]
arXiv preprint arXiv:2505.22203 , year=
From Accuracy to Robustness: A Study of Rule-and Model-based Verifiers in Mathematical Reasoning , author=. arXiv preprint arXiv:2505.22203 , year=
-
[53]
Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V
Can prompt difficulty be online predicted for accelerating rl finetuning of reasoning models? , author=. Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 1 , pages=
-
[54]
Cancer , volume=
Index for rating diagnostic tests , author=. Cancer , volume=. 1950 , publisher=
1950
-
[55]
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? , author=. arXiv preprint arXiv:2504.13837 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[56]
Can LLMs Learn to Reason Robustly under Noisy Supervision?
Can LLMs Learn to Reason Robustly under Noisy Supervision? , author=. arXiv preprint arXiv:2604.03993 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[57]
IEEE transactions on pattern analysis and machine intelligence , volume=
Virtual adversarial training: a regularization method for supervised and semi-supervised learning , author=. IEEE transactions on pattern analysis and machine intelligence , volume=. 2018 , publisher=
2018
-
[58]
Proceedings of the IEEE/CVF international conference on computer vision , pages=
Dual student: Breaking the limits of the teacher in semi-supervised learning , author=. Proceedings of the IEEE/CVF international conference on computer vision , pages=
-
[59]
IEEE Transactions on Smart Grid , volume=
Detecting false data injection attacks in smart grids: A semi-supervised deep learning approach , author=. IEEE Transactions on Smart Grid , volume=. 2020 , publisher=
2020
-
[60]
Proceedings of the AAAI conference on artificial intelligence , volume=
Curriculum labeling: Revisiting pseudo-labeling for semi-supervised learning , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[61]
Proceedings of the AAAI Conference on Artificial Intelligence , volume=
DiCaP: Distribution-Calibrated Pseudo-labeling for Semi-Supervised Multi-Label Learning , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
-
[62]
Advances in neural information processing systems , volume=
Fixmatch: Simplifying semi-supervised learning with consistency and confidence , author=. Advances in neural information processing systems , volume=
-
[63]
Advances in neural information processing systems , volume=
Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling , author=. Advances in neural information processing systems , volume=
-
[64]
arXiv preprint arXiv:2205.07246 , year=
Freematch: Self-adaptive thresholding for semi-supervised learning , author=. arXiv preprint arXiv:2205.07246 , year=
-
[65]
Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning,
Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning , author=. arXiv preprint arXiv:2301.10921 , year=
-
[66]
International Conference on Learning Representations , volume=
Semireward: A general reward model for semi-supervised learning , author=. International Conference on Learning Representations , volume=
-
[67]
Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
Cgmatch: A different perspective of semi-supervised learning , author=. Proceedings of the Computer Vision and Pattern Recognition Conference , pages=
-
[68]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
Qwen2. 5-math technical report: Toward mathematical expert model via self-improvement , author=. arXiv preprint arXiv:2409.12122 , year=
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.