GEOALIGN: Geometric Rollout Curation for Robust LLM Reinforcement Learning

Daoyuan Chen; Ting Zhou; Ying Shen; Yiyang Zhao; Zhenqing Ling

arxiv: 2606.26917 · v1 · pith:DWZ5RXX7new · submitted 2026-06-25 · 💻 cs.LG · cs.AI

Ting Zhou, Zhenqing Ling, Yiyang Zhao, Ying Shen, Daoyuan Chen

Pith reviewed 2026-06-26 04:46 UTC · model grok-4.3

keywords: LLM reinforcement learning · rollout curation · directional inconsistency · geometric projector · policy optimization · training stability · reward model alignment

The pith

Geoalign curates LLM rollouts by angular deviation from batch consensus to reduce training instability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that directional inconsistency, where a few high-reward rollouts create preference directions in representation space that clash with the batch majority, produces high-variance updates during online RL for LLMs. It introduces Geoalign as a plug-in that builds within-prompt preference pairs, trains an online projector on hidden states to align with reward order, and swaps out rollouts whose angle from the batch consensus prototype is too large. The method runs with forward passes only and yields higher final performance plus lower oscillation on dialogue tasks with learned rewards and math tasks with binary verification. Readers might care because RL alignment of LLMs is often unstable under noisy rewards, and this offers a lightweight geometric test for rollout reliability.

Core claim

Within each batch a small set of high-reward rollouts can induce representation-space directions that sharply disagree with the majority, producing destabilizing updates; Geoalign forms within-prompt pairs, learns an online projector on hidden states to concentrate reward-ordered displacements, and replaces directionally inconsistent rollouts with within-prompt alternatives whose angle to the batch consensus prototype is smaller.
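
The first step of the core claim, forming within-prompt preference pairs, can be sketched as below. This is a minimal illustration, not the authors' implementation: the rollout record layout (`id`, `prompt_id`, `reward`, `hidden`), the function name, and the choice to enumerate every non-tied pair are assumptions. The only detail taken from the page is that displacements point from the low-reward rollout to the high-reward rollout of the same prompt.

```python
from itertools import combinations


def within_prompt_displacements(rollouts):
    """Build low-reward -> high-reward displacement vectors per prompt.

    `rollouts` is a list of dicts with hypothetical keys:
    'id', 'prompt_id', 'reward' (float), 'hidden' (list of floats).
    """
    # Group rollouts by the prompt that produced them.
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r["prompt_id"], []).append(r)

    pairs = []
    for pid, group in by_prompt.items():
        for a, b in combinations(group, 2):
            if a["reward"] == b["reward"]:
                continue  # assumption: tied rewards carry no preference order
            lo, hi = (a, b) if a["reward"] < b["reward"] else (b, a)
            # Displacement in representation space, from loser to winner.
            disp = [h - l for h, l in zip(hi["hidden"], lo["hidden"])]
            pairs.append({
                "prompt_id": pid,
                "winner": hi["id"],
                "loser": lo["id"],
                "displacement": disp,
            })
    return pairs
```

With binary verified rewards, a prompt whose rollouts are all correct or all incorrect contributes no pairs under this sketch, since every pair is tied.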

What carries the argument

The batch consensus prototype: the average direction in the learned projector space, against which each rollout's angular deviation is measured in order to identify and replace destabilizing rollouts.
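
A minimal sketch of the prototype and the angular test follows. It assumes the displacement vectors have already been mapped through the projector, that the prototype is the renormalized mean of unit-normalized directions, and that deviation is reported in degrees; the page does not give the exact formulas, so the function names and these choices are illustrative only.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("a zero vector has no direction")
    return [x / norm for x in v]


def consensus_prototype(displacements):
    """Mean of the unit directions, renormalized (assumed definition)."""
    units = [unit(d) for d in displacements]
    dim = len(units[0])
    mean = [sum(u[j] for u in units) / len(units) for j in range(dim)]
    return unit(mean)  # raises if the directions cancel out exactly


def angular_deviation(displacements):
    """Angle in degrees between each direction and the batch prototype."""
    proto = consensus_prototype(displacements)
    angles = []
    for d in displacements:
        cos = sum(a * b for a, b in zip(unit(d), proto))
        cos = max(-1.0, min(1.0, cos))  # guard against rounding error
        angles.append(math.degrees(math.acos(cos)))
    return angles
```

For example, `angular_deviation([[1, 0], [1, 0.1], [0.9, -0.1], [-1, 0.2]])` assigns by far the largest angle to the last vector, which points against the other three.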

If this is right

Final performance rises on both dialogue alignment with learned reward models and mathematical reasoning with verified binary rewards.
Training oscillation decreases relative to standard online policy optimization.
The method outperforms PF-PPO, PAR, PODS, and Seed-GRPO on the reported tasks.
The curation adds negligible compute because it requires only forward passes.
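
The page does not say how training oscillation is quantified. One simple proxy, assumed here purely for illustration, is the mean absolute change between consecutive points of an evaluation curve; the function name and the sample curves below are hypothetical and are not results from the paper.

```python
from statistics import mean


def oscillation(curve):
    """Mean absolute step-to-step change of an evaluation curve.

    This is an assumed proxy, not the paper's metric.
    """
    if len(curve) < 2:
        raise ValueError("need at least two evaluation points")
    return mean(abs(b - a) for a, b in zip(curve, curve[1:]))


# Illustrative curves only: one steady, one jagged.
smooth = [0.1, 0.2, 0.3, 0.4]
jagged = [0.1, 0.4, 0.2, 0.4]
print(oscillation(smooth) < oscillation(jagged))
```

A proxy like this is sensitive to the evaluation interval, so curves should be compared at matching steps.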

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same geometric test might be tried in non-LLM online RL settings where batch-internal rollout quality varies.
If the learned projector is essential, a follow-up could check whether a fixed linear projector works equally well.
Directional consensus could be explored as a filtering signal during supervised stages that precede RL.

Load-bearing premise

Angular deviation from the batch consensus prototype in the learned projector space reliably marks rollouts that cause destabilizing updates rather than merely reflecting reward noise or prompt difficulty.

What would settle it

An experiment in which the same RL training runs are repeated with and without Geoalign curation, and the policy-update variance remains unchanged after the curation step or the flagged rollouts prove unrelated to the observed instability when isolated.

Figures

Figures reproduced from arXiv: 2606.26917 by Daoyuan Chen, Ting Zhou, Ying Shen, Yiyang Zhao, Zhenqing Ling.

**Figure 1.** Geometric turbulence in preference latent space: within one update step, most preference-induced directions align, while a small fraction forms angular outliers that can destabilize training. We call this failure mode directional inconsistency: within an RL iteration, a small set of rollouts induces preference-implied improvement directions that are strong angular outliers relative to the batch consensus,…

**Figure 2.** GEOALIGN overview. At each iteration, we form within-prompt preference displacements from low-reward to high-reward rollouts, project them onto a reward-sensitive manifold, construct a batch-wise consensus prototype, score rollouts by directional inconsistency, and rectify the experience buffer by replacing anomalous experiences with stable ones from the same prompt (equivalently, neutralizing their contri…

**Figure 3.** Anomaly score–rank distribution. Geometric inconsistency exhibits a long-tail pattern, motivating targeted intervention on a small subset of rollouts. This criterion enables GEOALIGN to automatically detect density collapses and isolate high-score anomalies that deviate from the collective improvement manifold. The remaining rollouts form the stable set Y_stable. Replacement within the same prompt. For e…

**Figure 4.** Training dynamics on evaluation benchmarks. GEOALIGN yields smoother learning curves and higher final evaluation scores than robust RL baselines on DAPO-Math (binary verified reward) and HH-RLHF (continuous RM reward).

**Figure 5.** Injected reward errors concentrate among high-GDI rollouts. Each panel ranks rollouts by GDI (blue line) for Math (left) and HH-RLHF (right). Black crosses denote rollouts whose rewards were corrupted, showing that injected errors are disproportionately located in the high-GDI region.

**Figure 6.** Visualization of latent vectors before (a, b) and after (c, d) projection. Raw vectors (a) are scattered, with widely dispersed cosine similarities to the centroid (b). After projection, latent vectors (c) become highly aligned, with nearly 90% achieving > 0.8 cosine similarity with the prototype (d). This demonstrates the projector’s role in enabling robust consensus detection. 5.4. Analysis and Ablation …

**Figure 7.** Evaluation curves for different rollout curation strategies on Math (left) and HH-RLHF (right).

**Figure 8.** Distribution of normalized (left) vs. cumulative (right) anomaly scores on the Math task. Cumulative scores provide a wider dynamic range, better distinguishing severe outliers. To further illustrate this, we explicitly inject 5% flip noise into the Math task

**Figure 9.** The projector learns a robust signal quickly and effectively. (a) Training and validation accuracy curves show rapid convergence without overfitting across all settings. (b) Box plots confirm consistently high median validation accuracy throughout training.

**Figure 10.** Training dynamics over the first 100 steps for GEOALIGN (red) and BASE-GRPO (black) across three random seeds on Qwen3-1.7B. The tight clustering of curves for both methods highlights the stability of the training process, with GEOALIGN consistently following a stable trajectory. Furthermore, to quantify the stability of evaluation outcomes, we report the mean and standard deviation of the final evaluatio…

**Figure 11.** PPO KL divergence and clipping fraction throughout training on Qwen3-1.7B across three random seeds. (a) AIME 2024 (b) AIME 2025 (c) AMC 2023

**Figure 12.** Pass@k curves on AIME24/25 (n = 64) and AMC23 (n = 32). GEOALIGN maintains a slope equal to or steeper than BASE-GRPO across all k, confirming that solution diversity is preserved. C.5. Ablation on Rectification Strategy We ablate the rectification mechanism in Sec. 4.4 by comparing three strategies for handling anomalous rollouts: Replacement (the default in GEOALIGN, which substitutes anomalous rollout…
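
Figure 12 reports Pass@k from n = 64 or n = 32 samples per problem. The page does not state how Pass@k is computed; a widely used choice is the unbiased combinatorial estimator 1 - C(n - c, k) / C(n, k), where c is the number of correct samples. The sketch below assumes that estimator, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples, c of which are correct.

    Assumes the standard combinatorial estimator; the paper may differ.
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie in [0, n]")
    if not 1 <= k <= n:
        raise ValueError("k must lie in [1, n]")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 64 and c = 16, `pass_at_k(64, 16, 1)` reduces to the plain accuracy 16/64, and the estimate rises with k, which is what the slope in a Pass@k curve reflects.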

Original abstract

Online reinforcement learning is widely used to align large language models (LLMs) with reward signals, yet training can be unstable under noisy or misspecified rewards. We identify a failure mode we call directional inconsistency: within a batch, a small set of high-reward rollouts induces representation-space preference directions that sharply disagree with the batch majority, resulting in high-variance and destabilizing updates. We propose Geoalign, a lightweight plug-in for rollout curation in iterative policy optimization. Geoalign (i) forms within-prompt preference pairs, (ii) learns an online projector on per-rollout hidden states to concentrate reward-ordered displacement directions, and (iii) detects directionally inconsistent rollouts via their angular deviation from a batch consensus prototype and rectifies them with within-prompt stable alternatives. Geoalign is forward-pass only and adds negligible overhead. Across dialogue alignment with a learned reward model and mathematical reasoning with binary verified rewards, Geoalign improves final performance and reduces training oscillation, outperforming PF-PPO, PAR, PODS, and Seed-GRPO. These results suggest latent directional consensus as an effective reliability signal for online LLM RL.
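
Step (iii) of the abstract, rectifying flagged rollouts with stable within-prompt alternatives, might look roughly like the sketch below. Everything specific here is an assumption: per-rollout anomaly scores are taken as given, a fixed `threshold` decides what counts as anomalous, the replacement is the lowest-scoring stable rollout from the same prompt, and a flagged rollout is kept when its prompt has no stable alternative.

```python
def rectify(rollouts, scores, threshold):
    """Replace high-score rollouts with a stable rollout of the same prompt.

    `rollouts` are dicts with hypothetical keys 'id' and 'prompt_id';
    `scores` are per-rollout anomaly scores (higher means more anomalous).
    Returns the curated list and the number of replacements made.
    """
    # Pick the most stable (lowest-score) unflagged rollout for each prompt.
    stable = {}
    for r, s in zip(rollouts, scores):
        if s <= threshold:
            best = stable.get(r["prompt_id"])
            if best is None or s < best[0]:
                stable[r["prompt_id"]] = (s, r)

    curated, replaced = [], 0
    for r, s in zip(rollouts, scores):
        if s > threshold and r["prompt_id"] in stable:
            curated.append(stable[r["prompt_id"]][1])
            replaced += 1
        else:
            curated.append(r)  # stable, or no within-prompt alternative exists
    return curated, replaced
```

Replacing rather than dropping keeps the batch size and the per-prompt group size unchanged, which matters for group-normalized objectives such as GRPO.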

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Geoalign offers a practical geometric filter for LLM rollout curation that reports stability gains, but the paper does not derive why angular deviation from batch consensus specifically reduces update variance.

Geoalign is a plug-in for online LLM RL that forms within-prompt pairs, trains an online projector on hidden states to focus reward-ordered directions, and replaces rollouts with high angular deviation from a batch prototype with stable alternatives from the same prompt. The authors claim this fixes directional inconsistency and leads to better final performance plus less oscillation on dialogue tasks with learned rewards and math tasks with verified rewards, beating PF-PPO, PAR, PODS, and Seed-GRPO.

The combination of the online projector and angular filter is the main new piece. The method stays forward-pass only and adds almost no cost, which is a practical plus for people already running iterative policy optimization.

The results look consistent across the two reward regimes described. That is the part worth noting.

The soft spot is exactly the one flagged under the load-bearing premise above. Nothing in the construction shows why deviation in the projector space flags rollouts that increase policy-gradient variance instead of just marking atypical rewards or harder prompts. No derivation ties the angles to the variance term, and no ablation isolates the geometric step from simpler outlier removal. The gains are therefore compatible with other curation heuristics. This gap is central rather than minor.

The paper is aimed at practitioners who need quick stability fixes in LLM alignment pipelines. A reader already working on rollout selection or consistency checks would get the most from the empirical side.

It deserves peer review because the instability problem is real and the method is concrete enough to test, even if the justification needs more work.

Referee Report

2 major / 2 minor

Summary. The paper claims that online RL for LLMs suffers from 'directional inconsistency' where a minority of high-reward rollouts induce disagreeing preference directions in hidden-state space, causing high-variance updates. It proposes Geoalign as a plug-in that (i) forms within-prompt pairs, (ii) trains an online projector on hidden states to concentrate reward-ordered displacements, and (iii) culls high angular-deviation rollouts from a batch consensus prototype, replacing them with stable alternatives. Experiments on dialogue alignment (learned RM) and math reasoning (binary verified rewards) report improved final performance and reduced oscillation versus PF-PPO, PAR, PODS, and Seed-GRPO.

Significance. If the geometric filter is shown to specifically target variance-inducing directions rather than acting as a proxy for reward magnitude or prompt difficulty, the approach would supply a lightweight, forward-pass-only curation signal that could stabilize iterative LLM policy optimization across reward types without extra models or heavy computation.

major comments (2)

[Section 3 (Method)] The core claim that angular deviation from the batch consensus prototype isolates destabilizing rollouts (rather than reward noise or prompt difficulty) lacks a derivation relating the projector angles to the policy-gradient variance term. The method description shows how the projector is fit and how deviation is measured, but provides no analysis or bound connecting the geometric test to update variance.
[Section 4 (Experiments)] Because the projector is trained online on the same rollouts it later filters, the reported gains are compatible with simpler curation heuristics; an ablation that removes the geometric component (or replaces it with reward-based or random selection) while keeping the within-prompt pairing is needed to establish that the angular test is load-bearing.
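
To make the first major comment concrete, here is one elementary identity that holds in the projector space. It is an illustrative observation under stated assumptions, not a result from the paper: the symbols $u_i$, $\bar{u}$, $p$, and $\theta_i$ are introduced here, the $u_i$ are assumed to be unit-normalized projected displacement directions, and the prototype is assumed to be their renormalized mean.

```latex
% Assumptions: \|u_i\| = 1 for i = 1..n, \bar{u} = \frac{1}{n}\sum_i u_i \neq 0,
% prototype p = \bar{u} / \|\bar{u}\|, and \cos\theta_i = u_i^\top p.
\begin{align}
\frac{1}{n}\sum_{i=1}^{n} \cos\theta_i
  &= \bar{u}^\top p = \|\bar{u}\|, \\
\frac{1}{n}\sum_{i=1}^{n} \|u_i - \bar{u}\|^2
  &= \frac{1}{n}\sum_{i=1}^{n} \|u_i\|^2 - \|\bar{u}\|^2
   = 1 - \Big(\frac{1}{n}\sum_{i=1}^{n} \cos\theta_i\Big)^{2}.
\end{align}
```

Under these assumptions, the spread of directions around their mean is fully determined by the average cosine to the prototype, so replacing large-angle directions with small-angle ones lowers that spread. This says nothing yet about the variance of the policy gradient itself; connecting the two would require an additional argument about how projected displacements relate to parameter-space updates, which is the gap the comment identifies.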

minor comments (2)

[Section 3] Notation for the projector and consensus prototype should be introduced with explicit equations rather than prose descriptions to allow readers to verify the forward-pass claim.
[Abstract / Section 4] The abstract states 'negligible overhead' but the paper would benefit from a short table or sentence reporting wall-clock or FLOPs added per iteration.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and commit to revisions that strengthen the theoretical motivation and experimental validation of Geoalign.

Referee: [Section 3 (Method)] The core claim that angular deviation from the batch consensus prototype isolates destabilizing rollouts (rather than reward noise or prompt difficulty) lacks a derivation relating the projector angles to the policy-gradient variance term. The method description shows how the projector is fit and how deviation is measured, but provides no analysis or bound connecting the geometric test to update variance.

Authors: We agree that the manuscript does not contain a formal derivation or bound explicitly linking angular deviation in the projected hidden-state space to the variance of the policy-gradient updates. The approach is motivated by the observation that directionally inconsistent high-reward rollouts produce opposing preference signals, which empirically manifests as training oscillation; however, this connection is currently supported only by the reported stability improvements rather than by a direct variance analysis. In the revision we will expand Section 3 with a discussion that relates the learned projector (which concentrates reward-ordered displacements) to gradient variance, including an illustrative derivation under simplified assumptions on the geometry of the hidden-state representations and how deviations from the batch consensus prototype increase the magnitude of conflicting update directions. revision: yes
Referee: [Section 4 (Experiments)] Because the projector is trained online on the same rollouts it later filters, the reported gains are compatible with simpler curation heuristics; an ablation that removes the geometric component (or replaces it with reward-based or random selection) while keeping the within-prompt pairing is needed to establish that the angular test is load-bearing.

Authors: We acknowledge that the current experimental section does not contain an internal ablation that retains the within-prompt pairing mechanism while replacing the angular-deviation filter with reward-magnitude or random selection. Such a control is necessary to isolate the contribution of the geometric test. We will add this ablation to Section 4 in the revised manuscript, evaluating the full Geoalign pipeline against variants that use reward-based or random replacement on both the dialogue-alignment and math-reasoning tasks, to test whether the angular test is the load-bearing component of the observed gains in final performance and reduced oscillation. revision: yes
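
The requested control could be organized around a single selector that decides which rollouts get replaced, with everything else held fixed. The sketch below is hypothetical: the strategy names, the function signature, and in particular the reward-based rule (largest absolute deviation from the batch mean reward) are assumptions, since neither the referee comment nor the rebuttal specifies them.

```python
import random


def flag_indices(scores, rewards, k, strategy, seed=0):
    """Return the indices of the k rollouts to replace under one strategy.

    'angular': highest anomaly score first (the geometric test).
    'reward' : largest |reward - batch mean| first (assumed baseline rule).
    'random' : uniformly random choice, reproducible via `seed`.
    """
    n = len(scores)
    if len(rewards) != n:
        raise ValueError("scores and rewards must have the same length")
    if strategy == "angular":
        order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    elif strategy == "reward":
        mean_r = sum(rewards) / n
        order = sorted(range(n), key=lambda i: abs(rewards[i] - mean_r),
                       reverse=True)
    elif strategy == "random":
        order = list(range(n))
        random.Random(seed).shuffle(order)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(order[:k])
```

Holding `k` equal across strategies keeps the amount of intervention constant, so any difference in outcome can be attributed to which rollouts are chosen rather than how many.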

Circularity Check

0 steps flagged

No circularity: empirical curation method with no self-referential derivation or fitted prediction presented as first-principles result

The paper describes an online projector learned from rollout hidden states to filter via angular deviation from batch consensus, followed by empirical gains on dialogue and math tasks. No equations, uniqueness theorems, or self-citations are shown that would reduce the claimed causal link or performance improvement to the fitting process itself by construction. The projector and filter are presented as a plug-in heuristic whose value is assessed by comparison against external baselines (PF-PPO, PAR, etc.), satisfying the self-contained criterion. No load-bearing step equates the geometric test to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; the central claim rests on the unstated assumption that representation-space angular deviation is a faithful proxy for update instability, plus whatever parameters are inside the learned projector and the batch-consensus computation. No explicit free parameters, axioms, or invented entities are named.



[1] [1]

Training a helpful and harmless assistant with rein- forcement learning from human feedback.arXiv preprint arXiv:2204.05862,

Bai, Y ., Jones, A., Ndousse, K., Askell, A., Chen, A., Das- Sarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with rein- forcement learning from human feedback.arXiv preprint arXiv:2204.05862,

Pith/arXiv arXiv

[2] [2]

Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346,

Chen, M., Chen, G., Wang, W., and Yang, Y . Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization.arXiv preprint arXiv:2505.12346,

arXiv

[3] [3]

Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025a

Cheng, J., Xiong, G., Qiao, R., Li, L., Guo, C., Wang, J., Lv, Y ., and Wang, F.-Y . Stop summation: Min-form credit assignment is all process reward model needs for reasoning.arXiv preprint arXiv:2504.15275, 2025a. Cheng, Z., Hao, S., Liu, T., Zhou, F., Xie, Y ., Yao, F., Bian, Y ., Zhuang, Y ., Dey, N., Zha, Y ., et al. Revisiting reinforcement learning...

arXiv

[4] [4]

Dao, T., Fu, D., Ermon, S., Rudra, A., and R´e, C

URL https://arxiv.org/ abs/2506.21545. Dao, T., Fu, D., Ermon, S., Rudra, A., and R´e, C. Flashat- tention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359,

arXiv

[5] [5]

Reward shaping to mitigate reward hacking in rlhf

Fu, J., Zhao, X., Yao, C., Wang, H., Han, Q., and Xiao, Y . Reward shaping to mitigate reward hacking in rlhf. InICML 2025 Workshop on Reliable and Responsible Foundation Models,

2025

[6] [6]

Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring math- ematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

Pith/arXiv arXiv

[7] [7]

W., and Zhao, H

Hu, Y ., Wu, F., Ye, H., Forsyth, D., Zou, J., Jiang, N., Ma, J. W., and Zhao, H. A snapshot of influence: A local data attribution framework for online reinforcement learning. arXiv preprint arXiv:2505.19281,

arXiv

[8] [8]

Regularized best-of-n sampling with minimum bayes risk objective for language model alignment

Jinnai, Y ., Morimura, T., Ariu, K., and Abe, K. Regularized best-of-n sampling with minimum bayes risk objective for language model alignment. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 9321–9347,

2025

[9] [9]

Learnalign: Reasoning data selec- tion for reinforcement learning in large language models based on improved gradient alignment.arXiv preprint arXiv:2506.11480, 2025a

Li, S., Li, S., Yang, Z., Zhang, X., Chen, G., Xia, X., Liu, H., and Peng, Z. Learnalign: Reasoning data selec- tion for reinforcement learning in large language models based on improved gradient alignment.arXiv preprint arXiv:2506.11480, 2025a. Li, X., Zou, H., and Liu, P. Limr: Less is more for rl scaling. arXiv preprint arXiv:2502.11886, 2025b. Li, Y ....

Pith/arXiv arXiv 2024

[10] [10]

Trinity-rft: A general-purpose and unified framework for reinforcement fine-tuning of large language models.arXiv preprint arXiv:2505.17826,

Pan, X., Chen, Y ., Chen, Y ., Sun, Y ., Chen, D., Zhang, W., Xie, Y ., Huang, Y ., Zhang, Y ., Gao, D., et al. Trinity-rft: A general-purpose and unified framework for reinforcement fine-tuning of large language models.arXiv preprint arXiv:2505.17826,

arXiv

[11] [11]

K., Wu, Y ., and Guo, D

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., Li, Y . K., Wu, Y ., and Guo, D. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.CoRR, abs/2402.03300,

[12]

Shihab, I. F., Akter, S., and Sharma, A. Detecting and mitigating reward hacking in reinforcement learning systems: A comprehensive empirical study. arXiv preprint arXiv:2507.05619,

[13]

Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. arXiv preprint arXiv:2406.12845,

[14]

Xu, Y. E., Savani, Y., Fang, F., and Kolter, J. Z. Not all rollouts are useful: Down-sampling rollouts in llm reinforcement learning. arXiv preprint arXiv:2504.13818,

[15]

Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024a.

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zhen...

[16]

Yang, R., Ding, R., Lin, Y., Zhang, H., and Zhang, T. Regularizing hidden states enables learning generalizable reward model for llms. Advances in Neural Information Processing Systems, 37:62279–62309, 2024b.

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learni...

[17]

Zhang, C., Shen, W., Zhao, L., Zhang, X., Xu, X., Dou, W., and Bian, J. Policy filtration for rlhf to mitigate noise in reward models. In Forty-second International Conference on Machine Learning, 2025a.

Zhang, C., Shen, W., Zhao, L., Zhang, X., Xu, X., Dou, W., and Bian, J. Policy filtration for RLHF to mitigate noise in reward models. In Forty-second Inte...

[18]

for training. This dataset consists of 17,398 problems, primarily from competitive mathematics, with integer-based answers. The verified and binary reward structure (1.0 for a correct answer, 0.0 otherwise) provides a clear, objective signal for improving logical and computation...

[19]

via flash-attention (Dao et al., 2022). We utilize GRPO (Shao et al.,

[20]

Tie" should be extremely rare. Return ONLY a JSON object: {

(1.7B-Base and 4B-Base) undergo supervised fine-tuning (SFT) on the HH-RLHF dataset to align with the desired conversational style. Subsequently, these SFT models are further optimized using RL. The reward signal is continuous, obtained by querying an external reward model, ArmoRM (Wang et al., 2024). Given the rich, continuous feedback from the reward mode...

[21]

Table 6. Key hyperparameters for SFT and RL across different models and domains. "Inst." is an abbreviation for Instruct-tuned models.

Hyperparameter | RLHF (SFT) 1.7B | RLHF (SFT) 4B | RLHF (RL) 1.7B | RLHF (RL) 4B | Mathematics (RL) 1.7B | Mathematics (RL) 4B
Base Model | Qwen3-1.7B-Base | Qwen3-4B-Base | SFT-tuned | SFT-tuned | Qwen3-1.7B-Inst. | Qwen3-4B-Inst.
Common Training Parameters
Optimizer | AdamW | AdamW | Adam...

[22]

aggregates these directions across pairs, so correct but creative solutions naturally shape the consensus. Rollouts are flagged as outliers only when their implied directions contradict this collective flow, typically due to reward hacking or logical inconsistency rather than genuine novelty. To directly test whether GEOALIGN suppresses solution diversity,...

[23]

23.98 19.50 57.75 74.52 20.22 42.90 39.81
PF-PPO (BR) (Zhang et al., 2025a) 22.75 16.87 53.75 73.64 19.19 43.50 38.28
PF-PPO (BW) (Zhang et al., 2025a) 22.83 19.38 55.37 73.40 19.41 42.31 38.78
PAR (Fu et al.,

[24]

23.08 17.08 59.75 75.16 19.93 42.49 39.58
SEED-GRPO (Chen et al., 2025) 24.25 19.48 59.62 74.60 19.85 42.99 40.13
GEOALIGN (Ours) 24.17 21.67 58.50 75.28 20.40 42.62 40.44

Table 14. Comparison of different methods on 4B-Math Benchmarks
AIME24 AIME25 AMC23 MATH500 Minerva Olympiadbench Avg.
Qwen3-4B 23.33 20.42 65.63 77.20 21.32 44.89 42.13
BASE-GRPO (Shao et al.,

[25]

41.17 36.67 85.13 83.68 26.76 55.32 54.79
PF-PPO (BR) (Zhang et al., 2025a) 42.08 34.33 84.25 84.48 26.54 54.84 54.42
PF-PPO (BW) (Zhang et al., 2025a) 39.58 34.42 80.75 83.36 25.00 55.11 53.04
PAR (Fu et al.,

[26]

43.00 35.92 85.50 84.52 25.81 56.09 55.14
SEED-GRPO (Chen et al., 2025) 44.67 37.58 83.75 84.56 26.69 56.18 55.57
GEOALIGN (Ours) 43.00 40.33 85.62 84.68 25.37 56.65 55.94

D.2. Complete Results of Overall Performance on Math

We present the complete scores of Qwen3-1.7B and Qwen3-4B on the HH-RLHF task in the validation experiments in Tables

[27]

0.8311 0.8397 0.8354 0.8562 0.8782 0.8672
+ PF-PPO (BR) (Zhang et al., 2025a) 0.8630 0.8482 0.8556 0.8472 0.9069 0.8771
+ PF-PPO (BW) (Zhang et al., 2025a) 0.8409 0.8383 0.8396 0.8462 0.8858 0.8660
+ PAR (Fu et al.,

[28]

0.8330 0.8518 0.8424 0.8622 0.8864 0.8743
+ GEOALIGN (Ours) 0.8806 0.8963 0.8885 0.8721 0.9067 0.8894

Table 16. Evaluation win-rate of different models on the HH-RLHF dataset
Models | Qwen3-1.7B | Qwen3-4B
GEOALIGN vs Baselines | Win Lose Tie | Win Lose Tie
+ BASE-GRPO (Shao et al.,
