Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Guozheng Li; Xiyan Fu; Yiwen Guo

arxiv: 2606.10528 · v1 · pith:WD3VKSROnew · submitted 2026-06-09 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-27 13:37 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords reinforcement learning from human feedback · advantage estimation · reward model · representation learning · graph propagation · RLHF · preference alignment · sample efficiency

The pith

Reward model hidden states, modeled as graphs of response similarity, produce better advantage estimates than scalar rewards alone in RLHF.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that scalar rewards from reward models are noisy and miss fine preference distinctions, while the same models' hidden states hold richer semantic signals. It proposes treating groups of responses as graphs with edges set by hidden-state similarity, then computing advantages through propagation so each sample draws context from its neighbors. This GraphAE method slots into existing group-based RL algorithms without major changes. A sympathetic reader would care because it promises to make RLHF training more efficient and stable using information the reward model already computes.

Core claim

The paper establishes that representation-aware advantage estimation, implemented as Graph-based Advantage Estimation, models each sampled group as a graph whose nodes are responses and whose edges reflect similarity in the reward model's hidden space; advantages are then obtained by propagating information across these edges, allowing each sample to incorporate contextual signals from neighbors and yielding more accurate estimates than scalar rewards alone.

What carries the argument

Graph-based Advantage Estimation (GraphAE), which constructs a graph over responses using similarity edges in reward-model hidden space and computes advantages by propagation across those edges.
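The abstract gives no equations, so the following is only a minimal sketch of what "propagation across similarity edges" could look like. It assumes exponentiated cosine similarity for the edges and Laplacian-regularized smoothing solved as a G × G linear system, followed by GRPO-style standardization. The function names and the `temperature` and `lam` parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def cosine_similarity_graph(hidden, temperature=1.0):
    """Dense symmetric graph over G responses from RM hidden states (G x d).

    Assumption: edge weight = exp((cos_sim - 1) / temperature), which lies in (0, 1].
    """
    hidden = np.asarray(hidden, dtype=float)
    norms = np.clip(np.linalg.norm(hidden, axis=1, keepdims=True), 1e-12, None)
    unit = hidden / norms
    weights = np.exp((unit @ unit.T - 1.0) / temperature)
    np.fill_diagonal(weights, 0.0)  # no self-loops
    return weights

def propagate_rewards(rewards, weights, lam=1.0):
    """Solve (I + lam * L) f = r, where L = D - W is the graph Laplacian."""
    rewards = np.asarray(rewards, dtype=float)
    laplacian = np.diag(weights.sum(axis=1)) - weights
    system = np.eye(rewards.size) + lam * laplacian
    return np.linalg.solve(system, rewards)

def graph_advantages(rewards, hidden, lam=1.0, eps=1e-8):
    """Standardize the smoothed rewards within the group (GRPO-style)."""
    smoothed = propagate_rewards(rewards, cosine_similarity_graph(hidden), lam)
    return (smoothed - smoothed.mean()) / (smoothed.std() + eps)
```

With `lam=0` the sketch reduces to the plain scalar rewards, and for any `lam > 0` on a connected graph the smoothed rewards keep the group mean while their spread shrinks, which is the kind of generic smoothing the editorial analysis asks the authors to separate from RM-specific signal.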

If this is right

GraphAE integrates into GRPO, GSPO and RLOO and produces consistent gains on Arena-Hard-v0.1, AlpacaEval 2.0 and MT-Bench.
RLHF training becomes more sample-efficient because each response benefits from contextual information drawn from similar neighbors.
The method remains lightweight and requires only the hidden states already produced by a standard reward model.
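For orientation, the scalar-only advantages that a representation-aware estimator would replace can be sketched as follows. These are the commonly used forms of the GRPO group-normalized advantage and the RLOO leave-one-out baseline; GSPO is not sketched here, and the exact point at which GraphAE modifies these quantities is an assumption, since this page does not reproduce the paper's equations.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantage: (r_i - mean(r)) / std(r)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def rloo_advantages(rewards):
    """Leave-one-out advantage: r_i minus the mean reward of the other G - 1 samples."""
    rewards = np.asarray(rewards, dtype=float)
    group_size = rewards.size
    leave_one_out_mean = (rewards.sum() - rewards) / (group_size - 1)
    return rewards - leave_one_out_mean
```

Both functions consume only the per-group reward vector, so an estimator that refines that vector before this step would leave the rest of the policy update untouched.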

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reward-model training objectives could be extended to encourage richer internal representations rather than scalar accuracy alone.
The same graph-propagation idea might apply to other reinforcement-learning settings that already compute auxiliary representations.
Different choices of similarity metric or propagation rule could be tested to further refine the advantage estimates.
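To make the last extension concrete, here are two standard edge-weight choices and two standard propagation rules that such a comparison could cover. They are textbook forms written for a group with hidden states h_i, scalar rewards r, weight matrix W, and degree matrix D; none of them is confirmed as the rule GraphAE uses, and σ, λ, and α are illustrative hyperparameters. Rule A assumes symmetric, nonnegative weights, so cosine weights would need clipping or rescaling first.

```latex
% Edge weights: cosine similarity or a Gaussian (RBF) kernel on hidden states
w_{ij}^{\cos} = \frac{h_i^\top h_j}{\lVert h_i \rVert \, \lVert h_j \rVert},
\qquad
w_{ij}^{\mathrm{rbf}} = \exp\!\left(-\frac{\lVert h_i - h_j \rVert^2}{2\sigma^2}\right)

% Rule A: Laplacian-regularized smoothing, with L = D - W
f^\star = \arg\min_{f} \; \lVert f - r \rVert^2 + \lambda\, f^\top L f
        = (I + \lambda L)^{-1} r

% Rule B: damped neighbor averaging, iterated from f^{(0)} = r
f^{(t+1)} = (1 - \alpha)\, r + \alpha\, D^{-1} W f^{(t)}, \qquad 0 \le \alpha < 1
```

Setting λ = 0 in rule A or α = 0 in rule B recovers the unmodified scalar rewards, which gives a natural no-propagation reference point for any such comparison.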

Load-bearing premise

Reward-model hidden states contain preference information that can be captured by similarity edges and usefully propagated to improve advantage estimates.

What would settle it

An experiment that applies the graph-propagation procedure to the same RLHF setups and records no improvement or a decline on the three reported benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.10528 by Guozheng Li, Xiyan Fu, Yiwen Guo.

**Figure 1.** The same RM may assign similar or even reversed scalar scores to opposite responses, while its hidden representations clearly separate them. GT denotes ground truth and RM denotes reward model. Adjacent text from the paper: "… pipelines rely on a scalar reward produced by a reward model (RM) to guide policy updates via policy gradient methods (Sutton et al., 1999). Many policy optimization methods have been developed to improve the sta…"

**Figure 2.** Overview of GraphAE. GraphAE refines advantages by leveraging RM representation structures, acting as …

**Figure 3.** Ablation of graph construction strategies.

**Figure 5.** Denoising effect of GraphAE under GRPO.

| Baseline reward std. | Reward gain | Positive rate |
|---|---|---|
| <2.5 | -1.4 | 38.1% |
| 2.5 to 5.0 | -1.0 | 45.4% |
| 5.0 to 7.5 | +0.6 | 54.0% |
| >7.5 | +10.4 | 70.8% |

**Figure 7.** Checkpoint sample efficiency analysis. Adjacent text from the paper: "… G = 8, and 15.27 ms at G = 32, while the scratch memory stays below 21 KB per group. Since the estimator only solves a small G × G linear system after reward computation, its cost is negligible relative to policy and RM forward passes. Taken together, these results show that GraphAE is both more sample efficient and lightweight in practice. Checkpoint Efficiency. To …"

**Figure 8.** Same-budget training dynamics of group reward std. over the first 30k training steps on GSPO. Recovered plot text: x-axis ticks from 0 to 30k, y-axis "Reward Std." with ticks from 0.02 to 0.08, panels Qwen2.5-7B-Instruct and Llama-3-8B-Instruct, legend RLOO and RLOO+GraphAE.

**Figure 9.** Same-budget training dynamics of group reward std. over the first 30k training steps on RLOO. Adjacent text from the paper: "… training, where vanilla GSPO and RLOO exhibit larger fluctuations in group rewards. By propagating rewards over the RM representation graph, GraphAE produces smoother within-group reward signals while keeping the underlying optimization procedure unchanged. These results complement the GRPO analysis and suggest t…"

Original abstract

Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

GraphAE builds graphs from RM hidden states to propagate advantages in group RLHF methods and reports benchmark gains, but the gains may stem from smoothing rather than richer signals.

The main takeaway is that this paper takes reward model hidden states, turns each group of responses into a graph with edges set by hidden-state similarity, and propagates advantages across those edges. They plug the resulting estimator into GRPO, GSPO, and RLOO and show gains on Arena-Hard, AlpacaEval, and MT-Bench.

What is actually new is the explicit graph construction and propagation step that treats the RM representations as auxiliary signals rather than just using the scalar output. The method is lightweight and requires no extra training, which is a practical strength if it holds up.

The results are presented as consistent improvements, with the largest deltas on the harder benchmarks. That is the part worth noting for anyone already running group-based RLHF.

The soft spot is the missing isolation. The claim rests on the idea that hidden-state similarity carries preference information orthogonal to the scalar reward. Without ablations that compare against random graphs, policy embeddings, or simple distance metrics on the scalar rewards themselves, it is hard to rule out that any neighbor averaging would produce similar smoothing benefits. The abstract also omits error bars, statistical tests, and hyperparameter details, so the reported deltas are difficult to evaluate for robustness.
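The controls named here are cheap to build. As a rough sketch, assuming the estimator accepts any symmetric G × G weight matrix, a random-graph control can reuse the original edge weights in shuffled positions, and a scalar-reward-distance control can be built from the rewards alone. Both function names and the `bandwidth` parameter are hypothetical.

```python
import numpy as np

def random_graph_like(weights, rng):
    """Control 1: shuffle off-diagonal edge weights across node pairs.

    Keeps the weight distribution, symmetry, and zero diagonal, but destroys
    any link between edge strength and hidden-state similarity.
    """
    group_size = weights.shape[0]
    upper = np.triu_indices(group_size, k=1)
    shuffled = rng.permutation(weights[upper])
    control = np.zeros_like(weights, dtype=float)
    control[upper] = shuffled
    return control + control.T

def reward_distance_graph(rewards, bandwidth=1.0):
    """Control 2: Gaussian kernel on scalar reward gaps, using no hidden states."""
    rewards = np.asarray(rewards, dtype=float)
    gaps = rewards[:, None] - rewards[None, :]
    weights = np.exp(-(gaps ** 2) / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)
    return weights
```

If propagation over either control matched the gains from the hidden-state graph, that would support the generic-smoothing explanation raised above; a clear gap in favor of the hidden-state graph would support the paper's reading.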

This is aimed at the RLHF and preference-optimization crowd. Someone already experimenting with GRPO-style methods might get value from the code if it is released, even if they end up modifying the graph construction.

I would send it to peer review. The idea is simple enough to test quickly and the benchmarks are the right ones, but the paper will need clearer controls and statistics before the central claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper claims that reward model (RM) hidden states contain richer semantic and preference information than their scalar outputs alone. It introduces Graph-based Advantage Estimation (GraphAE), which constructs a graph over responses in each sampled group with edges defined by hidden-state similarity and computes advantages via graph propagation. The method is integrated into GRPO, GSPO, and RLOO, with reported gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench, arguing for more sample-efficient and robust RLHF.

Significance. If the gains arise specifically from preference information encoded in RM representations (rather than generic smoothing), the approach would meaningfully improve existing group-based RLHF pipelines by extracting additional signal from already-trained reward models without extra training cost.

major comments (2)

[Section 4] Experiments (Section 4): The reported benchmark improvements lack any ablation that isolates the contribution of RM hidden-state similarity; there are no controls using random graphs, policy-embedding graphs, or scalar-reward-distance graphs. Without these, it is impossible to determine whether the observed deltas (+6.3 Arena-Hard, etc.) result from richer preference information or from the addition of any propagation operator.
[Section 3] Method (Section 3): The graph-construction and propagation procedure is presented without reporting edge-density statistics, the precise similarity metric, or hyperparameter sensitivity; the central claim that hidden states encode information orthogonal to the scalar RM output therefore rests on an untested modeling assumption rather than a controlled demonstration.
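The statistics requested in the second major comment could be computed per group along these lines. This is an illustrative sketch, not the authors' tooling: `edge_density` counts the fraction of response pairs whose weight exceeds a hypothetical `threshold`, and `weight_reward_gap_correlation` gives a crude probe of how much the graph merely restates the scalar rewards.

```python
import numpy as np

def edge_density(weights, threshold=0.5):
    """Fraction of unordered response pairs whose edge weight exceeds threshold."""
    upper = np.triu_indices(weights.shape[0], k=1)
    return float((weights[upper] > threshold).mean())

def weight_reward_gap_correlation(weights, rewards):
    """Pearson correlation between edge weight and absolute reward gap per pair.

    A value near -1 suggests the graph mostly mirrors scalar reward distance;
    a value near 0 suggests the edges carry information the scalar does not.
    Returns nan if either quantity is constant across pairs.
    """
    rewards = np.asarray(rewards, dtype=float)
    upper = np.triu_indices(rewards.size, k=1)
    gaps = np.abs(rewards[:, None] - rewards[None, :])[upper]
    return float(np.corrcoef(weights[upper], gaps)[0, 1])
```

Reporting these two numbers averaged over training groups, alongside a sweep of the propagation strength, would cover most of what the comment asks for.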

minor comments (2)

[Abstract] The abstract and introduction repeatedly use the phrase "consistent improvements" without defining consistency (e.g., across seeds, models, or runs).
[Section 4] Implementation details (number of runs, statistical tests, error bars) are absent from the experimental description, making reproducibility and significance assessment difficult.
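One simple way to supply the missing error bars is a paired bootstrap over matched runs (same seed or same evaluation prompt with and without GraphAE). The sketch below is a generic percentile bootstrap, not something described in the paper; `paired_bootstrap_ci` and its parameters are hypothetical names.

```python
import numpy as np

def paired_bootstrap_ci(baseline, treated, n_boot=10_000, alpha=0.05, seed=0):
    """Mean paired difference with a percentile bootstrap confidence interval.

    baseline[i] and treated[i] must come from the same seed or prompt.
    Returns (mean_delta, lower, upper) for a (1 - alpha) interval.
    """
    baseline = np.asarray(baseline, dtype=float)
    treated = np.asarray(treated, dtype=float)
    deltas = treated - baseline
    rng = np.random.default_rng(seed)
    # Resample pair indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, deltas.size, size=(n_boot, deltas.size))
    boot_means = deltas[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(deltas.mean()), float(lower), float(upper)
```

An interval that excludes zero for each benchmark and base algorithm would give the phrase "consistent improvements" an operational meaning.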

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight opportunities to strengthen the experimental controls and methodological details, which we address point by point below. We will incorporate revisions to improve clarity and rigor.

Referee: [Section 4] Experiments (Section 4): The reported benchmark improvements lack any ablation that isolates the contribution of RM hidden-state similarity; there are no controls using random graphs, policy-embedding graphs, or scalar-reward-distance graphs. Without these, it is impossible to determine whether the observed deltas (+6.3 Arena-Hard, etc.) result from richer preference information or from the addition of any propagation operator.

Authors: We agree that the current experiments do not include the suggested control ablations (random graphs, policy-embedding graphs, or scalar-reward-distance graphs). The manuscript reports improvements relative to the base group-based RL algorithms (GRPO, GSPO, RLOO) without the graph propagation step, which establishes the effect of adding the operator. However, this baseline comparison does not isolate whether the gains derive from RM-specific semantic information or from generic propagation. We will add the requested ablations in the revised version to more directly support the claim that RM hidden states provide orthogonal preference information. revision: yes
Referee: [Section 3] Method (Section 3): The graph-construction and propagation procedure is presented without reporting edge-density statistics, the precise similarity metric, or hyperparameter sensitivity; the central claim that hidden states encode information orthogonal to the scalar RM output therefore rests on an untested modeling assumption rather than a controlled demonstration.

Authors: We acknowledge that the manuscript does not report edge-density statistics, the exact similarity metric (e.g., cosine similarity on hidden states), or hyperparameter sensitivity results. These details will be added to Section 3 in the revision, along with sensitivity analysis, to make the procedure fully reproducible and to provide empirical support for the modeling assumption that hidden states capture information beyond the scalar reward. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a self-contained empirical proposal.

The paper introduces GraphAE as a novel graph-propagation technique on RM hidden-state similarities, applied to existing RL algorithms like GRPO. The abstract and described approach contain no equations, derivations, or self-citations that reduce the claimed advantage estimates to fitted inputs or prior results by construction. Benchmark gains are presented as empirical outcomes, not tautological predictions. The central claim rests on an untested modeling assumption rather than a definitional loop, qualifying as a standard non-circular method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that RM hidden states contain usable preference structure beyond the scalar reward; no free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption: Reward model hidden states encode richer semantic and preference information than the scalar output alone.
Explicitly stated as motivation in the abstract.

invented entities (1)

GraphAE · no independent evidence
purpose: Advantage estimation via graph propagation on RM hidden-state similarities.
New method introduced in the abstract.

pith-pipeline@v0.9.1-grok · 5751 in / 1186 out tokens · 17332 ms · 2026-06-27T13:37:36.920296+00:00 · methodology

Reference graph

Works this paper leans on

41 extracted references · 13 linked inside Pith

[1] Deep reinforcement learning from human preferences. Proceedings of NeurIPS.
[2] Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
[3] Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
[4] Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[5] DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
[6] Skywork-Reward-V2: Scaling preference data curation via human-AI synergy. arXiv preprint arXiv:2507.01352.
[7] UltraFeedback: Boosting language models with scaled AI feedback. Proceedings of ICML.
[8] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[9] Semi-supervised learning using Gaussian fields and harmonic functions. Proceedings of ICML.
[10] Kernels and regularization on graphs. Proceedings of COLT.
[11] Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
[12] Back to basics: Revisiting REINFORCE-style optimization for learning from human feedback in LLMs. Proceedings of ACL.
[13] From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder pipeline. arXiv preprint arXiv:2406.11939.
[14] Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475.
[15] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Proceedings of NeurIPS.
[16] Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952.
[17] Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. 2023.
[18] Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment. arXiv preprint arXiv:2310.00212.
[19] Reinforcement learning: An introduction. 1998.
[20] Direct preference optimization: Your language model is secretly a reward model. Proceedings of NeurIPS.
[21] SimPO: Simple preference optimization with a reference-free reward. Proceedings of NeurIPS.
[22] KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
[23] ORPO: Monolithic preference optimization without reference model. Proceedings of EMNLP.
[24] Is DPO superior to PPO for LLM alignment? A comprehensive study. arXiv preprint arXiv:2404.10719.
[25] Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.
[26] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[27] Advancing LLM reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078.
[28] Policy gradient methods for reinforcement learning with function approximation. Proceedings of NeurIPS.
[29] Training language models to follow instructions with human feedback. Proceedings of NeurIPS.
[30] DialoGPT: Large-scale generative pre-training for conversational response generation. Proceedings of ACL.
[31] Chain-of-thought prompting elicits reasoning in large language models. Proceedings of NeurIPS.
[32] Reward shaping to mitigate reward hacking in RLHF. arXiv preprint arXiv:2502.18770.
[33] Reward design with language models. arXiv preprint arXiv:2303.00001.
[34] Policy filtration for RLHF to mitigate noise in reward models. Proceedings of ICML.
[35] Interpretable reward model via sparse autoencoder. Proceedings of AAAI.
[36] Wang, Haoxiang; Xiong, Wei; Xie, Tengyang; Zhao, Han; Zhang, Tong. Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. Findings of EMNLP, 2024.
[37] Secrets of RLHF in large language models part II: Reward modeling. arXiv preprint arXiv:2401.06080.
[38] Regularizing hidden states enables learning generalizable reward model for LLMs. Proceedings of NeurIPS.
[39] LRHP: Learning Representations for Human Preferences via Preference Pairs. arXiv preprint arXiv:2410.04503.
[40] DAPO: An open-source LLM reinforcement learning system at scale. Proceedings of NeurIPS.
[41] Understanding R1-Zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783.
