Optimal Transport for LLM Reward Modeling from Noisy Preference
Pith reviewed 2026-05-08 14:07 UTC · model grok-4.3
The pith
SelectiveRM uses optimal transport and partial mass relaxation to exclude noisy preferences while optimizing a tighter bound on clean risk.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grounded in optimal transport, SelectiveRM first computes a Joint Consistency Discrepancy that measures how well model predictions match the observed preferences, then applies a Mass Relaxation mechanism via partial transport to exclude samples that contradict semantic consistency. The resulting objective provably minimizes a tighter upper bound on the unobserved clean risk than objectives that enforce full mass conservation.
What carries the argument
Joint Consistency Discrepancy under optimal transport together with Mass Relaxation via partial transport, which autonomously identifies and removes inconsistent noisy samples.
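The selection mechanism can be sketched as a generic partial optimal transport program. This is a minimal illustration under assumed notation, not the paper's actual objective: the cost matrix standing in for the Joint Consistency Discrepancy and the mass budget `m` are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def partial_ot_plan(C, a, b, m):
    """Partial OT: min <C, P> s.t. P@1 <= a, P.T@1 <= b, P >= 0, sum(P) = m.

    Relaxing mass conservation (m < total mass) lets the plan leave
    expensive, inconsistent samples untransported rather than being
    forced to fit them, which is the intuition behind Mass Relaxation.
    """
    n, k = C.shape
    A_ub = np.zeros((n + k, n * k))
    for i in range(n):                  # row marginals: sum_j P[i, j] <= a[i]
        A_ub[i, i * k:(i + 1) * k] = 1.0
    for j in range(k):                  # column marginals: sum_i P[i, j] <= b[j]
        A_ub[n + j, j::k] = 1.0
    b_ub = np.concatenate([a, b])
    A_eq = np.ones((1, n * k))          # total transported mass equals m
    res = linprog(C.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=[m], bounds=(0, None), method="highs")
    return res.x.reshape(n, k)

# Toy check: five prediction/label pairs; pair 4 is made inconsistent
# (large matching cost), so with budget m = 0.8 the plan drops it.
C = np.full((5, 5), 1.0)
np.fill_diagonal(C, [0.1, 0.1, 0.1, 0.1, 5.0])
a = np.full(5, 0.2)
b = np.full(5, 0.2)
plan = partial_ot_plan(C, a, b, m=0.8)
excluded = np.flatnonzero(plan.sum(axis=1) < 1e-8)  # -> [4]
```

Rows that receive (near-)zero transported mass are the "autonomously excluded" samples; in the toy problem only the high-cost pair is left out.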
If this is right
- Reward models trained this way overfit less to contradictory preferences than standard or homogeneous-noise baselines.
- The learned rewards produce RLHF policies that align more closely with the underlying clean preference distribution.
- The method avoids the need for separate data-cleaning stages while still handling complex linguistic noise.
- Performance gains hold across multiple benchmarks without requiring changes to the downstream RLHF pipeline.
Where Pith is reading between the lines
- The same partial-transport relaxation could be applied to other noisy-label settings where consistency between predictions and labels can be quantified by a transport cost.
- If the consistency discrepancy fails to flag certain real-world noise patterns, the method may still fit those errors.
- Combining SelectiveRM with active learning that requests new labels on the excluded samples could further reduce overall noise.
Load-bearing premise
Linguistic preference noise is captured by inconsistency under a joint distribution that optimal transport can separate, and partial transport removes only the noisy outliers without discarding unusual but valid preferences.
What would settle it
Inject controlled semantic-inconsistent noise into a clean preference dataset and check whether SelectiveRM's partial transport step removes exactly those samples while the learned reward model matches the performance obtained on the fully clean version.
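That protocol is straightforward to harness. A sketch of the evaluation loop follows, with the method's exclusion set left as a stand-in (an oracle here; all names in this snippet are hypothetical, and the real test would plug in SelectiveRM's excluded indices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Clean binary preferences (1 = first response preferred), then inject
# controlled noise by flipping 10% of labels at known positions.
clean = rng.integers(0, 2, size=n)
noise_idx = rng.choice(n, size=n // 10, replace=False)
noisy = clean.copy()
noisy[noise_idx] ^= 1

def exclusion_quality(excluded_idx, noise_idx, n):
    """Precision/recall of an exclusion set against the injected noise."""
    is_noise = np.zeros(n, dtype=bool)
    is_noise[noise_idx] = True
    is_excluded = np.zeros(n, dtype=bool)
    is_excluded[excluded_idx] = True
    tp = int((is_excluded & is_noise).sum())
    precision = tp / max(int(is_excluded.sum()), 1)
    recall = tp / int(is_noise.sum())
    return precision, recall

# Stand-in for the method's excluded set: a perfect oracle, so both
# metrics come out 1.0; deviations below 1.0 for the real method would
# quantify over- or under-exclusion.
p, r = exclusion_quality(noise_idx, noise_idx, n)
```

The second half of the test — whether the reward model trained on the retained samples matches the fully-clean baseline — is a standard benchmark comparison and is not sketched here.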
Original abstract
Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes SelectiveRM, an optimal transport framework for training LLM reward models on noisy preference data. It introduces a Joint Consistency Discrepancy to align model predictions with observed preferences and a mass-relaxation mechanism based on partial optimal transport to exclude samples that violate semantic consistency. The central claim is that this procedure optimizes a tighter upper bound on the unobserved clean risk than standard risk minimization or homogeneous-noise baselines, with experiments showing gains across diverse benchmarks.
Significance. If the theoretical bound is non-vacuous and the partial transport reliably isolates only noisy mass without discarding valid but atypical preferences, the work would offer a principled advance over existing denoising methods in RLHF. The use of optimal transport to handle heterogeneous linguistic noise is a novel angle. The attempt to derive a tighter clean-risk bound provides a stronger foundation than purely empirical denoising approaches.
Major comments (1)
- [Abstract] The claim that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk is the paper's central theoretical contribution. Because the bound is defined in terms of the transport plan and relaxation parameter fitted directly to the noisy data, it is unclear whether the resulting quantity remains a valid and strictly tighter upper bound on the clean risk, or whether the relaxation term introduces an uncontrolled bias whose sign depends on the noise distribution. The full derivation (including how the Joint Consistency Discrepancy interacts with the partial transport) is required to resolve this.
Minor comments (1)
- The experimental results would be strengthened by reporting standard deviations or error bars across multiple runs and by clarifying how the relaxation parameter and consistency threshold were selected (e.g., via validation or cross-validation).
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and for identifying the need for greater clarity on our central theoretical claim. We address the concern point-by-point below and will revise the manuscript to include an expanded, self-contained derivation.
Point-by-point responses
-
Referee: [Abstract] The claim that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk is the paper's central theoretical contribution. Because the bound is defined in terms of the transport plan and relaxation parameter fitted directly to the noisy data, it is unclear whether the resulting quantity remains a valid and strictly tighter upper bound on the clean risk, or whether the relaxation term introduces an uncontrolled bias whose sign depends on the noise distribution. The full derivation (including how the Joint Consistency Discrepancy interacts with the partial transport) is required to resolve this.
Authors: We appreciate this observation. The bound is derived in Theorem 3.1 (Section 3.2), which shows that the SelectiveRM objective equals the clean risk plus a non-negative term involving the joint consistency discrepancy minus a controlled relaxation penalty. The partial-transport relaxation parameter is not chosen arbitrarily; it is the minimal value that satisfies the consistency constraint, ensuring the discarded mass corresponds only to samples whose preferences violate the semantic embedding distance. Consequently, the relaxation term cannot increase the bound beyond the standard empirical risk, and it is strictly smaller whenever heterogeneous noise is present. The interaction between the joint consistency discrepancy and partial transport is formalized by showing that the optimal transport plan under partial mass conservation yields a feasible dual variable whose value is bounded by the clean-label discrepancy. To eliminate any ambiguity about the validity or sign of the bias, we will add a complete, self-contained proof in the appendix that walks through each step of the derivation, including the dual formulation and the non-negativity argument under the partial-transport constraint.
Revision: yes
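Read literally, the decomposition the rebuttal describes would take roughly the following shape. All notation here is invented for illustration, since this page does not reproduce the paper's equations:

```latex
\mathcal{L}_{\mathrm{SelectiveRM}}(\theta)
  \;=\; \underbrace{\mathcal{R}_{\mathrm{clean}}(\theta)}_{\text{target risk}}
  \;+\; \underbrace{D_{\mathrm{JC}}\!\left(P_\theta,\,\tilde{P}\right)}_{\ge\,0}
  \;-\; \underbrace{\Omega(\varepsilon^{\star})}_{\text{relaxation penalty}},
\qquad
\varepsilon^{\star} \;=\; \min\bigl\{\varepsilon : \text{consistency constraint holds at } \varepsilon\bigr\}.
```

The tightness claim would then read $\mathcal{L}_{\mathrm{SelectiveRM}}(\theta) \le \hat{\mathcal{R}}_{\mathrm{noisy}}(\theta)$, strict whenever heterogeneous noise is present; verifying it rests on the non-negativity argument the authors promise for the appendix.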
Circularity Check
No significant circularity: as presented, the theoretical bound claim remains independent of the fitted transport plan.
Full rationale
The abstract states that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk, but no equations, definitions, or derivation steps are supplied that would allow the bound to reduce to the fitted OT plan or the partial relaxation by construction. The Joint Consistency Discrepancy and mass-relaxation mechanism are introduced as modeling choices rather than self-referential definitions, and no self-citation chain or uniqueness theorem is invoked to force the result. Without an explicit equation-for-equation equivalence, or a fitted parameter renamed as a prediction, the derivation chain does not collapse to its inputs.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: The optimal transport distance between the model's prediction distribution and the preference-label distribution can define a consistency discrepancy that aligns them.
- Domain assumption: Partial transport with mass relaxation can exclude noisy samples while preserving the clean-risk bound.
Reference graph
Works this paper leans on
-
[1]
Instance-dependent label-noise learning with manifold-regularized transition matrix estimation
Cheng, D., Liu, T., Ning, Y., Wang, N., Han, B., Niu, G., Gao, X., and Sugiyama, M. Instance-dependent label-noise learning with manifold-regularized transition matrix estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16630–16639, 2022.
2022
-
[2]
Class-dependent label-noise learning with cycle-consistency regularization
Cheng, D., Ning, Y., Wang, N., Gao, X., Yang, H., Du, Y., Han, B., and Liu, T. Class-dependent label-noise learning with cycle-consistency regularization. Advances in Neural Information Processing Systems, 35:11104–11116, 2022.
2022
-
[4]
Joint distribution optimal transportation for domain adaptation
Courty, N., Flamary, R., Habrard, A., and Rakotomamonjy, A. Joint distribution optimal transportation for domain adaptation. Advances in neural information processing systems, 30, 2017
2017
-
[8]
Unbalanced minibatch optimal transport; applications to domain adaptation
Fatras, K., Séjourné, T., Flamary, R., and Courty, N. Unbalanced minibatch optimal transport; applications to domain adaptation. In International Conference on Machine Learning, pp. 3186–3197. PMLR, 2021.
2021
-
[10]
A survey on llm-as-a-judge
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., et al. A survey on llm-as-a-judge. The Innovation, 2024
2024
-
[12]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081):633–638, 2025.
2025
-
[13]
Co-teaching: Robust training of deep neural networks with extremely noisy labels
Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. Advances in Neural Information Processing Systems, 31, 2018.
2018
-
[15]
A survey on the role of crowds in combating online misinformation: Annotators, evaluators, and creators
He, B., Hu, Y., Lee, Y.-C., Oh, S., Verma, G., and Kumar, S. A survey on the role of crowds in combating online misinformation: Annotators, evaluators, and creators. ACM Transactions on Knowledge Discovery from Data, 19(1):1–30, 2025.
2025
-
[16]
Pku-saferlhf: Towards multi-level safety alignment for llms with human preference
Ji, J., Hong, D., Zhang, B., Chen, B., Dai, J., Zheng, B., Qiu, T. A., Zhou, J., Wang, K., Li, B., et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 31983–32016, 2025.
2025
-
[17]
Adam: A method for stochastic optimization
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proc. Int. Conf. Learn. Represent., pp. 1–9, 2015.
2015
-
[18]
Instance-dependent label distribution estimation for learning with label noise
Liao, Z., Hu, S., Xie, Y., and Xia, Y. Instance-dependent label distribution estimation for learning with label noise. International Journal of Computer Vision, 133(5):2568–2580, 2025.
2025
-
[19]
Learning the latent causal structure for modeling label noise
Lin, Y., Yao, Y., and Liu, T. Learning the latent causal structure for modeling label noise. Advances in Neural Information Processing Systems, 37:120549–120577, 2024.
2024
-
[22]
Curvature-balanced feature manifold learning for long-tailed classification
Ma, Y., Jiao, L., Liu, F., Yang, S., Liu, X., and Li, L. Curvature-balanced feature manifold learning for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15824–15835, 2023.
2023
-
[24]
Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling
Miao, Y., Zhang, S., Ding, L., Bao, R., Zhang, L., and Tao, D. Inform: Mitigating reward hacking in rlhf via information-theoretic reward modeling. Advances in Neural Information Processing Systems, 37:134387–134429, 2024.
2024
-
[25]
Confident learning: Estimating uncertainty in dataset labels
Northcutt, C., Jiang, L., and Chuang, I. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411, 2021.
2021
-
[26]
Training language models to follow instructions with human feedback
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
2022
-
[27]
Making deep neural networks robust to label noise: A loss correction approach
Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1944–1952, 2017.
2017
-
[28]
Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...
2025
-
[30]
A mathematical theory of communication
Shannon, C. E. A mathematical theory of communication. The Bell System Technical Journal, 27(3):379–423, 1948.
1948
-
[31]
"Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models
Shen, X., Chen, Z., Backes, M., Shen, Y., and Zhang, Y. "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security, pp. 1671–1685, 2024.
2024
-
[32]
Defining and characterizing reward gaming
Skalse, J., Howe, N., Krasheninnikov, D., and Krueger, D. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.
2022
-
[33]
Learning from noisy labels with deep neural networks: A survey
Song, H., Kim, M., Park, D., Shin, Y., and Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. IEEE Trans. Neural Netw. Learn. Syst., 34(11):8135–8153, 2022.
2022
-
[35]
Optimal transport for treatment effect estimation
Wang, H., Fan, J., Chen, Z., Li, H., Liu, W., Liu, T., Dai, Q., Wang, Y., Dong, Z., and Tang, R. Optimal transport for treatment effect estimation. Advances in Neural Information Processing Systems, 36:5404–5418, 2023.
2023
-
[36]
Unbiased recommender learning from implicit feedback via weakly supervised learning
Wang, H., Chen, Z., Wang, H., Tan, Y., Pan, L., Liu, T., Chen, X., Li, H., and Lin, Z. Unbiased recommender learning from implicit feedback via weakly supervised learning. In Proc. Int. Conf. Mach. Learn., 2025.
2025
-
[37]
Optimal transport for time series imputation
Wang, H., Li, H., Chen, X., Gong, M., Chen, Z., et al. Optimal transport for time series imputation. In Proc. Int. Conf. Learn. Represent., 2025.
2025
-
[39]
ε-softmax: Approximating one-hot vectors for mitigating label noise
Wang, J., Zhou, X., Zhai, D., Jiang, J., Ji, X., and Liu, X. ε-softmax: Approximating one-hot vectors for mitigating label noise. In The Thirty-Eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=vjsd8Bcipv
2024
-
[40]
Helpsteer: Multi-attribute helpfulness dataset for steerlm
Wang, Z., Dong, Y., Zeng, J., Adams, V., Sreedhar, M. N., Egert, D., Delalleau, O., Scowcroft, J., Kant, N., Swope, A., et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...
2024
-
[41]
To smooth or not? when label smoothing meets noisy labels
Wei, J., Liu, H., Liu, T., Niu, G., Sugiyama, M., and Liu, Y. To smooth or not? when label smoothing meets noisy labels. In Proc. Int. Conf. Mach. Learn., 2022
2022
-
[42]
Revisiting consistency regularization for deep partial label learning
Wu, D.-D., Wang, D.-B., and Zhang, M.-L. Revisiting consistency regularization for deep partial label learning. In International Conference on Machine Learning, pp. 24212–24225. PMLR, 2022.
2022
-
[43]
Robust early-learning: Hindering the memorization of noisy labels
Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. In ICLR, 2021
2021
-
[44]
Sample selection with uncertainty of losses for learning with noisy labels
Xia, X., Liu, T., Han, B., Gong, M., Yu, J., Niu, G., and Sugiyama, M. Sample selection with uncertainty of losses for learning with noisy labels. In International Conference on Learning Representations, 2022
2022
-
[45]
A holistic view of label noise transition matrix in deep learning and beyond
Yong, L., Pi, R., Zhang, W., Xia, X., Gao, J., Zhou, X., Liu, T., and Han, B. A holistic view of label noise transition matrix in deep learning and beyond. In The Eleventh International Conference on Learning Representations, 2022
2022
-
[46]
Early stopping against label noise without validation data
Yuan, S., Feng, L., and Liu, T. Early stopping against label noise without validation data. In Proc. Int. Conf. Learn. Represent., 2025
2025
-
[47]
Badlabel: A robust perspective on evaluating and enhancing label-noise learning
Zhang, J., Song, B., Wang, H., Han, B., Liu, T., Liu, L., and Sugiyama, M. Badlabel: A robust perspective on evaluating and enhancing label-noise learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(6):4398–4409, 2024.
2024
-
[48]
Clusterability as an alternative to anchor points when learning with noisy labels
Zhu, Z., Song, Y., and Liu, Y. Clusterability as an alternative to anchor points when learning with noisy labels. In International Conference on Machine Learning, pp. 12912–12923. PMLR, 2021.
2021
-
[49]
Hao Wang and Zhichao Chen and Zhaoran Liu and Haozhe Li and Degui Yang and Xinggao Liu and Haoxuan Li. 2025.
2025
-
[50]
Hao Wang and Xinggao Liu and Zhaoran Liu and Haozhe Li and Yilin Liao and Yuxin Huang and Zhichao Chen. IEEE_J_ASE, 2025.
2025
-
[51]
An Accurate and Interpretable Framework for Trustworthy Process Monitoring
-
[52]
Hao Wang and Zhichao Chen and Jiajun Fan and Haoxuan Li and Tianqiao Liu and Weiming Liu and Quanyu Dai and Yichao Wang and Zhenhua Dong and Ruiming Tang.
-
[53]
Wang, Hao and Chen, Zhichao and Liu, Zhaoran and Pan, Licheng and Xu, Hu and Liao, Yilin and Li, Haozhe and Liu, Xinggao. IEEE_J_IINF, 2024.
2024
-
[54]
Hao Wang and Licheng Pan and Yuan Shen and Zhichao Chen and Degui Yang and Yifei Yang and Sen Zhang and Xinggao Liu and Haoxuan Li and Dacheng Tao.
-
[55]
Time-o1: Time-Series Forecasting Needs Transformed Label Alignment
-
[56]
Hao Wang and Zhengnan Li and Haoxuan Li and Xu Chen and Mingming Gong and Bin Chen and Zhichao Chen.
-
[57]
Debiased Recommendation via Wasserstein Causal Balancing
-
[58]
Proximity Matters: Local Proximity Enhanced Balancing for Treatment Effect Estimation
-
[60]
Robust Missing Value Imputation With Proximal Optimal Transport for Low-Quality IIoT Data
-
[61]
Pan, Licheng and Wang, Hao and Chen, Zhichao and Huang, Yuxin and Liu, Zhaoran and He, Qunsan and Liu, Xinggao. IEEE_J_ASE, 2025.
2025
-
[62]
Controllable Mixture-of-Experts for Multivariate Soft Sensors
-
[63]
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection
-
[64]
A novel locality-sensitive hashing relational graph matching network for semantic textual similarity measurement
-
[65]
A Novel Pipelined End-to-End Relation Extraction Framework with Entity Mentions and Contextual Semantic Representation
-
[66]
Denoising Diffusion Straightforward Models for Energy Conversion Monitoring Data Imputation. 2024.
2024
-
[67]
Relaxing the Accurate Imputation Assumption in Doubly Robust Learning for Debiased Collaborative Filtering. 2024.
2024
-
[68]
Debiased Recommendation with Noisy Feedback
-
[69]
Removing hidden confounding in recommendation: a unified multi-task learning approach
-
[70]
Zhichao Chen and Leilei Ding and Zhixuan Chu and Yucheng Qi and Jianmin Huang and Hao Wang.
-
[71]
Zhichao Chen and Leilei Ding and Jianmin Huang and Zhixuan Chu and Qingyang Dai and Hao Wang.
-
[72]
Chen, Zhichao and Wang, Hao and Chen, Guofei and Ma, Yiran and Yao, Le and Ge, Zhiqiang and Song, Zhihuan. IEEE_J_IINF, 2024.
2024
-
[73]
Rethinking the diffusion models for missing data imputation: A gradient flow perspective
-
[74]
Chen, Zhichao and Wang, Hao and Song, Zhihuan and Ge, Zhiqiang. IEEE_J_SMCS, 2024.
2024
-
[76]
OLinear: A Linear Model for Time Series Forecasting in Orthogonally Transformed Domain
-
[77]
Towards Accurate Time Series Forecasting via Implicit Decoding
-
[79]
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities
Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
-
[80]
Gpt-4 technical report
Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
-
[82]
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts. Findings of the Association for Computational Linguistics: EMNLP 2024.
2024
-
[83]
Rank analysis of incomplete block designs: I. The method of paired comparisons
Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952.
1952
-
[84]
Rewardbench: Evaluating reward models for language modeling
Rewardbench: Evaluating reward models for language modeling. Findings of the Association for Computational Linguistics: NAACL 2025.
2025
-
[85]
Bias and fairness in large language models: A survey
Bias and fairness in large language models: A survey. Computational Linguistics, 2024.
2024
-
[86]
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal
User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
2025
-
[87]
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences
On Symmetric Losses for Robust Policy Optimization with Noisy Preferences. arXiv preprint arXiv:2505.24709.
-
[88]
Proximal Policy Optimization Algorithms
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
-
[89]
Group Sequence Policy Optimization
Group sequence policy optimization. arXiv preprint arXiv:2507.18071.
-
[90]
Ma, Jing and Wan, Mengting and Yang, Longqi and Li, Jundong and Hecht, Brent and Teevan, Jaime. P_SIGKDD, 2022.
2022
-
[91]
Autodebias: Learning to debias for recommendation
-
[92]
Dual Unbiased Recommender Learning for Implicit Feedback
Dual Unbiased Recommender Learning for Implicit Feedback. P_SIGIR.