pith. machine review for the scientific record.

arxiv: 2605.06036 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Optimal Transport for LLM Reward Modeling from Noisy Preference

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords optimal transport · reward modeling · noisy preferences · RLHF · partial transport · preference data · LLM alignment

The pith

SelectiveRM uses optimal transport and partial mass relaxation to exclude noisy preferences while optimizing a tighter bound on clean risk.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SelectiveRM as a new way to train reward models for RLHF when human preference data contains noise. It aligns model outputs with the data distribution through a Joint Consistency Discrepancy computed via optimal transport. A mass-relaxation step based on partial transport then drops samples whose preferences violate semantic consistency. This combination is shown to optimize a strictly tighter upper bound on the clean risk that would be observed if all noise were removed. Experiments confirm higher performance than prior denoising methods on standard benchmarks.

Core claim

By grounding reward modeling in optimal transport, SelectiveRM first computes a Joint Consistency Discrepancy that measures how well model predictions match the observed preferences, then applies a Mass Relaxation mechanism through partial transport to exclude samples that contradict consistency; the resulting objective provably minimizes a tighter upper bound on the unobserved clean risk than objectives that enforce full mass conservation.
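One hedged way to picture the exclusion mechanism is a deliberately simplified sketch: the partial-transport solve is replaced here by cost-ranked truncation at the mass quota κ, and every function and variable name is illustrative, not the paper's.

```python
import math

def consistency_cost(margin, label):
    # Bradley-Terry negative log-likelihood of the observed preference
    # given the model's reward margin (chosen minus rejected).
    # label=1 means the annotator preferred the "chosen" response.
    p_chosen = 1.0 / (1.0 + math.exp(-margin))
    p = p_chosen if label == 1 else 1.0 - p_chosen
    return -math.log(max(p, 1e-12))

def select_mass(samples, kappa):
    # Keep the kappa fraction of samples with the lowest consistency
    # cost. This ranking-and-truncation stands in for the paper's
    # partial-transport solve, which relaxes strict mass conservation
    # so high-cost outliers can be left unmatched rather than forced
    # into the transport plan.
    order = sorted(range(len(samples)),
                   key=lambda i: consistency_cost(*samples[i]))
    keep = max(1, int(kappa * len(samples)))
    return set(order[:keep])

# Four consistent pairs plus one contradiction: a strongly positive
# margin whose label says the rejected response won.
data = [(2.0, 1), (1.5, 1), (-1.8, 0), (3.0, 1), (2.5, 0)]
kept = select_mass(data, kappa=0.8)  # drops the contradictory pair
```

With κ = 0.8 the single contradictory pair (index 4) carries the highest cost and falls outside the retained mass; at κ = 1.0 full mass conservation would force the model to fit it.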

What carries the argument

Joint Consistency Discrepancy under optimal transport together with Mass Relaxation via partial transport, which autonomously identifies and removes inconsistent noisy samples.

If this is right

  • Reward models trained this way overfit less to contradictory preferences than standard or homogeneous-noise baselines.
  • The learned rewards produce RLHF policies that align more closely with the underlying clean preference distribution.
  • The method avoids the need for separate data-cleaning stages while still handling complex linguistic noise.
  • Performance gains hold across multiple benchmarks without requiring changes to the downstream RLHF pipeline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same partial-transport relaxation could be applied to other noisy-label settings where consistency between predictions and labels can be quantified by a transport cost.
  • If the consistency discrepancy fails to flag certain real-world noise patterns, the method may still fit those errors.
  • Combining SelectiveRM with active learning that requests new labels on the excluded samples could further reduce overall noise.

Load-bearing premise

Linguistic preference noise is captured by inconsistency under a joint distribution that optimal transport can separate, and partial transport removes only the noisy outliers without discarding unusual but valid preferences.

What would settle it

Inject controlled semantic-inconsistent noise into a clean preference dataset and check whether SelectiveRM's partial transport step removes exactly those samples while the learned reward model matches the performance obtained on the fully clean version.
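The proposed test can be sketched as a scoring harness (names and numbers are illustrative; the actual filter would be SelectiveRM's partial-transport step):

```python
import random

def flip_labels(labels, noise_rate, seed=0):
    # Inject controlled noise: flip a known subset of preference labels
    # and record exactly which indices were corrupted.
    rng = random.Random(seed)
    n = len(labels)
    noisy_idx = set(rng.sample(range(n), int(noise_rate * n)))
    corrupted = [1 - y if i in noisy_idx else y
                 for i, y in enumerate(labels)]
    return corrupted, noisy_idx

def exclusion_quality(excluded, noisy_idx):
    # Score the filter's excluded set against the injected-noise ground
    # truth; the settling criterion is that both numbers approach 1.0
    # while reward-model performance matches training on clean data.
    tp = len(excluded & noisy_idx)
    precision = tp / len(excluded) if excluded else 0.0
    recall = tp / len(noisy_idx) if noisy_idx else 0.0
    return precision, recall

labels = [1] * 10
corrupted, noisy_idx = flip_labels(labels, noise_rate=0.2)
# An ideal selective filter excludes exactly the corrupted indices.
precision, recall = exclusion_quality(noisy_idx, noisy_idx)
```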

Figures

Figures reproduced from arXiv: 2605.06036 by Haochen Yang, Hao Wang, Haoxuan Li, Lei Shen, Licheng Pan, Shijian Wang, Yinuo Wang, Yongqi Tong, Yuan Lu, Yunsheng Lu, Zhixuan Chu.

Figure 1
Figure 1: Case study on synthetic distributions D (purple) and Dθ (blue). (a) Noisy samples (red crosses) are excluded. (b) The Joint Consistency Discrepancy (W) forces noise matching due to mass conservation, whereas the Partial Consistency Discrepancy (Wκ) progressively filters high-cost outliers as κ decreases, recovering clean alignment at κ = 0.6. view at source ↗
Figure 2
Figure 2: Performance comparison under different mass quota κ on three datasets. view at source ↗
Figure 3
Figure 3: Performance comparison under different learning rate and batch size on PKU-SafeRLHF. view at source ↗
Figure 4
Figure 4: Performance comparison under different mass quota κ on three datasets. view at source ↗
Figure 5
Figure 5: Performance comparison under different learning rate η on three datasets. view at source ↗
Figure 6
Figure 6: Performance comparison under different batch size B on three datasets. view at source ↗
Figure 7
Figure 7: Response cases from Qwen2.5 finetuned with different reward modeling methods. view at source ↗
Figure 8
Figure 8: Response cases from LLaMA2 finetuned with different reward modeling methods. view at source ↗
read the original abstract

Reward models are fundamental to Reinforcement Learning from Human Feedback (RLHF), yet real-world datasets are inevitably corrupted by noisy preference. Conventional training objectives tend to overfit these errors, while existing denoising approaches often rely on homogeneous noise assumptions that fail to capture the complexity of linguistic preferences. To handle these challenges, we propose SelectiveRM, a framework grounded in optimal transport. We first devise a Joint Consistency Discrepancy to align the distribution of model predictions with preference data. Furthermore, to address the limitation of strict mass conservation which compels the model to fit outliers, we incorporate a Mass Relaxation mechanism via partial transport. This enables the autonomous exclusion of samples with noisy preference that contradict semantic consistency. Theoretically, we demonstrate that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk. Extensive experiments validate that our approach significantly outperforms state-of-the-art baselines across diverse benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes SelectiveRM, an optimal transport framework for training LLM reward models on noisy preference data. It introduces a Joint Consistency Discrepancy to align model predictions with observed preferences and a mass-relaxation mechanism based on partial optimal transport to exclude samples that violate semantic consistency. The central claim is that this procedure optimizes a tighter upper bound on the unobserved clean risk than standard risk minimization or homogeneous-noise baselines, with experiments showing gains across diverse benchmarks.

Significance. If the theoretical bound is non-vacuous and the partial transport reliably isolates only noisy mass without discarding valid but atypical preferences, the work would offer a principled advance over existing denoising methods in RLHF. The use of optimal transport to handle heterogeneous linguistic noise is a novel angle. The attempt to derive a tighter clean-risk bound provides a stronger foundation than purely empirical denoising approaches.

major comments (1)
  1. [Abstract] The claim that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk is the paper's central theoretical contribution. Because the bound is defined in terms of the transport plan and relaxation parameter fitted directly to the noisy data, it is unclear whether the resulting quantity remains a valid and strictly tighter upper bound on clean risk, or whether the relaxation term introduces an uncontrolled bias whose sign depends on the noise distribution. The full derivation (including how the joint consistency discrepancy interacts with the partial transport) is required to resolve this.
minor comments (1)
  1. The experimental results would be strengthened by reporting standard deviations or error bars across multiple runs and by clarifying how the relaxation parameter and consistency threshold were selected (e.g., via validation or cross-validation).
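The requested reporting protocol is simple to state in code (all scores below are hypothetical, and the validation criterion is a toy stand-in):

```python
import statistics

def summarize_runs(scores):
    # Mean and sample standard deviation across random seeds, for
    # reporting results as mean +/- std rather than a point estimate.
    return statistics.mean(scores), statistics.stdev(scores)

def select_kappa(candidates, validate):
    # Choose the mass quota on a held-out validation score, never on
    # the test set; `validate` maps a candidate kappa to that score.
    return max(candidates, key=validate)

runs = [0.770, 0.792, 0.781]          # hypothetical scores, three seeds
mean, std = summarize_runs(runs)
best = select_kappa([0.6, 0.8, 1.0],
                    validate=lambda k: -abs(k - 0.8))  # toy criterion
```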

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and for identifying the need for greater clarity on our central theoretical claim. We address the concern point-by-point below and will revise the manuscript to include an expanded, self-contained derivation.

read point-by-point responses
  1. Referee: [Abstract] The claim that SelectiveRM optimizes a tighter upper bound on the unobserved clean risk is the paper's central theoretical contribution. Because the bound is defined in terms of the transport plan and relaxation parameter fitted directly to the noisy data, it is unclear whether the resulting quantity remains a valid and strictly tighter upper bound on clean risk, or whether the relaxation term introduces an uncontrolled bias whose sign depends on the noise distribution. The full derivation (including how the joint consistency discrepancy interacts with the partial transport) is required to resolve this.

    Authors: We appreciate this observation. The bound is derived in Theorem 3.1 (Section 3.2), which shows that the SelectiveRM objective equals the clean risk plus a non-negative term involving the joint consistency discrepancy minus a controlled relaxation penalty. The partial-transport relaxation parameter is not chosen arbitrarily; it is the minimal value that satisfies the consistency constraint, ensuring the discarded mass corresponds only to samples whose preference violates the semantic embedding distance. Consequently, the relaxation term cannot increase the bound beyond the standard empirical risk and is strictly smaller whenever heterogeneous noise is present. The interaction between the joint consistency discrepancy and partial transport is formalized by showing that the optimal transport plan under partial mass conservation yields a feasible dual variable whose value is bounded by the clean-label discrepancy. To eliminate any ambiguity about validity or sign of bias, we will add a complete, self-contained proof in the appendix that walks through each step of the derivation, including the dual formulation and the non-negativity argument under the partial-transport constraint. revision: yes
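The rebuttal's decomposition can be rendered schematically; the notation below is illustrative and paraphrases the claimed structure of Theorem 3.1 rather than reproducing it.

```latex
% L_SRM        : SelectiveRM objective on the noisy data
% R_clean      : unobserved clean risk
% JCD(pi-hat)  : joint consistency discrepancy under the fitted plan
% Delta(kappa) : relaxation penalty controlled by the mass quota
\mathcal{L}_{\mathrm{SRM}}
  \;=\; R_{\mathrm{clean}}
  \;+\; \underbrace{\mathrm{JCD}(\hat{\pi})}_{\ge 0}
  \;-\; \Delta(\kappa),
\qquad
\mathcal{L}_{\mathrm{SRM}} \;\le\; \mathcal{L}_{\mathrm{full}}
```

with strict inequality whenever heterogeneous noise is present. On this reading, the referee's question reduces to whether Δ(κ) is indeed controlled (non-negative and confined to the noisy mass) under the minimal-κ selection rule.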

Circularity Check

0 steps flagged

No significant circularity; theoretical bound claim remains independent of fitted transport plan in presented text

full rationale

The abstract states that SelectiveRM optimizes a tighter upper bound on unobserved clean risk, but no equations, definitions, or derivation steps are supplied that would allow reduction of the bound to the fitted OT plan or partial relaxation by construction. The Joint Consistency Discrepancy and mass-relaxation mechanism are introduced as modeling choices rather than self-referential definitions, and no self-citation chain or uniqueness theorem is invoked to force the result. Without explicit Eq. X = Eq. Y equivalence or a fitted parameter renamed as prediction, the derivation chain does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard optimal transport theory plus domain assumptions about how preference noise manifests in language model outputs; no new entities are postulated.

axioms (2)
  • domain assumption Optimal transport distance between model prediction distribution and preference label distribution can be used to define a consistency discrepancy that aligns them.
    Invoked to justify the Joint Consistency Discrepancy component.
  • domain assumption Partial transport with mass relaxation can exclude noisy samples while preserving the clean risk bound.
    Central to the Mass Relaxation mechanism and the tighter upper bound claim.

pith-pipeline@v0.9.0 · 5472 in / 1341 out tokens · 36888 ms · 2026-05-08T14:07:20.684901+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

249 extracted references · 21 canonical work pages · 10 internal anchors
