pith. machine review for the scientific record.

arxiv: 2604.03993 · v1 · submitted 2026-04-05 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Can LLMs Learn to Reason Robustly under Noisy Supervision?

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 16:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords noisy labels · reinforcement learning · RLVR · reasoning models · label refinement · mathematical reasoning · robust training

The pith

Online Label Refinement corrects noisy labels in RLVR by tracking the slope of rollout pass rates and the consistency of majority answers, enabling self-correction during reasoning-model training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes how noisy labels affect reinforcement learning with verifiable rewards when training LLMs to reason. It separates inactive noise, which wastes data, from active noise, which gets reinforced and skews the model toward incorrect distributions. Experiments reveal an Early Correctness Coherence phenomenon in which accuracy on clean and noisy samples rises together early in training before diverging. Motivated by this, the authors introduce Online Label Refinement (OLR), which replaces suspect labels with majority-voted answers only when the rollout pass rate of the majority answer is rising and that answer has stayed consistent across updates. Across noise ratios from 0.1 to 0.9, this yields steady accuracy gains on math benchmarks and out-of-distribution tasks.

Core claim

OLR progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates. This enables gradual self-correction as the policy improves, delivering average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations across noise ratios from 0.1 to 0.9.

What carries the argument

Online Label Refinement (OLR), a progressive correction step that replaces labels with majority-voted answers only when the majority answer's rollout pass rate shows a positive slope and its historical consistency across updates is stable.
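A minimal sketch of what such a trigger could look like, assuming a least-squares slope over a short window of per-update pass rates and an agreement-fraction notion of consistency; the window length, thresholds, and function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def majority_answer(rollout_answers):
    """Most frequent final answer among the current rollouts."""
    return Counter(rollout_answers).most_common(1)[0][0]

def pass_rate(rollout_answers, candidate):
    """Fraction of rollouts whose final answer equals `candidate`."""
    return sum(a == candidate for a in rollout_answers) / len(rollout_answers)

def should_refine(pass_rate_history, majority_history,
                  slope_min=0.0, consistency_min=0.8, window=4):
    """Hypothetical two-condition OLR-style trigger.

    pass_rate_history: per-update pass rates of the current majority answer.
    majority_history:  per-update majority answers for this sample.
    Fires only if (1) the pass rate has a positive fitted slope over the
    window and (2) the majority answer has stayed stable across the window.
    """
    if len(pass_rate_history) < window or len(majority_history) < window:
        return False
    recent = np.asarray(pass_rate_history[-window:], dtype=float)
    slope = np.polyfit(np.arange(window), recent, deg=1)[0]  # least-squares slope
    consistency = sum(a == majority_history[-1]
                      for a in majority_history[-window:]) / window
    return slope > slope_min and consistency >= consistency_min

# Per training update, for each suspect sample: compute m = majority_answer(rollouts),
# append pass_rate(rollouts, m) and m to the histories, and replace the stored
# label with m only when should_refine(...) returns True.
```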

If this is right

  • OLR improves robustness under both inactive and active noisy-label settings across all tested noise ratios.
  • The method produces consistent gains on six in-distribution mathematical reasoning benchmarks: AIME24/25, AMC, MATH-500, Minerva, and Olympiad.
  • Gains extend to three out-of-distribution tasks: ARC-c, GPQA-diamond, and MMLU-pro.
  • Early Correctness Coherence allows corrections to begin safely before noisy samples lag in later training stages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The rollout-based correction rule may transfer to other RLVR variants or to non-math reasoning domains where label noise arises from scarce experts.
  • Combining OLR with existing noise-robust RL techniques could further reduce the need for perfect supervision in large-scale reasoning training.
  • The two-condition check on slope and consistency offers a testable template for label cleaning in any rollout-driven training loop.

Load-bearing premise

A positive slope in the majority answer's rollout pass rate combined with stable historical consistency reliably indicates the correct label and can be used for safe correction without introducing new errors.

What would settle it

An experiment that applies OLR to a controlled dataset where majority-voted answers are known to be wrong and measures whether final model accuracy falls below the no-refinement baseline.
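A hedged sketch of such a stress test, assuming access to a trainer that can be run with refinement on or off and a pool of problems whose majority-voted answer is deliberately wrong; train_rlvr and evaluate_accuracy are placeholders, not the paper's code.

```python
def adversarial_olr_stress_test(problems, wrong_majority_answers,
                                train_rlvr, evaluate_accuracy):
    """Hypothetical check: does refinement hurt when the majority vote is known-wrong?

    problems:               samples whose rollouts consistently favour an incorrect answer
    wrong_majority_answers: the incorrect answers the policy tends to produce for them
    train_rlvr:             placeholder trainer, called with refine=True or refine=False
    evaluate_accuracy:      placeholder evaluator on a held-out clean test set
    """
    # Seed the training set with the known-wrong labels so the refinement
    # triggers (rising pass rate + consistency) have a chance to fire on them.
    poisoned = list(zip(problems, wrong_majority_answers))

    baseline_model = train_rlvr(poisoned, refine=False)
    refined_model = train_rlvr(poisoned, refine=True)

    baseline_acc = evaluate_accuracy(baseline_model)
    refined_acc = evaluate_accuracy(refined_model)

    # If refinement locks in the dominant wrong answers, refined_acc should
    # fall below baseline_acc; otherwise the correction rule behaved safely.
    return {"baseline": baseline_acc, "refined": refined_acc,
            "refinement_hurts": refined_acc < baseline_acc}
```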

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in RLVR for reasoning LLMs, noisy labels can be either inactive (reducing efficiency) or active (reinforced by the policy), and that an observed Early Correctness Coherence phenomenon motivates Online Label Refinement (OLR). OLR progressively corrects labels to the majority-voted answer when the rollout pass rate of that answer shows a positive slope and historical consistency is stable. Across noise ratios 0.1–0.9, OLR yields average gains of 3.6–3.9% on six in-distribution math benchmarks and 3.3–4.6% on three OOD tasks under both inactive and active noise.

Significance. If the OLR triggers reliably select correct labels rather than consistently generated incorrect ones, the work would be significant: it supplies the first systematic treatment of noisy supervision in RLVR, demonstrates concrete robustness gains on both ID and OOD reasoning benchmarks, and offers a practical, rollout-driven correction mechanism that exploits training dynamics without requiring external clean data.

major comments (2)
  1. [§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.
  2. [§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.
minor comments (2)
  1. [Abstract] Abstract and §4: report the number of independent runs and standard deviations or confidence intervals alongside the average gains; current presentation of “consistent gains” is difficult to interpret without variance information.
  2. [§3.1] §3.1: define “rollout pass rate” and “historical consistency” with explicit formulas or pseudocode so that the OLR update rule can be re-implemented without ambiguity.
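For illustration only, one way these two quantities could be written down; the rollout count K, window W, and threshold τ below are assumed notation, not the paper's definitions.

```latex
% Hedged illustrative definitions (assumed notation, not the paper's):
% pass rate of the current majority answer a_t^* over K rollouts y_{t,1},...,y_{t,K}
\[
  p_t = \frac{1}{K}\sum_{k=1}^{K} \mathbf{1}\big[\, y_{t,k} = a_t^{*} \,\big]
\]
% slope condition: least-squares slope of p over the last W updates must be positive
\[
  \hat{\beta}_t = \frac{\sum_{i=1}^{W} (i - \bar{i})\,(p_{t-W+i} - \bar{p})}{\sum_{i=1}^{W} (i - \bar{i})^{2}} > 0
\]
% historical consistency: fraction of the last W updates whose majority answer
% matches the current one must exceed a threshold tau
\[
  c_t = \frac{1}{W}\sum_{i=1}^{W} \mathbf{1}\big[\, a_{t-W+i}^{*} = a_t^{*} \,\big] \ge \tau
\]
```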

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work analyzing noisy supervision in RLVR and proposing Online Label Refinement. We address each major comment point-by-point below, providing clarifications and committing to revisions that strengthen the manuscript's rigor and reproducibility without altering its core claims.

read point-by-point responses
  1. Referee: [§3] §3 (OLR definition): the two correction triggers (positive slope in majority-answer rollout pass rate + stable historical consistency) are presented as sufficient to identify the correct label. For active noise at ratios 0.7–0.9 this is load-bearing; once the policy begins emitting the noisy answer at high frequency, both triggers can become positive for the incorrect label, causing OLR to reinforce rather than correct the error. The reported gains at these ratios therefore require explicit verification that corrections are not simply locking in the dominant (wrong) distribution.

    Authors: We appreciate the referee highlighting this potential failure mode for high-ratio active noise. The Early Correctness Coherence phenomenon we identify shows that correct labels achieve rising rollout success earlier than noisy ones, so the slope and consistency triggers activate preferentially for the correct majority answer before policy overfitting occurs. To provide the requested explicit verification, we have added a new analysis subsection in §3 (with supporting figures) that reports the fraction of OLR corrections aligning with ground-truth labels across all noise ratios, including 0.7–0.9 active noise; this shows that the large majority of refinements are to correct labels rather than reinforcing errors. revision: yes

  2. Referee: [§4–5] Experiments (§4–5, Tables 1–3): average gains are reported without statistical significance tests, without the exact numerical thresholds used for slope and consistency, and without ablation on whether those thresholds were selected after seeing test performance. Because the central robustness claim rests on these choices, the absence of these details makes it impossible to judge whether the improvements are reproducible or sensitive to post-hoc tuning.

    Authors: We agree that these omissions limit reproducibility assessment. We have revised §3 and the experimental sections to state the exact thresholds (slope threshold of 0.02 and consistency threshold of 0.75 over a 3-update window) and moved their full definition to Appendix B. We now include paired statistical significance tests (bootstrap resampling over 5 seeds) confirming p < 0.05 for the reported average gains on both ID and OOD benchmarks. We have also added an ablation study on threshold sensitivity performed on a held-out validation split (distinct from test sets), showing stable performance within small perturbations and confirming that thresholds were not tuned on test data. revision: yes
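For concreteness, a minimal sketch of the kind of paired bootstrap test described above, assuming per-seed accuracy pairs for OLR and the baseline; the resampling scheme and the numbers in the usage comment are assumptions, not the authors' code or results.

```python
import numpy as np

def paired_bootstrap_pvalue(olr_scores, baseline_scores,
                            n_resamples=10_000, seed=0):
    """One-sided paired bootstrap: estimated probability that the mean gain <= 0.

    olr_scores, baseline_scores: per-seed (or per-benchmark) accuracies, aligned
    so that index i refers to the same seed/benchmark in both sequences.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(olr_scores, float) - np.asarray(baseline_scores, float)
    n = len(diffs)
    resampled_means = np.array([
        diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_resamples)
    ])
    return float((resampled_means <= 0.0).mean())

# Usage with five hypothetical seed-level accuracies (illustrative numbers only):
# p = paired_bootstrap_pvalue([42.1, 41.8, 42.5, 41.9, 42.3],
#                             [39.0, 38.7, 39.4, 38.9, 39.1])
# p < 0.05 would support the claim that the average gain is not a seed-level fluke.
```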

Circularity Check

0 steps flagged

No significant circularity; OLR is an empirically motivated heuristic

full rationale

The paper identifies an Early Correctness Coherence phenomenon from training dynamics and defines OLR's correction triggers directly from observable rollout statistics (a positive slope in the majority answer's pass rate plus historical consistency). These triggers are not fitted to or defined in terms of the final benchmark gains; the claimed robustness improvements are measured on held-out evaluation sets after applying the rule. No equations, self-citations, or uniqueness theorems reduce the reported 3.6–4.6% gains to the method's own inputs by construction; the claims are validated against external benchmarks rather than the method's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that rollout pass-rate trends can serve as a reliable proxy for label correctness once the policy begins to improve, plus standard RLVR assumptions about verifiable rewards.

free parameters (1)
  • slope and consistency thresholds for label update
    The two conditions that trigger correction are defined with implicit cutoffs that must be chosen or tuned.
axioms (1)
  • domain assumption: Majority-voted rollout answers become increasingly reliable as the policy improves
    Invoked to justify progressive self-correction in OLR.

pith-pipeline@v0.9.0 · 5661 in / 1288 out tokens · 37522 ms · 2026-05-13T16:56:01.845530+00:00 · methodology

