Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization
Pith reviewed 2026-05-22 08:50 UTC · model grok-4.3
The pith
Semantic entropy fails to regulate gradient variance in LLM post-training because of an anisotropic gap and a calibration gap, which geometry-aware measures and reward calibration can close.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that current entropy-based estimators suffer from two gaps: an anisotropic gap, which prevents them from capturing directional semantic disagreement in response space, and a calibration gap, which misaligns uncertainty estimates with the quality of the learning signal from rewards. Motivated by this analysis, the authors propose Geometric-aware Calibrated Policy Optimization (GCPO), which combines geometry-aware measures of semantic disagreement with reward-based calibration that aligns uncertainty with learning signal strength. The authors report more faithful tracking of gradient variability and consistent performance gains on multiple benchmarks.
What carries the argument
The Geometric-aware Calibrated Policy Optimization framework, which integrates geometry-aware measures to capture semantic disagreement among responses with reward-based calibration to align uncertainty estimates with learning signal strength.
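As a rough illustration of what a geometry-aware measure could look like, the sketch below computes a cosine dispersion over response embeddings. The paper passage quoted in the Lean theorems section of this page names Cosine Dispersion (CD), but the page does not give its formula, so the definition used here (one minus the mean pairwise cosine similarity) and all function names are assumptions, not the authors' method.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_dispersion(embeddings):
    """ASSUMED definition: 1 - mean pairwise cosine similarity.

    `embeddings` is a list of at least two non-zero response vectors.
    """
    pairs = list(combinations(embeddings, 2))
    mean_similarity = sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)
    return 1.0 - mean_similarity


# Toy embeddings (hypothetical): same direction vs. different directions.
aligned = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(cosine_dispersion(aligned))
print(cosine_dispersion(spread))
```

Under this assumed definition, responses that point the same way score 0 regardless of their magnitude, while responses that point in different directions score higher, which is the kind of directional disagreement the anisotropic gap refers to.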
If this is right
- Uncertainty signals more faithfully characterize and regulate gradient variability during group-based optimization such as GRPO.
- Learning signal quality from rewards becomes better reflected in the uncertainty measures applied to model outputs.
- Post-training performance improves consistently across reasoning and alignment benchmarks by closing the identified gaps.
- Optimization dynamics gain stability when uncertainty is designed to match the needs of the training process rather than relying on entropy alone.
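For context on the group-based optimization these points refer to, GRPO scores each sampled response against the other responses drawn for the same prompt. The sketch below assumes the commonly used form (reward minus group mean, divided by group standard deviation); the choice of population standard deviation, the `eps` term, and the function name are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages for one prompt's sampled responses.

    eps (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes carries a learning signal ...
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
# ... while a group with identical rewards yields all-zero advantages,
# however semantically diverse the responses themselves may be.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)
```

The second case shows why a response-only uncertainty score can be misleading on its own: the responses may disagree with each other while the rewards provide nothing to learn from, which is the situation the calibration gap describes.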
Where Pith is reading between the lines
- Future uncertainty designs for LLM training may need to prioritize geometric structure and reward alignment over information-theoretic entropy measures.
- The same calibration approach could be tested in related settings where distinguishing signal quality from generated data is required, such as iterative self-improvement loops.
- Scaling the geometry-aware component to larger models would test whether the capture of semantic disagreement remains effective as response spaces grow more complex.
Load-bearing premise
Geometry-aware measures capture semantic disagreement in a way that regulates gradient variance, and reward-based calibration reliably aligns uncertainty estimates with learning signal quality.
What would settle it
An experiment showing that GCPO produces no reduction in gradient variance or no performance gains over entropy-based methods on standard post-training benchmarks would disprove the central claim.
Original abstract
Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable and unclear in how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: The anisotropic gap and The calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that entropy-based uncertainty estimators used to regulate group-based policy optimization (e.g., GRPO) in critic-free LLM post-training suffer from an anisotropic gap (failure to capture directional semantic disagreement in embedding space) and a calibration gap (misalignment between uncertainty and learning-signal strength). It presents empirical and theoretical analysis identifying these gaps, then introduces Geometric-aware Calibrated Policy Optimization (GCPO) that combines geometry-aware uncertainty measures with reward-based calibration to better track gradient variance and improve optimization dynamics. Experiments on multiple benchmarks reportedly show that GCPO more faithfully tracks gradient variability and yields consistent post-training gains.
Significance. If the gap analysis and the claimed improvements hold, the work supplies a principled lens for designing uncertainty signals that are explicitly aligned with optimization dynamics rather than treated as black-box regularizers. This perspective could inform more stable post-training pipelines for reasoning and alignment tasks. The manuscript does not report machine-checked proofs or fully parameter-free derivations, but the emphasis on linking uncertainty geometry to gradient regulation is a constructive contribution if the supporting evidence is strengthened.
major comments (3)
- [§3] Gap Analysis: The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.
- [§4.2] Reward-based Calibration: The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.
- [Experimental section] Gradient-variance plots: The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.
minor comments (2)
- [§4.1] Notation for the geometry-aware measure is introduced without an explicit equation reference in the main text; a numbered definition would improve readability.
- [Related Work] The abstract states 'to our knowledge, the first principled formulation,' but the related-work section does not explicitly contrast the new formulation against prior uses of embedding geometry in uncertainty estimation for RLHF or preference optimization.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our gap analysis and experimental validation. We respond to each major comment below and commit to revisions that address the identified issues while preserving the core contributions of the work.
- Referee: [§3] Gap Analysis: The anisotropic gap is motivated as directional variance in semantic embeddings, yet the manuscript provides no formal definition or bound showing that the proposed geometry-aware measure provably reduces this variance relative to standard entropy; without such a relation the claim that GCPO 'more faithfully tracks gradient variability' remains interpretive rather than derived.
  Authors: We agree that the current manuscript motivates the anisotropic gap via directional variance but stops short of a formal definition or bound. In the revision we will add a precise definition of the geometry-aware uncertainty as the trace of the covariance matrix of the response embeddings projected onto its leading principal directions, together with a lemma bounding the reduction in expected gradient variance relative to isotropic entropy under a Lipschitz assumption on the reward function. This will make the tracking claim derivable rather than interpretive.
  Revision: yes
- Referee: [§4.2] Reward-based Calibration: The calibration step aligns uncertainty estimates to reward signals that are themselves generated inside the same optimization loop used for policy updates. This introduces a circularity risk: the alignment claim may reduce to fitting the uncertainty estimator to the very reward data that drives the gradient, undermining the assertion that the method independently regulates learning-signal quality.
  Authors: The concern is valid in principle. However, the calibration procedure uses a lagged, exponentially-smoothed reward buffer computed from a frozen reference policy rather than the live policy gradients; the uncertainty scalar is therefore fitted to historical signal strength and does not directly modulate the current gradient direction. We will revise §4.2 to include an explicit information-flow diagram and a short proof sketch showing that the calibration operator is contractive with respect to the policy-update operator, thereby removing the circularity.
  Revision: yes
- Referee: [Experimental section] Gradient-variance plots: The reported improvements in tracking gradient variability are shown only for the full GCPO pipeline. An ablation isolating the geometry-aware component versus the calibration component is missing, making it impossible to determine which element closes which gap or drives the observed performance lift.
  Authors: We concur that component-wise ablations are necessary. The revised experimental section will report three additional curves on the gradient-variance plots: geometry-aware uncertainty alone, reward calibration alone, and the combined GCPO. Corresponding tables will quantify the marginal contribution of each module to both variance tracking and downstream benchmark gains, directly addressing the attribution question.
  Revision: yes
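Two quantities named in the simulated responses above can be made concrete with a small sketch: the trace of the embedding covariance projected onto its leading principal directions, and an exponentially smoothed reward statistic. This is only one plausible reading of those sentences; the function names, the use of the sample covariance, the `decay` value, and the toy data are assumptions, and nothing here is taken from the paper's implementation.

```python
import numpy as np


def projected_covariance_trace(embeddings, k):
    """Trace of the covariance projected onto its top-k principal directions.

    embeddings: (n, d) array-like with n >= 2 responses and d >= 2 dimensions.
    Equals the sum of the k largest eigenvalues of the sample covariance.
    """
    X = np.asarray(embeddings, dtype=float)
    cov = np.cov(X, rowvar=False)            # (d, d) sample covariance
    _, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = eigvecs[:, -k:]                    # (d, k) leading directions
    return float(np.trace(top.T @ cov @ top))


def ema_update(previous, observation, decay=0.9):
    """One step of exponential smoothing (assumed form of the reward buffer)."""
    return decay * previous + (1.0 - decay) * observation


# Hypothetical embeddings with the same total variance but different shape.
anisotropic = [[0.0, 0.0], [2.0, 0.0], [0.0, 0.0], [2.0, 0.0]]
isotropic = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

print(projected_covariance_trace(anisotropic, k=1))
print(projected_covariance_trace(isotropic, k=1))
print(ema_update(1.0, 0.0))
```

Both toy sets have the same total variance, yet the leading direction carries all of it in the first set and only half in the second, so a direction-aware score separates cases that a single scalar spread would not.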
Circularity Check
No significant circularity; derivation self-contained
The paper first performs empirical and theoretical analysis to identify the anisotropic and calibration gaps in existing entropy-based uncertainty estimators. It then motivates GCPO as a framework that integrates geometry-aware measures and reward-based calibration to address those gaps. No load-bearing step reduces a claimed prediction or uniqueness result to a fitted parameter or self-citation by construction; the alignment of uncertainty with learning signal strength is presented as an independent design choice motivated by the prior gap analysis rather than being tautological with the optimization loop itself. The central claims therefore retain independent content from the identified gaps and do not rely on renaming or smuggling prior results.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Uncertainty signals can be interpreted as mechanisms for characterizing and regulating gradient variance and learning signal quality.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: we introduce geometry-aware measures, including Cosine Dispersion (CD) and Barycentric Transport (BoT), to capture semantic disagreement beyond entropy, and further incorporate a Reward Dispersion (RD) module to align update strength with reward informativeness.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: $V(x) = \sum_k p_k \,\mathrm{Tr}(\mathrm{Cov}(g \mid Z = k)) + \mathrm{Tr}(\mathrm{Cov}(\mu_Z))$ (intra- vs inter-cluster gradient variance)
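The second quoted passage has the shape of the law of total variance: total gradient variance splits into a within-cluster term and a between-cluster term. The sketch below checks that identity numerically on toy data. Reading Z as a semantic cluster label, g as a per-response gradient, p_k as the empirical cluster frequency, and μ_Z as the cluster-mean gradient is an assumption based on the parenthetical in the passage; the helper names and the data are hypothetical.

```python
def mean_vector(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def trace_cov(vectors):
    """Trace of the population covariance: mean squared distance to the mean."""
    mu = mean_vector(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, mu)) for v in vectors) / len(vectors)


def decompose(gradients, labels):
    """Return (intra, inter) cluster terms using empirical p_k = n_k / n."""
    n = len(gradients)
    clusters = {}
    for g, z in zip(gradients, labels):
        clusters.setdefault(z, []).append(g)
    global_mean = mean_vector(gradients)
    intra = sum(len(c) / n * trace_cov(c) for c in clusters.values())
    inter = sum(
        len(c) / n * sum((m - gm) ** 2 for m, gm in zip(mean_vector(c), global_mean))
        for c in clusters.values()
    )
    return intra, inter


# Toy per-response gradients in two semantic clusters (hypothetical values).
gradients = [[1.0, 0.0], [3.0, 0.0], [0.0, 4.0], [0.0, 6.0]]
labels = [0, 0, 1, 1]

intra, inter = decompose(gradients, labels)
total = trace_cov(gradients)
print(intra, inter, total)
```

In this toy case most of the variance sits in the between-cluster term, which is the part a cluster-level disagreement measure would be expected to track.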
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, et al. Llms4all: A review of large language models across academic disciplines. arXiv preprint arXiv:2509.19580, 2025.
- [2] Yuxuan Tong, Xiwen Zhang, Rui Wang, Ruidong Wu, and Junxian He. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Advances in Neural Information Processing Systems, 37:7821–7846, 2024.
- [3] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [4] Chaoran Chen, Daodao Zhou, Yanfang Ye, Toby Jia-jun Li, and Yaxing Yao. Clear: Towards contextual llm-empowered privacy policy analysis and risk generation for large language model applications. In Proceedings of the 30th International Conference on Intelligent User Interfaces, pages 277–297, 2025.
- [5] Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
- [6] Zehong Wang, Fang Wu, Hongru Wang, Xiangru Tang, Bolian Li, Zhenfei Yin, Yijun Ma, Yiyang Li, Weixiang Sun, Xiusi Chen, et al. Why reasoning fails to plan: A planning-centric analysis of long-horizon decision making in llm agents. arXiv preprint arXiv:2601.22311, 2026.
- [7] Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, and Chuxu Zhang. Graph is a substrate across data modalities. arXiv preprint arXiv:2601.22384, 2026.
- [8] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [9] Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025.
- [10] Tianyi Ma, Yiyue Qian, Yiyang Li, Zehong Wang, Yifang Ding, Zheyuan Zhang, Yan Liang, Chuxu Zhang, and Yanfang Ye. Non-monotonic autoregressive sequence model. In ICML, 2026.
- [11] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce-style optimization for learning from human feedback in llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12248–12267, 2024.
- [12] Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025.
- [13] Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, et al. Agentic entropy-balanced policy optimization. arXiv preprint arXiv:2510.14545, 2025.
- [14] Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, et al. Lm-polygraph: Uncertainty estimation for language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages ...
- [15] Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, et al. Fact-checking the output of large language models via token-level uncertainty quantification. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9367–9...
- [16] Zehong Wang, Zheyuan Zhang, Nitesh V Chawla, Chuxu Zhang, and Yanfang Ye. Gft: Graph foundation model with transferable tree vocabulary. Advances in Neural Information Processing Systems, 37:107403–107443, 2024.
- [17] Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. Seed-grpo: Semantic entropy enhanced grpo for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346, 2025.
- [18] Qingyang Zhang, Haitao Wu, Changqing Zhang, Peilin Zhao, and Yatao Bian. Right question is already half the answer: Fully unsupervised llm reasoning incentivization. arXiv preprint arXiv:2504.05812, 2025.
- [19] Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making. arXiv preprint arXiv:2506.17419, 2025.
- [20] Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for llms from semantic similarities. Advances in Neural Information Processing Systems, 37:8901–8929, 2024.
- [21] Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, and Xihui Liu. Grpo-care: Consistency-aware reinforcement learning for multimodal reasoning. arXiv preprint arXiv:2506.16141, 2025.
- [22] Zhihua Wen, Zhizhao Liu, Zhiliang Tian, Shilong Pan, Zhen Huang, Dongsheng Li, and Minlie Huang. Scenario-independent uncertainty estimation for llm-based question answering via factor analysis. In Proceedings of the ACM on Web Conference 2025, pages 2378–2390, 2025.
- [23] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In ICLR, 2023.
- [24] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, second edition, 2018.
- [25] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025.
- [26] John Roberts and Russ Tedrake. Signal-to-noise ratio analysis of policy gradient algorithms. NeurIPS, 2008.
- [27] George Casella and Roger Berger. Statistical inference. Chapman and Hall/CRC, 2024.
- [28] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
- [29] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021.
- [30] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In EMNLP, 2018.
- [31] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
- [32] Zhen Huang, Zengzhi Wang, Shijie Xia, Xuefeng Li, Haoyang Zou, Ruijie Xu, Run-Ze Fan, Lyumanshan Ye, Ethan Chern, Yixin Ye, et al. Olympicarena: Benchmarking multi-discipline cognitive reasoning for superintelligent ai. Advances in Neural Information Processing Systems, 37:19209–19253, 2024.
- [33] Siheng Xiong, Ali Payani, Yu’an Yang, and Faramarz Fekri. Deliberate reasoning in language models as structure-aware planning with an accurate world model. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31900–31931, 2025.
- [34] Chris Sanchez and Zheyuan Zhang. The effects of in-domain corpus size on pre-training bert. arXiv preprint arXiv:2212.07914, 2022.
- [35] Siheng Xiong, Ali Payani, and Faramarz Fekri. Enhancing language model reasoning with structured multi-level modeling. In The Fourteenth International Conference on Learning Representations, 2025.
- [36] Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C Kerce, and Faramarz Fekri. Scaling search-augmented llm reasoning via adaptive information control. arXiv preprint arXiv:2602.01672, 2026.
- [37] Peiyu Li, Xiaobao Huang, Yijun Tian, and Nitesh V Chawla. Cheffusion: Multimodal foundation model integrating recipe and food image generation. In CIKM, 2024.
- [38] Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, and Nitesh V Chawla. Adaptive testing for llm evaluation: A psychometric alternative to static benchmarks. arXiv, 2025.
- [39] Peiyu Li, Xiaobao Huang, Ting Hua, and Nitesh V Chawla. Crochetbench: Can vision-language models move from describing to doing in crochet domain? arXiv, 2025.
- [40] Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, and Yanfang Ye. Mapro: Recasting multi-agent prompt optimization as maximum a posteriori inference. In Findings of the Association for Computational Linguistics: EACL 2026, pages 4458–4480, 2026.
- [41] Zheyuan Zhang, Kaiwen Shi, Zhengqing Yuan, Zehong Wang, Tianyi Ma, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Agentrouter: A knowledge-graph-guided llm router for collaborative multi-agent question answering. arXiv preprint arXiv:2510.05445, 2025.
- [42] Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galassi, Chuxu Zhang, and Yanfang Ye. Ng-router: Graph-supervised multi-agent collaboration for nutrition question answering. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7508–7527, 2026.
- [43] Jiatan Huang, Zheyuan Zhang, Kaiwen Shi, Yanfang Ye, and Chuxu Zhang. Evolverouter: Co-evolving routing and prompt for multi-agent question answering. arXiv preprint arXiv:2604.05149, 2026.
- [44] Jiatan Huang, Zheyuan Zhang, Tianyi Ma, Mingchen Li, Yaning Zheng, Yanfang Ye, and Chuxu Zhang. Glen-bench: A graph-language based benchmark for nutritional health. arXiv preprint arXiv:2601.18106, 2026.
- [45] Han Bao, Zheyuan Zhang, Pengcheng Jing, Zhengqing Yuan, Kaiwen Shi, and Yanfang Ye. Drift-bench: Diagnosing cooperative breakdowns in llm agents under input faults via multi-turn interaction. arXiv preprint arXiv:2602.02455, 2026.
- [46] Yuelin Hu, Zhengxue Cheng, Wei Liu, and Li Song. Entropy-gated selective policy optimization: Token-level gradient allocation for hybrid training of large language models. arXiv preprint arXiv:2602.03309, 2026.
- [47] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.
- [48] Potsawee Manakul, Adian Liusie, and Mark John Francis Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. ArXiv, abs/2303.08896,
- [49] URL https://api.semanticscholar.org/CorpusID:257557820
- [50] Caiqi Zhang, Fangyu Liu, Marco Basaldella, and Nigel Collier. Luq: Long-text uncertainty quantification for llms. ArXiv, abs/2403.20279, 2024. URL https://api.semanticscholar.org/CorpusID:268793903
- [51]
- [52] URL https://api.semanticscholar.org/CorpusID:273654396
- [53] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024.
- [54] Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Dingqian Hong, and Hui Xiong. Gvpo: Group variance policy optimization for large language model post-training. arXiv preprint arXiv:2504.19599, 2025.
- [55] Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, et al. A survey of weight space learning: Understanding, representation, and generation. arXiv preprint arXiv:2603.10090, 2026.
- [56] Wenwen Qiang, Ziyin Gu, Jiahuan Zhou, Jie Hu, Jingyao Wang, Changwen Zheng, and Hui Xiong. On the plasticity and stability for post-training large language models. arXiv preprint arXiv:2602.06453, 2026.
- [57] Wenzheng Zhang and Karl Stratos. Rence: Learning to reason by noise contrastive estimation. arXiv preprint arXiv:2601.22432, 2026.
- [58] Wulin Xie, Rui Dai, Ruidong Ding, Kaikui Liu, Xiangxiang Chu, Xinwen Hou, and Jie Wen. Q-hawkeye: Reliable visual policy optimization for image quality assessment. arXiv preprint arXiv:2601.22920, 2026.
- [59] Yu Luo, Shuo Han, Yihan Hu, Dong Li, and Jianye Hao. Ratio-variance regularized policy optimization for efficient llm fine-tuning. arXiv preprint arXiv:2601.03320, 2026.
- [60] Kangda Wei and Ruihong Huang. Mmr-grpo: Accelerating grpo-style training through diversity-aware reward reweighting. arXiv preprint arXiv:2601.09085, 2026.
- [61] Zhenwen Liang, Sidi Lu, Wenhao Yu, Kishan Panaganti, Yujun Zhou, Haitao Mi, and Dong Yu. Can llms guide their own exploration? gradient-guided reinforcement learning for llm reasoning. arXiv preprint arXiv:2512.15687, 2025.
- [62] Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, and Sarath Chandar. Grpo-lambda: Credit assignment improves llm reasoning. arXiv preprint arXiv:2510.00194, 2025.
A Related Work
A.1 Uncertainty Estimation for Generation and Reasoning.
Large language models (LLMs) have advanced rapidly in recent years [1, 33–39]. Building on this progress...