Code Is More Than Text: Uncertainty Estimation for Code Generation

Caiqi Zhang; Haopeng Wang; Nigel Collier; Xiaodong Gu; Yeheng Chen; Yuexian Li; Yuling Shi

arXiv: 2606.09577 · v1 · pith:BWCOFYAFnew · submitted 2026-06-08 · 💻 cs.CL · cs.LG · cs.SE

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

Pith reviewed 2026-06-27 16:08 UTC · model grok-4.3

classification: 💻 cs.CL · cs.LG · cs.SE

keywords: uncertainty estimation · code generation · large language models · token entropy · pseudo-code consistency · behavioral consistency · AUROC

The pith

Three code-specific uncertainty axes raise average AUROC from 0.696 to 0.776 across five code LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that uncertainty estimation for code generation must account for properties unique to code rather than borrowing directly from natural language methods. Code differs from natural language because a single wrong token can break an entire program, because intent and implementation can diverge, and because programs can be run to check behavior. The authors turn these distinctions into three separate uncertainty measures—lexical via top-k token entropy, algorithmic via pseudo-code consistency, and functional via behavioral consistency—and show that an ensemble of the three improves detection of incorrect generations. A cheap single-pass lexical signal alone matches expensive multi-pass baselines on some models while costing far less. This suggests that treating code as just another form of text misses measurable opportunities for better reliability.

Core claim

Code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points).
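As a rough illustration of the lexical axis, the sketch below computes a Top-K token entropy from per-token log-probabilities. This page does not give the paper's formula, so the renormalization over the K retained candidates, the mean aggregation over tokens, and the names `top_k_entropy` and `sequence_uncertainty` are all assumptions made for illustration.

```python
import math


def top_k_entropy(logprobs, k=5):
    """Shannon entropy (in nats) of the renormalized top-k distribution
    at a single token position. `logprobs` holds candidate log-probabilities."""
    if not logprobs or k < 1:
        raise ValueError("need at least one log-probability and k >= 1")
    top = sorted(logprobs, reverse=True)[:k]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = top[0]
    weights = [math.exp(lp - peak) for lp in top]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(per_token_logprobs, k=5):
    """Mean top-k entropy over all generated tokens.
    Assumption: the mean is only one of several possible aggregations."""
    if not per_token_logprobs:
        raise ValueError("empty generation")
    entropies = [top_k_entropy(lp, k) for lp in per_token_logprobs]
    return sum(entropies) / len(entropies)
```

Because the score needs only the log-probabilities already produced during decoding, it can be computed in a single pass, which is consistent with the cost advantage the paper reports for the lexical axis.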

What carries the argument

Three orthogonal uncertainty axes (lexical via Top-K token entropy, algorithmic via pseudo-code consistency, functional via behavioral consistency) that capture token fragility, intent-code gap, and executability.
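To make the functional axis concrete, here is one plausible reading of behavioral consistency: run several candidate programs on the same inputs, group them by observed outputs, and treat disagreement as uncertainty. The paper's actual procedure is not described on this page. The function name `behavioral_uncertainty`, the use of in-process Python callables as stand-ins for sampled programs, and the "one minus largest cluster share" score are assumptions.

```python
from collections import Counter


def behavioral_uncertainty(candidates, inputs):
    """Return 1 - (share of candidates in the largest behavior cluster).

    `candidates` are callables standing in for sampled programs;
    `inputs` are the shared probe inputs. Both are hypothetical here.
    """
    if not candidates:
        raise ValueError("need at least one candidate program")
    signatures = []
    for program in candidates:
        outputs = []
        for x in inputs:
            try:
                outputs.append(repr(program(x)))
            except Exception as exc:
                # A crash counts as a behavior of its own.
                outputs.append("error:" + type(exc).__name__)
        signatures.append(tuple(outputs))
    largest_cluster = Counter(signatures).most_common(1)[0][1]
    return 1.0 - largest_cluster / len(candidates)
```

In practice, generated code should run in a sandbox with timeouts rather than in the host process; the sketch omits that for brevity.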

If this is right

The three-axis ensemble improves average AUROC by 8.1 points over the strongest NL-derived baseline across five code LLMs.
On Qwen3-14B the single-pass Top-K token entropy alone matches the strongest multi-pass baseline while costing over 3x less.
Top-K token entropy remains a competitive low-cost signal across multiple models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three axes could be tested on other structured outputs such as formal proofs or API calls.
Functional consistency might be strengthened by running generated code against larger or more diverse test suites.
The cheap single-pass lexical signal could serve as a first filter before invoking more expensive consistency checks.
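The first-filter idea in the last item could be sketched as a two-threshold cascade. This is an editorial illustration, not something the paper describes: the function `cascade`, the thresholds `low` and `high`, and the scoring callables are all hypothetical.

```python
def cascade(sample, cheap_score, expensive_score, low, high):
    """Route a generation using a cheap uncertainty score first.

    Only samples whose cheap score falls strictly between `low` and `high`
    pay for the expensive check. Returns (decision, score_used).
    """
    if low > high:
        raise ValueError("low must not exceed high")
    u = cheap_score(sample)
    if u <= low:
        return "accept", u        # confident enough, skip the costly check
    if u >= high:
        return "review", u        # clearly uncertain, send to a human
    return "escalated", expensive_score(sample)
```

The thresholds would need tuning on held-out data; the sketch only shows the control flow that keeps the expensive consistency checks off the common path.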

Load-bearing premise

The three distinctions between code and natural language can be turned into independent measurable axes whose combination produces additive gains without much overlap.

What would settle it

An experiment in which the three axes show high mutual correlation or in which their ensemble produces no AUROC gain over the best single axis or NL baseline would falsify the claim.
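The first half of that test, mutual correlation between axes, can be sketched with a plain Pearson coefficient over per-sample scores. The axis names and data layout below are assumptions, and Pearson correlation only captures linear dependence, so a low value would be suggestive rather than proof of independence.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(axes):
    """Map each unordered pair of axis names to its Pearson correlation.
    `axes` maps a (hypothetical) axis name to per-sample uncertainty scores."""
    names = sorted(axes)
    return {
        (a, b): pearson(axes[a], axes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```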

Figures

Figures reproduced from arXiv: 2606.09577 by Caiqi Zhang, Haopeng Wang, Nigel Collier, Xiaodong Gu, Yeheng Chen, Yuexian Li, Yuling Shi.

**Figure 3.** Computational efficiency comparison of un...

**Figure 2.** Per-sample disagreement analysis on Qwen3-...

**Figure 4.** AUROC of Top-K entropy as a function of K, averaged across models and benchmarks. Performance improves from K=1 to K=5, then plateaus. Qwen3-14B block reproduces...

**Figure 5.** AUROC and PRAUC of testcases as a function...

Original abstract

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper makes a case for code-specific uncertainty axes that deliver an 8-point AUROC lift, but the abstract leaves the measurement details too thin to judge how solid the gains really are.

The one or two things your colleague should know: this paper claims that three uncertainty axes built from code properties—token fragility, intent-code gap, and executability—raise average AUROC from 0.696 to 0.776 across five LLMs, and that a cheap single-pass Top-K token entropy can match stronger multi-pass baselines on at least one model.

What is actually new is the explicit mapping of those three code distinctions into lexical, algorithmic, and functional uncertainty measures instead of just reusing NL entropy or temperature. The paper does a reasonable job spelling out why direct ports from text generation miss code-specific failure modes and then shows the ensemble result plus the cost comparison on Qwen3-14B.

The soft spots sit in the missing mechanics. The abstract does not spell out how pseudo-code consistency or behavioral consistency are computed, so it is hard to tell whether the three axes stay orthogonal in practice or whether the reported lift depends on particular baseline definitions or dataset choices. If those measures turn out to overlap more than claimed or introduce their own artifacts, the additive gain could shrink. The weakest assumption is that the three properties are the primary distinctions and that their instantiations will combine cleanly.

This is the sort of paper that matters for people working on selective prediction or human review loops for code LLMs. A reader who needs practical signals for agentic systems would get usable ideas from it. It deserves a serious referee because the central claim is testable and the direction of the results is plausible, even if the full methods and ablations need close checking.

I would recommend sending it to peer review.

Referee Report

0 major / 3 minor

Summary. The paper argues that uncertainty estimation (UE) for code-generating LLMs should exploit three code-specific properties—token fragility, intent-code gap, and executability—rather than porting NL methods. These are instantiated as three axes (lexical: Top-K token entropy; algorithmic: pseudo-code consistency; functional: behavioral consistency) claimed to be orthogonal. Across five code LLMs the three-axis ensemble raises average AUROC from 0.696 (strongest NL baseline) to 0.776; the lexical axis alone matches the best multi-pass baseline on Qwen3-14B while being >3× cheaper.

Significance. If the reported AUROC gains and efficiency results hold under full experimental scrutiny, the work would supply a concrete, code-tailored alternative to NL-derived UE and demonstrate measurable practical benefit for selective prediction in code generation.

minor comments (3)

Abstract and §3: the claim that the three axes are 'orthogonal' and yield 'additive gains' requires an explicit ablation or correlation table showing pairwise overlap; without it the ensemble improvement cannot be attributed to complementarity rather than simple averaging.
§4 (experimental setup): the five models, exact datasets, prompt templates, and definition of 'behavioral consistency' (e.g., test-case generation or execution oracle) are not stated in the provided abstract; these details are load-bearing for reproducing the 0.776 AUROC.
Table 1 or equivalent: report per-model AUROC for each axis individually and for all pairwise combinations so readers can verify that the lexical axis is indeed competitive and that the full ensemble is not dominated by one component.
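The ablation requested in the last comment could be produced along the following lines. This is a sketch under stated assumptions: labels use 1 for an incorrect generation and 0 for a correct one, the ensemble is an unweighted mean of z-scored axis scores (the paper's combination rule is not given on this page), and the names `auroc`, `zscore`, and `ablation_table` are illustrative.

```python
from itertools import combinations
from statistics import mean, pstdev


def auroc(scores, labels):
    """Probability that an incorrect sample (label 1) gets a higher
    uncertainty score than a correct one (label 0); ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def zscore(xs):
    """Standardize scores so axes on different scales can be averaged."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]


def ablation_table(axes, labels):
    """AUROC for every non-empty subset of axes (single, pairwise, full)."""
    z = {name: zscore(scores) for name, scores in axes.items()}
    table = {}
    for size in range(1, len(z) + 1):
        for combo in combinations(sorted(z), size):
            ensemble = [mean(vals) for vals in zip(*(z[n] for n in combo))]
            table[combo] = auroc(ensemble, labels)
    return table
```

Reading the table row by row shows whether the full ensemble beats its best single component or merely tracks it.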

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. The summary accurately captures the core argument that code-specific properties warrant tailored uncertainty axes rather than direct ports from natural language methods, along with the reported AUROC gains and efficiency advantages.

Circularity Check

0 steps flagged

No significant circularity

The paper's central chain maps three stated code properties (token fragility, intent-code gap, executability) to three named uncertainty axes (lexical Top-K entropy, pseudo-code consistency, behavioral consistency) and reports an empirical AUROC lift from 0.696 to 0.776 on five models. No equations, fitted parameters, or self-citations are visible that reduce any reported prediction or ensemble gain to a quantity defined by the same data or prior author work. The orthogonality claim is framed as an empirical hypothesis whose support is the observed additive improvement rather than a definitional identity or imported uniqueness theorem. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the three code properties can be operationalized into independent axes; no free parameters or invented entities are explicitly named in the abstract.

axioms (1)

Domain assumption: Code differs from natural language in token fragility, intent-code gap, and executability, which can be measured as orthogonal uncertainty axes.
Source: stated directly in the abstract as the justification for the three-axis design.

