FASE: Fast Adaptive Semantic Entropy for Code Quality

Ladan Tahvildari; Shizhe Lin

arxiv: 2606.09800 · v1 · pith:XNFOMMQRnew · submitted 2026-06-08 · 💻 cs.SE · cs.AI· cs.MA

FASE: Fast Adaptive Semantic Entropy for Code Quality

Shizhe Lin , Ladan Tahvildari This is my paper

Pith reviewed 2026-06-27 15:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.MA

keywords semantic entropycode generationfunctional correctnessminimum spanning treeuncertainty quantificationmulti-agent systemsLLM evaluationcode quality

0 comments

The pith

FASE approximates functional correctness of generated code using minimum spanning trees on dissimilarity graphs, delivering 25% better correlation to ground truth at 0.3% of the computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fast Adaptive Semantic Entropy (FASE) as a way to measure uncertainty in multi-agent code generation without relying on expensive LLM-based equivalence checks. It builds structural and semantic dissimilarity graphs from code embeddings and uses the minimum spanning tree to approximate functional equivalence classes. This approach is evaluated on HumanEval and BigCodeBench, showing improved correlation with actual test case results compared to traditional methods. The low cost makes it suitable for real-time use in autonomous software development systems.

Core claim

FASE approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. When using the Qwen3-Embedding-8B model, it achieves a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases, while requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches.

What carries the argument

Minimum spanning tree of structural and semantic dissimilarity graphs built from embeddings, which groups code outputs into approximate functional equivalence classes without LLM entailment checks.

If this is right

FASE can replace costly semantic entropy calculations in multi-agent workflows for better reliability.
It enables faster uncertainty quantification for optimizing code generation agents.
Improved metrics allow better detection of hallucinations in LLM-generated code.
Scales to larger benchmarks like BigCodeBench with minimal overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could extend to other domains beyond code, such as natural language generation where equivalence is hard to check.
Combining FASE with other cheap metrics might further improve accuracy without added cost.
Testing on more diverse LLMs beyond Qwen3-Embedding-8B would validate generalizability.

Load-bearing premise

The minimum spanning tree from embedding-based dissimilarity graphs accurately captures the same functional equivalence groups that LLM entailment checks would identify.

What would settle it

Run both FASE and traditional LLM-entailment semantic entropy on the same set of code samples and measure how often their equivalence groupings differ; if they frequently disagree on which outputs are equivalent, the approximation fails.

Figures

Figures reproduced from arXiv: 2606.09800 by Ladan Tahvildari, Shizhe Lin.

**Figure 1.** Figure 1: The risk of hallucination from unreliable agents (red) undermines the credibility of coherent agents (blue) and propagates [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: The workflow of computing encoder-based semantic entropy for the samples of code generated for a given task. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Pass@1 distribution of code generated by Mistral, CodeLlama, DeepseekCoder, Qwen2.5-Coder on HumanEval and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: The ratio of mean distance in Pairwise Distance Matrix (PDM) and mean edge weight in MST categorized by their connected [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: The ratio between mean distance in Pairwise Distance Matrix (PDM) and mean edge weight in MST categorized for tasks with [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: The Spearman’s 𝜌 correlation measured between the semantic entropy of each method and the pass@1 results when only the coder is involved [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: The Spearman’s 𝜌 correlation measured between the semantic entropy of each method and the pass@1 results when the analyst agent provides additional instruction to the coder The adaptive clustering approaches exhibit stronger correlation when an analyst agent is introduced before the coder agent. As shown in [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

read the original abstract

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FASE uses MSTs on embedding dissimilarity graphs to approximate semantic entropy for code at low cost, but lacks shown agreement with the LLM entailment clusters it replaces.

read the letter

The main thing to know is that FASE builds graphs from structural and semantic dissimilarities of code embeddings, then uses the minimum spanning tree to approximate semantic entropy without running LLM entailment checks. It reports a 25% Spearman correlation lift and 19% ROCAUC gain versus Pass@1 on HumanEval and BigCodeBench, at roughly 0.3% of the usual runtime.

What is new is the specific construction that combines those two dissimilarity types into one graph for code samples. The paper does well by focusing on the real bottleneck of cost in multi-agent code generation and by giving concrete numbers on both accuracy and speed.

The soft spot is the missing link between FASE clusters and the original entailment-based equivalence classes. The gains are measured only against ground-truth test outcomes, with no reported agreement metric such as adjusted Rand index or pairwise match rate. If the partitions differ substantially, the numbers reflect a different signal rather than a validated cheaper proxy. Graph edge rules, the structural-versus-semantic weight, and threshold choices are also not detailed enough in the abstract to judge sensitivity.

This is for researchers working on uncertainty quantification in LLM-based code tools. A reader who needs fast entropy estimates for agent workflows would find the idea worth testing. It deserves peer review because the cost reduction is a practical target and the central assumption can be checked directly with the right experiments.

Referee Report

3 major / 1 minor

Summary. The paper introduces Fast Adaptive Semantic Entropy (FASE), which approximates functional correctness and semantic entropy for LLM-generated code via minimum spanning trees computed on structural and semantic dissimilarity graphs built from embeddings. On HumanEval and BigCodeBench it reports a 25% average improvement in Spearman correlation and 19% increase in ROCAUC (vs. Pass@1) over LLM-entailment semantic entropy while using only ~0.3% of the runtime, positioning FASE as a low-cost alternative for multi-agent code generation uncertainty quantification.

Significance. If the core approximation is validated, FASE would supply a practical, scalable substitute for expensive LLM-based equivalence checks in uncertainty estimation for code, directly addressing reliability issues in multi-agent workflows. The efficiency claim is notable and the embedding-based proxy idea is a reasonable direction, but the significance hinges on whether the MST clusters faithfully stand in for entailment-derived functional equivalence classes.

major comments (3)

[Evaluation / Results] The central claim that FASE MST clusters reliably approximate the functional equivalence classes obtained by LLM-driven entailment checks is load-bearing, yet the manuscript provides no quantitative agreement metric (adjusted Rand index, pairwise match rate, or similar) between the two partitions. All reported gains are measured only against ground-truth Pass@1, leaving open the possibility that FASE captures a different signal.
[Method / §3] Graph construction details are underspecified: the exact definitions of structural vs. semantic dissimilarity, the dissimilarity threshold for including edges, and the weight used to balance the two dissimilarity types are not stated (these appear among the free parameters). Without these, the method cannot be reproduced and the reported improvements cannot be assessed for robustness.
[Experiments] Statistical significance, confidence intervals, and controls for confounds (model choice, benchmark subset, embedding model) are not reported for the 25% Spearman and 19% ROCAUC figures, making it impossible to judge whether the gains are reliable or driven by particular experimental choices.

minor comments (1)

[Abstract] The abstract states quantitative improvements but omits any mention of the missing validation or parameter details; a short clarification sentence would help readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Evaluation / Results] The central claim that FASE MST clusters reliably approximate the functional equivalence classes obtained by LLM-driven entailment checks is load-bearing, yet the manuscript provides no quantitative agreement metric (adjusted Rand index, pairwise match rate, or similar) between the two partitions. All reported gains are measured only against ground-truth Pass@1, leaving open the possibility that FASE captures a different signal.

Authors: We agree that reporting a direct agreement metric between FASE-derived clusters and LLM-entailment partitions would strengthen the approximation claim. While the primary validation remains correlation with ground-truth functional correctness via Pass@1 (as this is the target signal for uncertainty quantification), we will add adjusted Rand index and pairwise agreement rates in the revised evaluation section. revision: yes
Referee: [Method / §3] Graph construction details are underspecified: the exact definitions of structural vs. semantic dissimilarity, the dissimilarity threshold for including edges, and the weight used to balance the two dissimilarity types are not stated (these appear among the free parameters). Without these, the method cannot be reproduced and the reported improvements cannot be assessed for robustness.

Authors: We acknowledge the underspecification. The revised manuscript will include precise mathematical definitions of the structural and semantic dissimilarity functions, the edge inclusion threshold, and the balancing weight hyperparameter, along with the specific values employed in all reported experiments. revision: yes
Referee: [Experiments] Statistical significance, confidence intervals, and controls for confounds (model choice, benchmark subset, embedding model) are not reported for the 25% Spearman and 19% ROCAUC figures, making it impossible to judge whether the gains are reliable or driven by particular experimental choices.

Authors: We will augment the experimental results with statistical significance tests (e.g., paired t-tests or Wilcoxon tests), bootstrap confidence intervals, and additional ablation controls varying the LLM, benchmark subsets, and embedding models to demonstrate robustness of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines FASE directly via construction of dissimilarity graphs from embeddings followed by MST computation, then reports empirical Spearman/ROCAUC correlations against independent ground-truth Pass@1 from test cases. No equations, fitted parameters, or self-citations are shown that would make the reported improvements reduce to the inputs by construction. The metric and its evaluation stand as an independent proposal without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about dissimilarity graphs proxying functional equivalence and likely requires choices in graph construction that function as free parameters.

free parameters (2)

weight balancing structural vs semantic dissimilarity
Required to combine the two graphs into a single dissimilarity measure for MST computation.
dissimilarity threshold for graph edges
Likely needed to decide which code pairs are connected in the graphs.

axioms (1)

domain assumption Structural and semantic dissimilarity graphs capture information relevant to functional equivalence of code
Invoked to justify using MST as a proxy for semantic entropy without LLM entailment.

pith-pipeline@v0.9.1-grok · 5724 in / 1350 out tokens · 35572 ms · 2026-06-27T15:16:44.123201+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 20 canonical work pages · 12 internal anchors

[1]

Agarwal, V., Pei, Y., Alamir, S., and Liu, X.Codemirage: Hallucinations in code generated by large language models.arXiv preprint arXiv:2408.08333 FASE: Fast Adaptive Semantic Entropy for Code Quality 17 (2024)

work page arXiv 2024
[2]

Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E.Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks, 2025

2025
[3]

J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates

Campello, R. J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates. InPacific-Asia conference on knowledge discovery and data mining(2013), Springer, pp. 160–172

2013
[4]

Cao, D., Hong, Y., Pan, Q., and Wu, J.Program interoperable large language model software testing scheme: a case study on javascript engine fuzzing.IEEE Transactions on Dependable and Secure Computing(2025)

2025
[5]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[6]

Chen, X., Liu, J., Zhang, Y., Hu, Q., Han, Y., Zhang, R., Ran, J., Yan, L., Huang, B., and Ma, S.Traceawareness and dual-strategy fuzz testing: Enhancing path coverage and crash localization with stochastic science and large language models.Computers and Electrical Engineering 123(2025), 110266

2025
[7]

Das, S., Deb, N., Chaki, N., and Cortesi, A.A multi-agent rag framework for regulatory compliance checking of software requirements.ACM Transactions on Software Engineering and Methodology(2025)

2025
[8]

Dong, Y., Jiang, X., Jin, Z., and Li, G.Self-collaboration code generation via chatgpt.ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–38

2024
[9]

Eghbali, A., and Pradel, M.De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701(2024)

work page arXiv 2024
[10]

In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol. 96, pp. 226–231

1996
[11]

K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., and Lahiri, S. K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

2024
[12]

Farqhar, S., Kossen, J., Kuhn, L., and Gal, Y.Detecting hallucinations in large language models using semantic entropy.Nature 630, 8017 (2024), 625–630

2024
[13]

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Ferrag, M. A., Tihanyi, N., and Debbah, M.From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

Foodeei, D., Fan, S., and Jaggi, M.Semantic uncertainty in advanced decoding methods for llm generation.arXiv preprint arXiv:2506.17296(2025)

work page arXiv 2025
[15]

Gagolewski, M., Cena, A., Bartoszuk, M., and Brzozowski, Ł.Clustering with minimum spanning trees: How good can it be?Journal of Classification 42, 1 (2025), 90–112

2025
[16]

Guo, D., Zhu, Q., Y ang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al.Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

He, J., Treude, C., and Lo, D.Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

2025
[18]

InProceedings of the International Conference on Learning Representations(2024), vol

Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Yau, S., Lin, Z., Zhou, L., et al.Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations(2024), vol. 2024, pp. 23247–23275

2024
[19]

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., W ang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems 43, 2 (2025), 1–55

2025
[20]

Qwen2.5-Coder Technical Report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Mistral 7B

Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P.Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics(2023), pp. 1827–1843. [22]Jiang, A. Q., Sablayrolles, A., Nadeau, N., Usunier, N., and Lample, G.Mistral 7b.arXiv preprint arXiv:2310.06825(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al.Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[23]

G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

Khojah, R., de Oliveira Neto, F. G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

2025
[24]

Kuhn, L., Gal, Y., and Farqhar, S.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

[27]Lindley, D

Liao, D., Pan, S., Sun, X., Ren, X., Huang, Q., Xing, Z., Jin, H., and Li, Q.A3-codgen: A repository-level code generation framework for code reuse with local-aware, global-aware, and third-party-library-aware.IEEE Transactions on Software Engineering 50, 12 (2024), 3369–3384. [27]Lindley, D. V.On a measure of the information provided by an experiment.The...

2024
[26]

Liu, F., Liu, Y., Shi, L., Huang, H., W ang, R., Y ang, Z., Zhang, L., Li, Z., and Ma, Y.Exploring and evaluating hallucinations in llm-powered code generation.arXiv preprint arXiv:2404.00971(2024). [29]Liu, F., Liu, Y., Shi, L., Y ang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y.Beyond functional correctness: Exploring hallucinations in llm-generated code...

work page arXiv 2024
[27]

18 Lin et al

Liu, J., Wang, K., Chen, Y., Peng, X., Chen, Z., Zhang, L., and Lou, Y.Large language model-based agents for software engineering: A survey. 18 Lin et al. ACM Transactions on Software Engineering and Methodology(2024)

2024
[28]

InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp

Liu, J., Zeng, J., W ang, X., and Liang, Z.Learning graph-based code representations for source-level functional similarity detection. InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp. 345–357

2023
[29]

D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

Liu, Y., Le-Cong, T., Widyasari, R., Tantithamthavorn, C., Li, L., Le, X.-B. D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

2024
[30]

F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

Liu, Z., Tang, Y., Luo, X., Zhou, Y., and Zhang, L. F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

2024
[31]

InFirst Conference on Language Modeling(2024)

Liu, Z., Zhang, Y., Li, P., Liu, Y., and Y ang, D.A dynamic llm-powered agent network for task-oriented agent collaboration. InFirst Conference on Language Modeling(2024)

2024
[32]

B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

Maveli, N., Vergari, A., and Cohen, S. B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

work page arXiv 2024
[33]

InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Mohammadi, M., Li, Y., Lo, J., and Yip, W.Evaluation and benchmarking of llm agents: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2(2025), pp. 6129–6139

2025
[34]

Pan, R., Zhang, H., and Liu, C.Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811 (2025)

work page arXiv 2025
[35]

Code Llama: Open Foundation Models for Code

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al.Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Song, Y., Sun, T., Tang, X., Rajput, P. K., Bissyandé, T. F., and Klein, J.Measuring llm code generation stability via structural entropy. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering(2025), IEEE, pp. 3922–3926. [40]Talebirad, Y., and Nadiri, A.Multi-agent collaboration: Harnessing the power of intelligent...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Wang, G., Xu, Q., Briand, L., and Liu, K.Mutation-guided unit test generation with a large language model.IEEE Transactions on Software Engineering(2026)

2026
[38]

W ang, W., Wei, F., Dong, L., Bao, H., Y ang, N., and Zhou, M.Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.Advances in neural information processing systems 33(2020), 5776–5788

2020
[39]

Wei, J., Y ao, Y., Ton, J.-F., Guo, H., Estornell, A., and Liu, Y.Measuring and reducing llm hallucination without gold-standard answers.arXiv preprint arXiv:2402.10412(2024)

work page arXiv 2024
[40]

Xu, Q., W ang, G., Briand, L., and Liu, K.Hallucination to consensus: Multi-agent llms for end-to-end junit test generation.ACM Transactions on Software Engineering and Methodology(2026)

2026
[41]

Y ang, B., Dang, J., Liu, H., and Jin, Z.Advancing llm-generated code reliability: A hybrid approach for hallucination detection.IEEE Transactions on Software Engineering(2025)

2025
[42]

Survey on Evaluation of LLM-based Agents

Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., and Shmueli-Scheuer, M.Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp

Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., et al.mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp. 1393–1412

2024
[44]

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Y ang, B., Xie, P., Y ang, A., Liu, D., Lin, J., Huang, F., and Zhou, J.Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.Siren’s song in the ai ocean: A survey on hallucination in large language models.Computational Linguistics(2025), 1–46

2025
[46]

Zhang, Z., Wang, C., Wang, Y., Shi, E., Ma, Y., Zhong, W., Chen, J., Mao, M., and Zheng, Z.Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.ACM on Software Engineering 2, ISSTA022 (2025), 481–503

2025
[47]

InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp

Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Shen, L., Wang, Z., Wang, A., Li, Y., et al.Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp. 5673–5684

2023
[48]

Zhu, K., Du, H., Hong, Z., Y ang, X., Guo, S., W ang, Z., W ang, Z., Qian, C., Tang, X., Ji, H., et al.Multiagentbench: Evaluating the collaboration and competition of llm agents.arXiv preprint arXiv:2503.01935(2025)

work page arXiv 2025
[49]

Zhu, Y., Liu, C., He, X., Ren, X., Liu, Z., Pan, R., and Zhang, H.Adacoder: An adaptive planning and multi-agent framework for function-level code generation.IEEE Transactions on Software Engineering(2025)

2025
[50]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T. Y., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al.Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024). Received 5 June 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Agarwal, V., Pei, Y., Alamir, S., and Liu, X.Codemirage: Hallucinations in code generated by large language models.arXiv preprint arXiv:2408.08333 FASE: Fast Adaptive Semantic Entropy for Code Quality 17 (2024)

work page arXiv 2024

[2] [2]

Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E.Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks, 2025

2025

[3] [3]

J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates

Campello, R. J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates. InPacific-Asia conference on knowledge discovery and data mining(2013), Springer, pp. 160–172

2013

[4] [4]

Cao, D., Hong, Y., Pan, Q., and Wu, J.Program interoperable large language model software testing scheme: a case study on javascript engine fuzzing.IEEE Transactions on Dependable and Secure Computing(2025)

2025

[5] [5]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Chen, X., Liu, J., Zhang, Y., Hu, Q., Han, Y., Zhang, R., Ran, J., Yan, L., Huang, B., and Ma, S.Traceawareness and dual-strategy fuzz testing: Enhancing path coverage and crash localization with stochastic science and large language models.Computers and Electrical Engineering 123(2025), 110266

2025

[7] [7]

Das, S., Deb, N., Chaki, N., and Cortesi, A.A multi-agent rag framework for regulatory compliance checking of software requirements.ACM Transactions on Software Engineering and Methodology(2025)

2025

[8] [8]

Dong, Y., Jiang, X., Jin, Z., and Li, G.Self-collaboration code generation via chatgpt.ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–38

2024

[9] [9]

Eghbali, A., and Pradel, M.De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701(2024)

work page arXiv 2024

[10] [10]

In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol

Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol. 96, pp. 226–231

1996

[11] [11]

K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., and Lahiri, S. K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

2024

[12] [12]

Farqhar, S., Kossen, J., Kuhn, L., and Gal, Y.Detecting hallucinations in large language models using semantic entropy.Nature 630, 8017 (2024), 625–630

2024

[13] [13]

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

Ferrag, M. A., Tihanyi, N., and Debbah, M.From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

Foodeei, D., Fan, S., and Jaggi, M.Semantic uncertainty in advanced decoding methods for llm generation.arXiv preprint arXiv:2506.17296(2025)

work page arXiv 2025

[15] [15]

Gagolewski, M., Cena, A., Bartoszuk, M., and Brzozowski, Ł.Clustering with minimum spanning trees: How good can it be?Journal of Classification 42, 1 (2025), 90–112

2025

[16] [16]

Guo, D., Zhu, Q., Y ang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al.Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[17] [17]

He, J., Treude, C., and Lo, D.Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

2025

[18] [18]

InProceedings of the International Conference on Learning Representations(2024), vol

Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Yau, S., Lin, Z., Zhou, L., et al.Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations(2024), vol. 2024, pp. 23247–23275

2024

[19] [19]

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., W ang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems 43, 2 (2025), 1–55

2025

[20] [20]

Qwen2.5-Coder Technical Report

Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[21] [21]

Mistral 7B

Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P.Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics(2023), pp. 1827–1843. [22]Jiang, A. Q., Sablayrolles, A., Nadeau, N., Usunier, N., and Lample, G.Mistral 7b.arXiv preprint arXiv:2310.06825(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al.Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[23] [23]

G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

Khojah, R., de Oliveira Neto, F. G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

2025

[24] [24]

Kuhn, L., Gal, Y., and Farqhar, S.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

[27]Lindley, D

Liao, D., Pan, S., Sun, X., Ren, X., Huang, Q., Xing, Z., Jin, H., and Li, Q.A3-codgen: A repository-level code generation framework for code reuse with local-aware, global-aware, and third-party-library-aware.IEEE Transactions on Software Engineering 50, 12 (2024), 3369–3384. [27]Lindley, D. V.On a measure of the information provided by an experiment.The...

2024

[26] [26]

Liu, F., Liu, Y., Shi, L., Huang, H., W ang, R., Y ang, Z., Zhang, L., Li, Z., and Ma, Y.Exploring and evaluating hallucinations in llm-powered code generation.arXiv preprint arXiv:2404.00971(2024). [29]Liu, F., Liu, Y., Shi, L., Y ang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y.Beyond functional correctness: Exploring hallucinations in llm-generated code...

work page arXiv 2024

[27] [27]

18 Lin et al

Liu, J., Wang, K., Chen, Y., Peng, X., Chen, Z., Zhang, L., and Lou, Y.Large language model-based agents for software engineering: A survey. 18 Lin et al. ACM Transactions on Software Engineering and Methodology(2024)

2024

[28] [28]

InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp

Liu, J., Zeng, J., W ang, X., and Liang, Z.Learning graph-based code representations for source-level functional similarity detection. InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp. 345–357

2023

[29] [29]

D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

Liu, Y., Le-Cong, T., Widyasari, R., Tantithamthavorn, C., Li, L., Le, X.-B. D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

2024

[30] [30]

F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

Liu, Z., Tang, Y., Luo, X., Zhou, Y., and Zhang, L. F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

2024

[31] [31]

InFirst Conference on Language Modeling(2024)

Liu, Z., Zhang, Y., Li, P., Liu, Y., and Y ang, D.A dynamic llm-powered agent network for task-oriented agent collaboration. InFirst Conference on Language Modeling(2024)

2024

[32] [32]

B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

Maveli, N., Vergari, A., and Cohen, S. B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

work page arXiv 2024

[33] [33]

InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

Mohammadi, M., Li, Y., Lo, J., and Yip, W.Evaluation and benchmarking of llm agents: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2(2025), pp. 6129–6139

2025

[34] [34]

Pan, R., Zhang, H., and Liu, C.Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811 (2025)

work page arXiv 2025

[35] [35]

Code Llama: Open Foundation Models for Code

Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al.Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950(2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Song, Y., Sun, T., Tang, X., Rajput, P. K., Bissyandé, T. F., and Klein, J.Measuring llm code generation stability via structural entropy. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering(2025), IEEE, pp. 3922–3926. [40]Talebirad, Y., and Nadiri, A.Multi-agent collaboration: Harnessing the power of intelligent...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Wang, G., Xu, Q., Briand, L., and Liu, K.Mutation-guided unit test generation with a large language model.IEEE Transactions on Software Engineering(2026)

2026

[38] [38]

W ang, W., Wei, F., Dong, L., Bao, H., Y ang, N., and Zhou, M.Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.Advances in neural information processing systems 33(2020), 5776–5788

2020

[39] [39]

Wei, J., Y ao, Y., Ton, J.-F., Guo, H., Estornell, A., and Liu, Y.Measuring and reducing llm hallucination without gold-standard answers.arXiv preprint arXiv:2402.10412(2024)

work page arXiv 2024

[40] [40]

Xu, Q., W ang, G., Briand, L., and Liu, K.Hallucination to consensus: Multi-agent llms for end-to-end junit test generation.ACM Transactions on Software Engineering and Methodology(2026)

2026

[41] [41]

Y ang, B., Dang, J., Liu, H., and Jin, Z.Advancing llm-generated code reliability: A hybrid approach for hallucination detection.IEEE Transactions on Software Engineering(2025)

2025

[42] [42]

Survey on Evaluation of LLM-based Agents

Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., and Shmueli-Scheuer, M.Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp

Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., et al.mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp. 1393–1412

2024

[44] [44]

Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Y ang, B., Xie, P., Y ang, A., Liu, D., Lin, J., Huang, F., and Zhou, J.Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[45] [45]

Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.Siren’s song in the ai ocean: A survey on hallucination in large language models.Computational Linguistics(2025), 1–46

2025

[46] [46]

Zhang, Z., Wang, C., Wang, Y., Shi, E., Ma, Y., Zhong, W., Chen, J., Mao, M., and Zheng, Z.Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.ACM on Software Engineering 2, ISSTA022 (2025), 481–503

2025

[47] [47]

InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp

Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Shen, L., Wang, Z., Wang, A., Li, Y., et al.Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp. 5673–5684

2023

[48] [48]

Zhu, K., Du, H., Hong, Z., Y ang, X., Guo, S., W ang, Z., W ang, Z., Qian, C., Tang, X., Ji, H., et al.Multiagentbench: Evaluating the collaboration and competition of llm agents.arXiv preprint arXiv:2503.01935(2025)

work page arXiv 2025

[49] [49]

Zhu, Y., Liu, C., He, X., Ren, X., Liu, Z., Pan, R., and Zhang, H.Adacoder: An adaptive planning and multi-agent framework for function-level code generation.IEEE Transactions on Software Engineering(2025)

2025

[50] [50]

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Zhuo, T. Y., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al.Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024). Received 5 June 2026

work page internal anchor Pith review Pith/arXiv arXiv 2024