pith. sign in

arxiv: 2606.09800 · v1 · pith:XNFOMMQRnew · submitted 2026-06-08 · 💻 cs.SE · cs.AI· cs.MA

FASE: Fast Adaptive Semantic Entropy for Code Quality

Pith reviewed 2026-06-27 15:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.MA
keywords semantic entropycode generationfunctional correctnessminimum spanning treeuncertainty quantificationmulti-agent systemsLLM evaluationcode quality
0
0 comments X

The pith

FASE approximates functional correctness of generated code using minimum spanning trees on dissimilarity graphs, delivering 25% better correlation to ground truth at 0.3% of the computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fast Adaptive Semantic Entropy (FASE) as a way to measure uncertainty in multi-agent code generation without relying on expensive LLM-based equivalence checks. It builds structural and semantic dissimilarity graphs from code embeddings and uses the minimum spanning tree to approximate functional equivalence classes. This approach is evaluated on HumanEval and BigCodeBench, showing improved correlation with actual test case results compared to traditional methods. The low cost makes it suitable for real-time use in autonomous software development systems.

Core claim

FASE approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. When using the Qwen3-Embedding-8B model, it achieves a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases, while requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches.

What carries the argument

Minimum spanning tree of structural and semantic dissimilarity graphs built from embeddings, which groups code outputs into approximate functional equivalence classes without LLM entailment checks.

If this is right

  • FASE can replace costly semantic entropy calculations in multi-agent workflows for better reliability.
  • It enables faster uncertainty quantification for optimizing code generation agents.
  • Improved metrics allow better detection of hallucinations in LLM-generated code.
  • Scales to larger benchmarks like BigCodeBench with minimal overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other domains beyond code, such as natural language generation where equivalence is hard to check.
  • Combining FASE with other cheap metrics might further improve accuracy without added cost.
  • Testing on more diverse LLMs beyond Qwen3-Embedding-8B would validate generalizability.

Load-bearing premise

The minimum spanning tree from embedding-based dissimilarity graphs accurately captures the same functional equivalence groups that LLM entailment checks would identify.

What would settle it

Run both FASE and traditional LLM-entailment semantic entropy on the same set of code samples and measure how often their equivalence groupings differ; if they frequently disagree on which outputs are equivalent, the approximation fails.

Figures

Figures reproduced from arXiv: 2606.09800 by Ladan Tahvildari, Shizhe Lin.

Figure 1
Figure 1. Figure 1: The risk of hallucination from unreliable agents (red) undermines the credibility of coherent agents (blue) and propagates [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The workflow of computing encoder-based semantic entropy for the samples of code generated for a given task. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pass@1 distribution of code generated by Mistral, CodeLlama, DeepseekCoder, Qwen2.5-Coder on HumanEval and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The ratio of mean distance in Pairwise Distance Matrix (PDM) and mean edge weight in MST categorized by their connected [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The ratio between mean distance in Pairwise Distance Matrix (PDM) and mean edge weight in MST categorized for tasks with [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The Spearman’s 𝜌 correlation measured between the semantic entropy of each method and the pass@1 results when only the coder is involved [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The Spearman’s 𝜌 correlation measured between the semantic entropy of each method and the pass@1 results when the analyst agent provides additional instruction to the coder The adaptive clustering approaches exhibit stronger correlation when an analyst agent is introduced before the coder agent. As shown in [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
read the original abstract

Multi-agent code generation offers a promising paradigm for autonomous software development by simulating the human software engineering lifecycle. However, system reliability remains hindered by LLM hallucinations and error propagation across interacting agents. While semantic entropy provides a principled way to quantify uncertainty without ground-truth answers, current methods often rely on costly LLM-driven equivalence checks. In this work, we introduce Fast Adaptive Semantic Entropy (FASE), a novel metric that approximates functional correctness based on the minimum spanning tree of structural and semantic dissimilarity graphs. Evaluations on HumanEval and BigCodeBench demonstrate that FASE outperforms state-of-the-art semantic entropy by LLM entailment, achieving a 25% average improvement in Spearman correlation and a 19% increase in ROCAUC score against Pass@1 from ground-truth test cases when using the Qwen3-Embedding-8B model. Furthermore, by eliminating costly LLM-driven equivalence evaluation, FASE incurs negligible computational overhead, requiring only approximately 0.3% of the runtime cost of traditional semantic entropy approaches. These results position FASE as a practical, cost-effective solution for optimizing uncertainty quantification in real-world multi-agent workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Fast Adaptive Semantic Entropy (FASE), which approximates functional correctness and semantic entropy for LLM-generated code via minimum spanning trees computed on structural and semantic dissimilarity graphs built from embeddings. On HumanEval and BigCodeBench it reports a 25% average improvement in Spearman correlation and 19% increase in ROCAUC (vs. Pass@1) over LLM-entailment semantic entropy while using only ~0.3% of the runtime, positioning FASE as a low-cost alternative for multi-agent code generation uncertainty quantification.

Significance. If the core approximation is validated, FASE would supply a practical, scalable substitute for expensive LLM-based equivalence checks in uncertainty estimation for code, directly addressing reliability issues in multi-agent workflows. The efficiency claim is notable and the embedding-based proxy idea is a reasonable direction, but the significance hinges on whether the MST clusters faithfully stand in for entailment-derived functional equivalence classes.

major comments (3)
  1. [Evaluation / Results] The central claim that FASE MST clusters reliably approximate the functional equivalence classes obtained by LLM-driven entailment checks is load-bearing, yet the manuscript provides no quantitative agreement metric (adjusted Rand index, pairwise match rate, or similar) between the two partitions. All reported gains are measured only against ground-truth Pass@1, leaving open the possibility that FASE captures a different signal.
  2. [Method / §3] Graph construction details are underspecified: the exact definitions of structural vs. semantic dissimilarity, the dissimilarity threshold for including edges, and the weight used to balance the two dissimilarity types are not stated (these appear among the free parameters). Without these, the method cannot be reproduced and the reported improvements cannot be assessed for robustness.
  3. [Experiments] Statistical significance, confidence intervals, and controls for confounds (model choice, benchmark subset, embedding model) are not reported for the 25% Spearman and 19% ROCAUC figures, making it impossible to judge whether the gains are reliable or driven by particular experimental choices.
minor comments (1)
  1. [Abstract] The abstract states quantitative improvements but omits any mention of the missing validation or parameter details; a short clarification sentence would help readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Evaluation / Results] The central claim that FASE MST clusters reliably approximate the functional equivalence classes obtained by LLM-driven entailment checks is load-bearing, yet the manuscript provides no quantitative agreement metric (adjusted Rand index, pairwise match rate, or similar) between the two partitions. All reported gains are measured only against ground-truth Pass@1, leaving open the possibility that FASE captures a different signal.

    Authors: We agree that reporting a direct agreement metric between FASE-derived clusters and LLM-entailment partitions would strengthen the approximation claim. While the primary validation remains correlation with ground-truth functional correctness via Pass@1 (as this is the target signal for uncertainty quantification), we will add adjusted Rand index and pairwise agreement rates in the revised evaluation section. revision: yes

  2. Referee: [Method / §3] Graph construction details are underspecified: the exact definitions of structural vs. semantic dissimilarity, the dissimilarity threshold for including edges, and the weight used to balance the two dissimilarity types are not stated (these appear among the free parameters). Without these, the method cannot be reproduced and the reported improvements cannot be assessed for robustness.

    Authors: We acknowledge the underspecification. The revised manuscript will include precise mathematical definitions of the structural and semantic dissimilarity functions, the edge inclusion threshold, and the balancing weight hyperparameter, along with the specific values employed in all reported experiments. revision: yes

  3. Referee: [Experiments] Statistical significance, confidence intervals, and controls for confounds (model choice, benchmark subset, embedding model) are not reported for the 25% Spearman and 19% ROCAUC figures, making it impossible to judge whether the gains are reliable or driven by particular experimental choices.

    Authors: We will augment the experimental results with statistical significance tests (e.g., paired t-tests or Wilcoxon tests), bootstrap confidence intervals, and additional ablation controls varying the LLM, benchmark subsets, and embedding models to demonstrate robustness of the reported gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper defines FASE directly via construction of dissimilarity graphs from embeddings followed by MST computation, then reports empirical Spearman/ROCAUC correlations against independent ground-truth Pass@1 from test cases. No equations, fitted parameters, or self-citations are shown that would make the reported improvements reduce to the inputs by construction. The metric and its evaluation stand as an independent proposal without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on domain assumptions about dissimilarity graphs proxying functional equivalence and likely requires choices in graph construction that function as free parameters.

free parameters (2)
  • weight balancing structural vs semantic dissimilarity
    Required to combine the two graphs into a single dissimilarity measure for MST computation.
  • dissimilarity threshold for graph edges
    Likely needed to decide which code pairs are connected in the graphs.
axioms (1)
  • domain assumption Structural and semantic dissimilarity graphs capture information relevant to functional equivalence of code
    Invoked to justify using MST as a proxy for semantic entropy without LLM entailment.

pith-pipeline@v0.9.1-grok · 5724 in / 1350 out tokens · 35572 ms · 2026-06-27T15:16:44.123201+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 20 canonical work pages · 12 internal anchors

  1. [1]

    Agarwal, V., Pei, Y., Alamir, S., and Liu, X.Codemirage: Hallucinations in code generated by large language models.arXiv preprint arXiv:2408.08333 FASE: Fast Adaptive Semantic Entropy for Code Quality 17 (2024)

  2. [2]

    Babakhin, Y., Osmulski, R., Ak, R., Moreira, G., Xu, M., Schifferer, B., Liu, B., and Oldridge, E.Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks, 2025

  3. [3]

    J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates

    Campello, R. J., Moulavi, D., and Sander, J.Density-based clustering based on hierarchical density estimates. InPacific-Asia conference on knowledge discovery and data mining(2013), Springer, pp. 160–172

  4. [4]

    Cao, D., Hong, Y., Pan, Q., and Wu, J.Program interoperable large language model software testing scheme: a case study on javascript engine fuzzing.IEEE Transactions on Dependable and Secure Computing(2025)

  5. [5]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al.Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  6. [6]

    Chen, X., Liu, J., Zhang, Y., Hu, Q., Han, Y., Zhang, R., Ran, J., Yan, L., Huang, B., and Ma, S.Traceawareness and dual-strategy fuzz testing: Enhancing path coverage and crash localization with stochastic science and large language models.Computers and Electrical Engineering 123(2025), 110266

  7. [7]

    Das, S., Deb, N., Chaki, N., and Cortesi, A.A multi-agent rag framework for regulatory compliance checking of software requirements.ACM Transactions on Software Engineering and Methodology(2025)

  8. [8]

    Dong, Y., Jiang, X., Jin, Z., and Li, G.Self-collaboration code generation via chatgpt.ACM Transactions on Software Engineering and Methodology 33, 7 (2024), 1–38

  9. [9]

    Eghbali, A., and Pradel, M.De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701(2024)

  10. [10]

    In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol

    Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining(1996), vol. 96, pp. 226–231

  11. [11]

    K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

    Fakhoury, S., Naik, A., Sakkas, G., Chakraborty, S., and Lahiri, S. K.Llm-based test-driven interactive code generation: User study and empirical evaluation.IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268

  12. [12]

    Farqhar, S., Kossen, J., Kuhn, L., and Gal, Y.Detecting hallucinations in large language models using semantic entropy.Nature 630, 8017 (2024), 625–630

  13. [13]

    From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    Ferrag, M. A., Tihanyi, N., and Debbah, M.From llm reasoning to autonomous ai agents: A comprehensive review.arXiv preprint arXiv:2504.19678 (2025)

  14. [14]

    Foodeei, D., Fan, S., and Jaggi, M.Semantic uncertainty in advanced decoding methods for llm generation.arXiv preprint arXiv:2506.17296(2025)

  15. [15]

    Gagolewski, M., Cena, A., Bartoszuk, M., and Brzozowski, Ł.Clustering with minimum spanning trees: How good can it be?Journal of Classification 42, 1 (2025), 90–112

  16. [16]

    Guo, D., Zhu, Q., Y ang, D., Xie, Z., Dong, K., Zhang, W., Chen, G., Bi, X., Wu, Y., Li, Y., et al.Deepseek-coder: When the large language model meets programming–the rise of code intelligence.arXiv preprint arXiv:2401.14196(2024)

  17. [17]

    He, J., Treude, C., and Lo, D.Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead.ACM Transactions on Software Engineering and Methodology 34, 5 (2025), 1–30

  18. [18]

    InProceedings of the International Conference on Learning Representations(2024), vol

    Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Yau, S., Lin, Z., Zhou, L., et al.Metagpt: Meta programming for a multi-agent collaborative framework. InProceedings of the International Conference on Learning Representations(2024), vol. 2024, pp. 23247–23275

  19. [19]

    Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., W ang, H., Chen, Q., Peng, W., Feng, X., Qin, B., et al.A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems 43, 2 (2025), 1–55

  20. [20]

    Qwen2.5-Coder Technical Report

    Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Lu, K., et al.Qwen2. 5-coder technical report.arXiv preprint arXiv:2409.12186(2024)

  21. [21]

    Mistral 7B

    Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P.Towards mitigating llm hallucination via self reflection. InFindings of the Association for Computational Linguistics(2023), pp. 1827–1843. [22]Jiang, A. Q., Sablayrolles, A., Nadeau, N., Usunier, N., and Lample, G.Mistral 7b.arXiv preprint arXiv:2310.06825(2023)

  22. [22]

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al.Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221(2022)

  23. [23]

    G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

    Khojah, R., de Oliveira Neto, F. G., Mohamad, M., and Leitner, P.The impact of prompt programming on function-level code generation.IEEE Transactions on Software Engineering(2025)

  24. [24]

    Kuhn, L., Gal, Y., and Farqhar, S.Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

  25. [25]

    [27]Lindley, D

    Liao, D., Pan, S., Sun, X., Ren, X., Huang, Q., Xing, Z., Jin, H., and Li, Q.A3-codgen: A repository-level code generation framework for code reuse with local-aware, global-aware, and third-party-library-aware.IEEE Transactions on Software Engineering 50, 12 (2024), 3369–3384. [27]Lindley, D. V.On a measure of the information provided by an experiment.The...

  26. [26]

    Liu, F., Liu, Y., Shi, L., Huang, H., W ang, R., Y ang, Z., Zhang, L., Li, Z., and Ma, Y.Exploring and evaluating hallucinations in llm-powered code generation.arXiv preprint arXiv:2404.00971(2024). [29]Liu, F., Liu, Y., Shi, L., Y ang, Z., Zhang, L., Lian, X., Li, Z., and Ma, Y.Beyond functional correctness: Exploring hallucinations in llm-generated code...

  27. [27]

    18 Lin et al

    Liu, J., Wang, K., Chen, Y., Peng, X., Chen, Z., Zhang, L., and Lou, Y.Large language model-based agents for software engineering: A survey. 18 Lin et al. ACM Transactions on Software Engineering and Methodology(2024)

  28. [28]

    InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp

    Liu, J., Zeng, J., W ang, X., and Liang, Z.Learning graph-based code representations for source-level functional similarity detection. InProceedings of the 45th International Conference on Software Engineering(2023), IEEE, pp. 345–357

  29. [29]

    D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

    Liu, Y., Le-Cong, T., Widyasari, R., Tantithamthavorn, C., Li, L., Le, X.-B. D., and Lo, D.Refining chatgpt-generated code: Characterizing and mitigating code quality issues.ACM Transactions on Software Engineering and Methodology 33, 5 (2024), 1–26

  30. [30]

    F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

    Liu, Z., Tang, Y., Luo, X., Zhou, Y., and Zhang, L. F.No need to lift a finger anymore? assessing the quality of code generation by chatgpt.IEEE Transactions on Software Engineering 50, 6 (2024), 1548–1584

  31. [31]

    InFirst Conference on Language Modeling(2024)

    Liu, Z., Zhang, Y., Li, P., Liu, Y., and Y ang, D.A dynamic llm-powered agent network for task-oriented agent collaboration. InFirst Conference on Language Modeling(2024)

  32. [32]

    B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

    Maveli, N., Vergari, A., and Cohen, S. B.What can large language models capture about code functional equivalence?arXiv preprint arXiv:2408.11081(2024)

  33. [33]

    InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V

    Mohammadi, M., Li, Y., Lo, J., and Yip, W.Evaluation and benchmarking of llm agents: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2(2025), pp. 6129–6139

  34. [34]

    Pan, R., Zhang, H., and Liu, C.Codecor: An llm-based self-reflective multi-agent framework for code generation.arXiv preprint arXiv:2501.07811 (2025)

  35. [35]

    Code Llama: Open Foundation Models for Code

    Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Sauvestre, R., Remez, T., et al.Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950(2023)

  36. [36]

    Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

    Song, Y., Sun, T., Tang, X., Rajput, P. K., Bissyandé, T. F., and Klein, J.Measuring llm code generation stability via structural entropy. In Proceedings of the 40th IEEE/ACM International Conference on Automated Software Engineering(2025), IEEE, pp. 3922–3926. [40]Talebirad, Y., and Nadiri, A.Multi-agent collaboration: Harnessing the power of intelligent...

  37. [37]

    Wang, G., Xu, Q., Briand, L., and Liu, K.Mutation-guided unit test generation with a large language model.IEEE Transactions on Software Engineering(2026)

  38. [38]

    W ang, W., Wei, F., Dong, L., Bao, H., Y ang, N., and Zhou, M.Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers.Advances in neural information processing systems 33(2020), 5776–5788

  39. [39]

    Wei, J., Y ao, Y., Ton, J.-F., Guo, H., Estornell, A., and Liu, Y.Measuring and reducing llm hallucination without gold-standard answers.arXiv preprint arXiv:2402.10412(2024)

  40. [40]

    Xu, Q., W ang, G., Briand, L., and Liu, K.Hallucination to consensus: Multi-agent llms for end-to-end junit test generation.ACM Transactions on Software Engineering and Methodology(2026)

  41. [41]

    Y ang, B., Dang, J., Liu, H., and Jin, Z.Advancing llm-generated code reliability: A hybrid approach for hallucination detection.IEEE Transactions on Software Engineering(2025)

  42. [42]

    Survey on Evaluation of LLM-based Agents

    Yehudai, A., Eden, L., Li, A., Uziel, G., Zhao, Y., Bar-Haim, R., Cohan, A., and Shmueli-Scheuer, M.Survey on evaluation of llm-based agents. arXiv preprint arXiv:2503.16416(2025)

  43. [43]

    InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp

    Zhang, X., Zhang, Y., Long, D., Xie, W., Dai, Z., Tang, J., Lin, H., Yang, B., Xie, P., Huang, F., et al.mgte: Generalized long-context text representation and reranking models for multilingual text retrieval. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track(2024), pp. 1393–1412

  44. [44]

    Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Y ang, B., Xie, P., Y ang, A., Liu, D., Lin, J., Huang, F., and Zhou, J.Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176(2025)

  45. [45]

    Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., Huang, X., Zhao, E., Zhang, Y., Chen, Y., et al.Siren’s song in the ai ocean: A survey on hallucination in large language models.Computational Linguistics(2025), 1–46

  46. [46]

    Zhang, Z., Wang, C., Wang, Y., Shi, E., Ma, Y., Zhong, W., Chen, J., Mao, M., and Zheng, Z.Llm hallucinations in practical code generation: Phenomena, mechanism, and mitigation.ACM on Software Engineering 2, ISSTA022 (2025), 481–503

  47. [47]

    InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp

    Zheng, Q., Xia, X., Zou, X., Dong, Y., Wang, S., Xue, Y., Shen, L., Wang, Z., Wang, A., Li, Y., et al.Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining(2023), pp. 5673–5684

  48. [48]

    Zhu, K., Du, H., Hong, Z., Y ang, X., Guo, S., W ang, Z., W ang, Z., Qian, C., Tang, X., Ji, H., et al.Multiagentbench: Evaluating the collaboration and competition of llm agents.arXiv preprint arXiv:2503.01935(2025)

  49. [49]

    Zhu, Y., Liu, C., He, X., Ren, X., Liu, Z., Pan, R., and Zhang, H.Adacoder: An adaptive planning and multi-agent framework for function-level code generation.IEEE Transactions on Software Engineering(2025)

  50. [50]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Zhuo, T. Y., Vu, M. C., Chim, J., Hu, H., Yu, W., Widyasari, R., Yusuf, I. N. B., Zhan, H., He, J., Paul, I., et al.Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions.arXiv preprint arXiv:2406.15877(2024). Received 5 June 2026