Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Pith reviewed 2026-05-09 21:11 UTC · model grok-4.3
The pith
Ambiguous requirements degrade the correctness of code generated by all tested LLMs and cause them to produce inconsistent implementations for the same task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Orchid, the first benchmark of function-level code-generation tasks built around ambiguous requirements. It covers 1,304 tasks across four ambiguity categories: lexical, syntactic, semantic, and vagueness. Systematic evaluation of multiple LLMs on Orchid shows that ambiguity lowers pass rates for every model, with the largest absolute declines appearing in the most capable models. For any given ambiguous requirement the models frequently emit functionally different implementations, and they do not detect or resolve the ambiguity without external help.
What carries the argument
The Orchid benchmark of 1,304 function-level tasks, each constructed to contain one of four explicit ambiguity types, used to compare LLM outputs under clear versus uncertain natural-language specifications.
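To make that comparison concrete, here is a minimal sketch of how a paired clear-versus-ambiguous evaluation over such tasks could be scored. The record fields (clear_prompt, ambiguous_prompt, tests) and the generate and passes callables are hypothetical stand-ins, not Orchid's published schema or the paper's actual harness:

from typing import Callable

def pass_rate_gap(tasks: list[dict],
                  generate: Callable[[str], str],      # prompt -> candidate implementation
                  passes: Callable[[str, list], bool]  # (code, tests) -> all tests pass?
                  ) -> dict[str, float]:
    """Pass rates for one model on clear vs. ambiguous phrasings of the same tasks."""
    clear_hits = ambiguous_hits = 0
    for task in tasks:
        clear_hits += passes(generate(task["clear_prompt"]), task["tests"])
        ambiguous_hits += passes(generate(task["ambiguous_prompt"]), task["tests"])
    n = len(tasks)
    return {
        "clear_pass_rate": clear_hits / n,
        "ambiguous_pass_rate": ambiguous_hits / n,
        "absolute_drop": (clear_hits - ambiguous_hits) / n,
    }

On the paper's account, absolute_drop is positive for every evaluated model and largest for the most capable ones.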
If this is right
- Every LLM tested loses accuracy when requirements contain ambiguity, so prompt-engineering or fine-tuning that assumes perfect clarity will overestimate real-world performance.
- Stronger models exhibit larger absolute losses, indicating that scaling alone does not confer robustness to unclear specifications.
- LLMs produce multiple functionally distinct implementations for the same ambiguous prompt, which raises the risk of silent failures in automated pipelines (one way to quantify this divergence is sketched after this list).
- Current models cannot autonomously flag or resolve ambiguity, so any reliable system must add an external clarification step.
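The divergence noted in the third point can be made measurable. A minimal sketch, assuming a sandboxed run_candidate executor and a shared list of probe inputs (both hypothetical, not defined in the paper text provided), groups repeated samples for one ambiguous prompt by their observable behavior:

from collections import Counter

def behavior_signature(run_candidate, code: str, probe_inputs: list) -> tuple:
    """Fingerprint an implementation by its outputs (or exception type) on probe inputs."""
    sig = []
    for x in probe_inputs:
        try:
            sig.append(repr(run_candidate(code, x)))
        except Exception as exc:  # differing failure modes also count as divergence
            sig.append(type(exc).__name__)
    return tuple(sig)

def divergence(run_candidate, samples: list[str], probe_inputs: list) -> float:
    """Fraction of samples outside the largest behavioral cluster (0.0 means unanimous)."""
    clusters = Counter(behavior_signature(run_candidate, s, probe_inputs) for s in samples)
    return 1.0 - clusters.most_common(1)[0][1] / len(samples)

Under a measure like this, the paper's finding corresponds to divergence staying well above zero when the prompt is ambiguous.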
Where Pith is reading between the lines
- Development teams that rely on LLMs for code generation will need explicit requirement-review stages before generation to avoid the observed performance penalty.
- Future benchmarks and evaluation suites for code LLMs should include ambiguous inputs as a standard stress test rather than treating clarity as the default.
- The observed divergence among outputs suggests that ensemble or voting methods may be less effective when the input itself is underspecified.
- Techniques that let an LLM request clarification from a human or from additional context sources could close part of the gap shown by Orchid.
Load-bearing premise
The four types of ambiguity built into Orchid reflect the ambiguities that actually appear in real software projects, and functional correctness can still be judged reliably even when the requirement statement is ambiguous.
What would settle it
Re-running the same models on a fresh collection of requirements taken directly from open-source issue trackers or industrial specifications, without any artificial ambiguity injection, and finding no drop in pass rates or no increase in output divergence.
Original abstract
Software requirement ambiguity is ubiquitous in real-world development, stemming from the inherent imprecision of natural language and the varying interpretations of stakeholders. While Large Language Models (LLMs) have demonstrated impressive capabilities in generating code from precise specifications, such ambiguity poses a significant obstacle to reliable automated code generation. Existing benchmarks typically assume clear and unambiguous requirements, leaving an empirical gap in understanding how LLMs behave when faced with the inherent uncertainty of real-world software requirements. In this paper, we introduce Orchid, the first code generation benchmark specifically designed with ambiguous requirements. It comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Leveraging this dataset, we conduct the first systematic empirical study to evaluate the impact of requirement ambiguity on LLM-based code generation. Our results demonstrate that ambiguity consistently degrades the performance of all evaluated LLMs, with the most pronounced negative effects observed in highly advanced models. Furthermore, we observe that LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and lack the capability to identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the urgent need for ambiguity-aware techniques in the next generation of automated software engineering tools. The Orchid benchmark is publicly available at https://huggingface.co/datasets/SII-YDD/Orchid.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Orchid, the first benchmark for LLM-based function-level code generation under ambiguous requirements, comprising 1,304 tasks across lexical, syntactic, semantic, and vagueness ambiguity types. Through systematic evaluation, it claims that ambiguity consistently degrades LLM performance (most pronounced in advanced models), that LLMs frequently generate functionally divergent implementations for the same ambiguous requirement, and that they lack the ability to autonomously identify or resolve such ambiguities. The benchmark is publicly released.
Significance. If the empirical results hold under rigorous validation of correctness measurement, this work fills a notable gap in existing code-generation benchmarks that assume precise specifications. The public release of Orchid enables reproducibility and follow-on research, while the finding of greater degradation in stronger models could usefully inform priorities for ambiguity-aware tooling in automated software engineering.
major comments (1)
- [Experimental Evaluation] The headline result of consistent performance degradation (and the claim that advanced models suffer most) depends on how functional correctness is scored for ambiguous requirements. The evaluation must demonstrate that pass/fail labels remain stable across plausible stakeholder interpretations rather than relying on a single reference implementation or narrow test suite derived from one disambiguation; otherwise the measured gap may be artifactual. This needs explicit treatment in the experimental setup and results sections, including any multi-interpretation validation performed.
minor comments (2)
- [Dataset Construction] Clarify the exact procedure used to construct the four ambiguity types in Orchid, including how task realism and the original functional intent were preserved where possible.
- [Results] Add statistical significance testing (e.g., paired tests or confidence intervals) for the reported performance differences between clear and ambiguous conditions; a sketch of one such paired test follows this list.
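A minimal sketch of the kind of paired test meant here, assuming per-task pass/fail outcomes are available for both the clear and the ambiguous variant of each task. An exact McNemar test uses only the discordant pairs; the names below are illustrative rather than taken from the paper:

from math import comb

def mcnemar_exact(clear_pass: list[bool], ambiguous_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired clear-vs-ambiguous pass/fail outcomes."""
    b = sum(cl and not am for cl, am in zip(clear_pass, ambiguous_pass))  # pass only when clear
    c = sum(am and not cl for cl, am in zip(clear_pass, ambiguous_pass))  # pass only when ambiguous
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Binomial(n, 0.5)
    return min(1.0, 2 * tail)

Bootstrap confidence intervals on the pass-rate difference would serve the same purpose.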
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our experimental design. We address the major comment below.
Point-by-point responses
Referee: [Experimental Evaluation] The headline result of consistent performance degradation (and the claim that advanced models suffer most) depends on how functional correctness is scored for ambiguous requirements. The evaluation must demonstrate that pass/fail labels remain stable across plausible stakeholder interpretations rather than relying on a single reference implementation or narrow test suite derived from one disambiguation; otherwise the measured gap may be artifactual. This needs explicit treatment in the experimental setup and results sections, including any multi-interpretation validation performed.
Authors: We agree that demonstrating the stability of pass/fail labels across plausible interpretations is essential to substantiate the headline claims. In the Orchid benchmark, test cases for each ambiguous requirement were constructed to target the intersection of core behaviors that should hold under multiple reasonable stakeholder interpretations for the given ambiguity type, rather than a single reference implementation. However, we acknowledge that the original manuscript did not provide an explicit sensitivity analysis or multi-interpretation validation to quantify label stability. We will revise the Experimental Setup section to detail the test-case design process and add a new analysis in the Results section reporting pass-rate variance across alternative interpretation-derived test suites. This revision will directly address the concern and strengthen the empirical foundation.
Revision: yes
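A minimal sketch of the pass-rate-variance analysis promised in the response, assuming each ambiguous task can be paired with several test suites derived from different plausible interpretations and scored through a passes(code, suite) oracle; all names are hypothetical:

from statistics import mean, pstdev

def label_stability(solutions: list[str], interpretation_suites: list[list], passes) -> dict[str, float]:
    """How much the pass rate for one ambiguous task moves across interpretation-derived suites."""
    rates = [
        sum(passes(sol, suite) for sol in solutions) / len(solutions)
        for suite in interpretation_suites
    ]
    return {
        "mean_pass_rate": mean(rates),
        "pass_rate_stdev": pstdev(rates),  # spread of pass rates over interpretations
        "max_minus_min": max(rates) - min(rates),
    }

Small spreads across suites would support the claim that the measured clear-versus-ambiguous gap is not an artifact of a single disambiguation.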
Circularity Check
No circularity: empirical evaluation on released benchmark
full rationale
The paper introduces Orchid as a new benchmark with 1,304 tasks across four ambiguity types and reports direct empirical results from evaluating multiple LLMs on clear versus ambiguous versions of the same tasks. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Claims rest on observable pass/fail rates and output divergence measured against the released dataset, with no reduction of results to self-definitional inputs and no prior work by the authors invoked as a uniqueness argument. The analysis is self-contained rather than dependent on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Functional correctness of generated code can be assessed reliably even when the original requirement is ambiguous.