Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Pith reviewed 2026-05-09 21:11 UTC · model grok-4.3
The pith
Ambiguous requirements degrade the correctness of code generated by all tested LLMs and cause them to produce inconsistent implementations for the same task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Orchid, the first benchmark of function-level code-generation tasks built around ambiguous requirements. It covers 1,304 tasks across four ambiguity categories: lexical, syntactic, semantic, and vagueness. Systematic evaluation of multiple LLMs on Orchid shows that ambiguity lowers pass rates for every model, with the largest absolute declines appearing in the most capable models. For any given ambiguous requirement the models frequently emit functionally different implementations, and they do not detect or resolve the ambiguity without external help.
What carries the argument
The Orchid benchmark of 1,304 function-level tasks, each constructed to contain one of four explicit ambiguity types, used to compare LLM outputs under clear versus uncertain natural-language specifications.
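To make that comparison concrete, here is a minimal sketch of how a paired clear-versus-ambiguous evaluation over such tasks could be scored. The record fields (clear_prompt, ambiguous_prompt, tests) and the generate and passes callables are hypothetical stand-ins, not Orchid's published schema or the paper's actual harness:

from typing import Callable

def pass_rate_gap(tasks: list[dict],
                  generate: Callable[[str], str],      # prompt -> candidate implementation
                  passes: Callable[[str, list], bool]  # (code, tests) -> all tests pass?
                  ) -> dict[str, float]:
    """Pass rates for one model on clear vs. ambiguous phrasings of the same tasks."""
    clear_hits = ambiguous_hits = 0
    for task in tasks:
        clear_hits += passes(generate(task["clear_prompt"]), task["tests"])
        ambiguous_hits += passes(generate(task["ambiguous_prompt"]), task["tests"])
    n = len(tasks)
    return {
        "clear_pass_rate": clear_hits / n,
        "ambiguous_pass_rate": ambiguous_hits / n,
        "absolute_drop": (clear_hits - ambiguous_hits) / n,
    }

On the paper's account, absolute_drop is positive for every evaluated model and largest for the most capable ones.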
If this is right
- Every LLM tested loses accuracy when requirements contain ambiguity, so prompt-engineering or fine-tuning that assumes perfect clarity will overestimate real-world performance.
- Stronger models exhibit larger absolute losses, indicating that scaling alone does not confer robustness to unclear specifications.
- LLMs produce multiple functionally distinct implementations for the same ambiguous prompt, which raises the risk of silent failures in automated pipelines (one way to quantify this divergence is sketched after this list).
- Current models cannot autonomously flag or resolve ambiguity, so any reliable system must add an external clarification step.
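The divergence noted in the third point can be made measurable. A minimal sketch, assuming a sandboxed run_candidate executor and a shared list of probe inputs (both hypothetical, not defined in the paper text provided), groups repeated samples for one ambiguous prompt by their observable behavior:

from collections import Counter

def behavior_signature(run_candidate, code: str, probe_inputs: list) -> tuple:
    """Fingerprint an implementation by its outputs (or exception type) on probe inputs."""
    sig = []
    for x in probe_inputs:
        try:
            sig.append(repr(run_candidate(code, x)))
        except Exception as exc:  # differing failure modes also count as divergence
            sig.append(type(exc).__name__)
    return tuple(sig)

def divergence(run_candidate, samples: list[str], probe_inputs: list) -> float:
    """Fraction of samples outside the largest behavioral cluster (0.0 means unanimous)."""
    clusters = Counter(behavior_signature(run_candidate, s, probe_inputs) for s in samples)
    return 1.0 - clusters.most_common(1)[0][1] / len(samples)

Under a measure like this, the paper's finding corresponds to divergence staying well above zero when the prompt is ambiguous.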
Where Pith is reading between the lines
- Development teams that rely on LLMs for code generation will need explicit requirement-review stages before generation to avoid the observed performance penalty.
- Future benchmarks and evaluation suites for code LLMs should include ambiguous inputs as a standard stress test rather than treating clarity as the default.
- The observed divergence among outputs suggests that ensemble or voting methods may be less effective when the input itself is underspecified.
- Techniques that let an LLM request clarification from a human or from additional context sources could close part of the gap shown by Orchid.
Load-bearing premise
The four types of ambiguity built into Orchid reflect the ambiguities that actually appear in real software projects, and functional correctness can still be judged reliably even when the requirement statement is ambiguous.
What would settle it
Re-running the same models on a fresh collection of requirements taken directly from open-source issue trackers or industrial specifications, without any artificial ambiguity injection, and finding no drop in pass rates or no increase in output divergence.
Original abstract
Software requirement ambiguity is ubiquitous in real-world development, stemming from the inherent imprecision of natural language and the varying interpretations of stakeholders. While Large Language Models (LLMs) have demonstrated impressive capabilities in generating code from precise specifications, such ambiguity poses a significant obstacle to reliable automated code generation. Existing benchmarks typically assume clear and unambiguous requirements, leaving an empirical gap in understanding how LLMs behave when faced with the inherent uncertainty of real-world software requirements. In this paper, we introduce Orchid, the first code generation benchmark specifically designed with ambiguous requirements. It comprises 1,304 function-level tasks covering four distinct types of ambiguity: lexical, syntactic, semantic, and vagueness. Leveraging this dataset, we conduct the first systematic empirical study to evaluate the impact of requirement ambiguity on LLM-based code generation. Our results demonstrate that ambiguity consistently degrades the performance of all evaluated LLMs, with the most pronounced negative effects observed in highly advanced models. Furthermore, we observe that LLMs frequently produce functionally divergent implementations for the same ambiguous requirement and lack the capability to identify or resolve such ambiguity autonomously. These findings reveal a significant performance gap between clear and ambiguous requirements, underscoring the urgent need for ambiguity-aware techniques in the next generation of automated software engineering tools. The Orchid benchmark is publicly available at https://huggingface.co/datasets/SII-YDD/Orchid.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Orchid, the first benchmark for LLM-based function-level code generation under ambiguous requirements, comprising 1,304 tasks across lexical, syntactic, semantic, and vagueness ambiguity types. Through systematic evaluation, it claims that ambiguity consistently degrades LLM performance (most pronounced in advanced models), that LLMs frequently generate functionally divergent implementations for the same ambiguous requirement, and that they lack the ability to autonomously identify or resolve such ambiguities. The benchmark is publicly released.
Significance. If the empirical results hold under rigorous validation of correctness measurement, this work fills a notable gap in existing code-generation benchmarks that assume precise specifications. The public release of Orchid enables reproducibility and follow-on research, while the finding of greater degradation in stronger models could usefully inform priorities for ambiguity-aware tooling in automated software engineering.
major comments (1)
- [Experimental Evaluation] The headline result of consistent performance degradation (and the claim that advanced models suffer most) depends on how functional correctness is scored for ambiguous requirements. The evaluation must demonstrate that pass/fail labels remain stable across plausible stakeholder interpretations rather than relying on a single reference implementation or narrow test suite derived from one disambiguation; otherwise the measured gap may be artifactual. This needs explicit treatment in the experimental setup and results sections, including any multi-interpretation validation performed.
minor comments (2)
- [Dataset Construction] Clarify the exact procedure used to construct the four ambiguity types in Orchid, including how task realism and the original functional intent were preserved where possible.
- [Results] Add statistical significance testing (e.g., paired tests or confidence intervals) for the reported performance differences between clear and ambiguous conditions; a sketch of one such paired test follows this list.
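A minimal sketch of the kind of paired test meant here, assuming per-task pass/fail outcomes are available for both the clear and the ambiguous variant of each task. An exact McNemar test uses only the discordant pairs; the names below are illustrative rather than taken from the paper:

from math import comb

def mcnemar_exact(clear_pass: list[bool], ambiguous_pass: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired clear-vs-ambiguous pass/fail outcomes."""
    b = sum(cl and not am for cl, am in zip(clear_pass, ambiguous_pass))  # pass only when clear
    c = sum(am and not cl for cl, am in zip(clear_pass, ambiguous_pass))  # pass only when ambiguous
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, hence no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n  # P(X <= k), X ~ Binomial(n, 0.5)
    return min(1.0, 2 * tail)

Bootstrap confidence intervals on the pass-rate difference would serve the same purpose.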
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the opportunity to clarify our experimental design. We address the major comment below.
Point-by-point responses
Referee: [Experimental Evaluation] The headline result of consistent performance degradation (and the claim that advanced models suffer most) depends on how functional correctness is scored for ambiguous requirements. The evaluation must demonstrate that pass/fail labels remain stable across plausible stakeholder interpretations rather than relying on a single reference implementation or narrow test suite derived from one disambiguation; otherwise the measured gap may be artifactual. This needs explicit treatment in the experimental setup and results sections, including any multi-interpretation validation performed.
Authors: We agree that demonstrating the stability of pass/fail labels across plausible interpretations is essential to substantiate the headline claims. In the Orchid benchmark, test cases for each ambiguous requirement were constructed to target the intersection of core behaviors that should hold under multiple reasonable stakeholder interpretations for the given ambiguity type, rather than a single reference implementation. However, we acknowledge that the original manuscript did not provide an explicit sensitivity analysis or multi-interpretation validation to quantify label stability. We will revise the Experimental Setup section to detail the test-case design process and add a new analysis in the Results section reporting pass-rate variance across alternative interpretation-derived test suites. This revision will directly address the concern and strengthen the empirical foundation.
Revision: yes
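A minimal sketch of the pass-rate-variance analysis promised in the response, assuming each ambiguous task can be paired with several test suites derived from different plausible interpretations and scored through a passes(code, suite) oracle; all names are hypothetical:

from statistics import mean, pstdev

def label_stability(solutions: list[str], interpretation_suites: list[list], passes) -> dict[str, float]:
    """How much the pass rate for one ambiguous task moves across interpretation-derived suites."""
    rates = [
        sum(passes(sol, suite) for sol in solutions) / len(solutions)
        for suite in interpretation_suites
    ]
    return {
        "mean_pass_rate": mean(rates),
        "pass_rate_stdev": pstdev(rates),  # spread of pass rates over interpretations
        "max_minus_min": max(rates) - min(rates),
    }

Small spreads across suites would support the claim that the measured clear-versus-ambiguous gap is not an artifact of a single disambiguation.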
Circularity Check
No circularity: empirical evaluation on released benchmark
full rationale
The paper introduces Orchid as a new benchmark with 1,304 tasks across four ambiguity types and reports direct empirical results from evaluating multiple LLMs on clear versus ambiguous versions of the same tasks. No equations, derivations, fitted parameters presented as predictions, or load-bearing self-citations appear in the provided text. Claims rest on observable pass/fail rates and output divergence measured against the released dataset, with no reduction of results to self-definitional inputs and no prior work by the authors invoked as a uniqueness argument. The analysis is self-contained rather than dependent on external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Functional correctness of generated code can be assessed reliably even when the original requirement is ambiguous.