arXiv: 2601.16456 · v2 · submitted 2026-01-23 · 💻 cs.SE

RubberDuckBench: A Benchmark for AI Coding Assistants

Ferida Mohammed, Fatma Ayad, Petros Maniatis, Satish Chandra, Elizabeth Dinella

Pith reviewed 2026-05-16 12:13 UTC · model grok-4.3

classification 💻 cs.SE

keywords AI coding assistants · benchmark · large language models · code question answering · hallucinations · GitHub pull requests · evaluation


The pith

Even the best AI coding assistants fail to give consistent, correct answers on a benchmark of real GitHub pull request questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RubberDuckBench, a benchmark built from actual questions programmers asked in GitHub pull request comments, paired with detailed scoring rubrics. When twenty different large language models are tested on these questions, the strongest performers reach only about 69 percent overall, with most of their points coming from partial answers rather than fully correct ones. The top models fully solve at most two questions across repeated trials, and across all models 58.3 percent of responses on average contain outright false information. Performance shows no relation to model size or API cost.

Core claim

RubberDuckBench is a multilingual collection of contextualized questions about code drawn from GitHub pull request comments, together with rubrics for evaluating answers. An evaluation of twenty LLMs finds that even the highest-scoring models (Grok 4 at 69.29 percent, Claude Opus 4 at 68.5 percent, GPT-5 at 67.8 percent) show no statistically significant advantage over the next nine models, earn most of their credit through partial answers, and fully solve at most two questions each across trials. On average, models hallucinate incorrect statements in 58.3 percent of responses, and performance shows no correlation with expense.
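To make the three headline metrics concrete, the sketch below shows one way they could be computed from per-trial rubric scores. The record layout (`TrialResult`), the 0 to 1 score scale, and the rule that a question counts as fully solved only when every trial earns full marks are illustrative assumptions, not the paper's actual scoring code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrialResult:
    # Hypothetical record layout; the paper's real data format is not shown here.
    question_id: str
    score: float        # rubric score normalised to [0, 1] (assumed scale)
    hallucinated: bool  # True if the response contains a false statement


def summarize(results):
    """Return (mean score %, fully solved question count, hallucination rate %)."""
    if not results:
        raise ValueError("no trial results")
    by_question = defaultdict(list)
    for r in results:
        by_question[r.question_id].append(r.score)
    mean_pct = 100.0 * sum(r.score for r in results) / len(results)
    # Assumption: "fully solved" means full marks in every trial of a question.
    fully_solved = sum(
        all(s == 1.0 for s in scores) for scores in by_question.values()
    )
    halluc_pct = 100.0 * sum(r.hallucinated for r in results) / len(results)
    return mean_pct, fully_solved, halluc_pct
```

Under these assumptions a model can post a high mean score while solving very few questions outright, which is the pattern the core claim describes.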

What carries the argument

RubberDuckBench, a benchmark of real-world code questions extracted from GitHub pull request comments together with detailed evaluation rubrics.

If this is right

Top models do not show pairwise statistically significant superiority over the next nine best models.
Even the strongest models fully answer at most two questions correctly across all trials.
Models produce hallucinations in 58.3 percent of responses on average.
No correlation exists between model performance and either API price or parameter count.
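The last point is a correlation claim, so a small worked sketch may help. The code below computes Spearman's rank correlation using only the standard library. Spearman's rho is one common choice for this kind of check; which statistic the paper itself uses is not stated in this summary, and any scores or prices fed into the function would be the reader's own data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sample is constant")
    return cov / math.sqrt(var_x * var_y)
```

A rho near zero between benchmark score and API price (or parameter count) would be consistent with the "no correlation" finding; values near +1 or -1 would contradict it.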

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmarks focused on full correctness rather than partial credit may better expose reliability gaps.
High hallucination rates suggest current systems need built-in verification steps for code-related answers.
Developers relying on AI assistants for code questions should routinely check outputs against source material.

Load-bearing premise

The questions taken from GitHub pull request comments are representative of the kinds of questions programmers typically ask AI coding assistants.

What would settle it

An experiment in which a new model achieves scores above 80 percent and at least ten fully correct answers across repeated trials on an expanded set of similar pull-request questions would challenge the reported performance limits.

Figures

Figures reproduced from arXiv: 2601.16456 by Elizabeth Dinella, Fatma Ayad, Ferida Mohammed, Petros Maniatis, Satish Chandra.

**Figure 1.** PR comment exemplifying a contextualized question.

    string QtEventView::getTimeSlicingType() const {
        return m_sliceTypeStrMap.at(m_sliceType);
        /* Question: Is there a difference between using
           m_sliceTypeStrMap.at(m_sliceType) vs m_sliceTypeStrMap[m_sliceType]? */
    }

**Figure 3.** Rubric: Difference Between Contextualized

**Figure 4.** RQ1: Analysis of Model Performance.

… significantly outperform other models (53.8%) on Performance questions, with all five models scoring above 65% and gpt-oss-20 leading with 73.4%. Anthropic models (71.2%) show an advantage on Library Behavior questions compared to all other models (61.8%), with Claude Opus 4.1 and Claude Opus 4 achieving 82.1% and 81.5% respectively. Finding 4. Models perform best on Li…

**Figure 5.** RQ2: Resource Usage of Proprietary and Open Source Models.

**Figure 6.** Point Deduction Per Error Type.

6 Conclusion. In this paper, we present RubberDuckBench: a benchmark of 15 contextualized questions for AI coding assistants. Grok 4 is the highest performing (69.29%) but does not exhibit pairwise significant superiority over the next 12 best performing models. We find that models rarely responded with completely correct answers, with the best models only answering at most …

Original abstract

Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from GitHub pull request comments. Out of this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary & open-source) on answering these questions. We find that even state-of-the-art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best performing models. Most models obtain points through partial credit, with the best performing models only answering at most 2 questions completely correctly across all trials. Furthermore, models often hallucinate with lies in 58.3% of responses on average. Cost analysis reveals no correlation between expense (API pricing or parameter count) and performance. We intend this benchmark to be a target for future research in trustworthy and correct AI coding assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RubberDuckBench pulls real GitHub PR comments into a new multilingual benchmark with rubrics, and the results show that even the best models peak at roughly 68 to 69 percent, hallucinate heavily, and show no link between cost and score.


The paper's core move is to build RubberDuckBench from actual pull-request comments rather than synthetic questions. The authors add detailed rubrics, cover multiple languages, and run twenty models on it. The headline numbers are straightforward: Grok 4 leads at 69%, followed closely by Claude Opus 4 and GPT-5, but the top three show no statistically significant separation from the next nine models. Most credit comes from partial answers, complete correctness is rare even for the leaders, and the average hallucination rate sits at 58%. Cost and size show no reliable link to scores.

That package is the useful part. It gives the field a concrete, released target that feels closer to real code-review questions than many earlier datasets. The evaluation is broad enough to make the consistency and hallucination findings credible on their own terms.

The main soft spot is the untested claim that PR comments match the distribution of questions programmers actually ask coding assistants. Those comments can skew toward reviewer-style issues with more context and edge cases, so the reported failure rates might not generalize to quick IDE queries or simpler debugging. No comparison to usage logs or Stack Overflow threads is shown to close that gap. The abstract also leaves rubric construction and inter-rater checks a bit thin, though the overall empirical framing holds up.

This is for people who build or benchmark AI coding tools and want a practical yardstick beyond synthetic tests. A reader working on reliability or evaluation will find the numbers and the dataset worth looking at. It deserves a serious referee because the data source is fresh, the setup is reproducible, and the central claims rest on direct measurement rather than circular modeling.

Referee Report

3 major / 2 minor

Summary. The paper introduces RubberDuckBench, a multilingual benchmark of real-world contextualized questions extracted from GitHub pull request comments, along with detailed rubrics for scoring. It evaluates 20 LLMs (proprietary and open-source) and reports that even top models achieve only 67-69% overall (Grok 4 at 69.29%, Claude Opus 4 at 68.5%, GPT-5 at 67.8%), with no pairwise significant superiority over the next nine models, most points earned via partial credit (at most two fully correct answers), an average hallucination rate of 58.3%, and no correlation between performance and cost or parameter count.

Significance. If the benchmark construction and scoring prove robust, the work supplies a practical, reproducible target for improving trustworthy AI coding assistants and documents concrete limitations in current systems on contextualized queries. The absence of a cost-performance link is a useful empirical observation for practitioners.

major comments (3)

§3 (Benchmark Curation): The central claim that PR-comment questions are representative of typical programmer-AI interactions lacks supporting evidence such as distributional comparisons to IDE telemetry, Stack Overflow threads, or usage logs. This assumption is load-bearing for the reported scores and 58.3% hallucination rate; without it, generalizability remains unverified.
§4 (Rubric and Scoring): The manuscript provides insufficient detail on rubric construction, inter-rater reliability statistics, and the exact criteria for awarding partial credit. Given that the headline result rests on aggregate scores driven largely by partial credit, these omissions prevent full assessment of scoring objectivity.
§5 (Statistical Claims): The statement that the top three models show no pairwise significant superiority over the next nine requires explicit reporting of the test used, degrees of freedom, and adjusted p-values. The current presentation leaves the non-significance claim difficult to evaluate.

minor comments (2)

Abstract and §5: The phrase 'hallucinate with lies' should be replaced by a precise operational definition of hallucination (e.g., factual fabrication vs. unsupported inference) to avoid ambiguity.
Tables: Ensure consistent formatting of model names, confidence intervals, and trial counts across all result tables and the text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.


Referee: §3 (Benchmark Curation): The central claim that PR-comment questions are representative of typical programmer-AI interactions lacks supporting evidence such as distributional comparisons to IDE telemetry, Stack Overflow threads, or usage logs. This assumption is load-bearing for the reported scores and 58.3% hallucination rate; without it, generalizability remains unverified.

Authors: We acknowledge that the manuscript does not contain explicit distributional comparisons against IDE telemetry, Stack Overflow, or usage logs. The choice of GitHub PR comments was driven by their status as naturally occurring, context-rich questions arising during code review, a setting that frequently involves programmer-AI interaction. In the revision we will add a new subsection under §3 that (a) articulates this rationale, (b) explicitly states the absence of comparative distributional data, and (c) discusses the resulting limitations on generalizability together with concrete suggestions for future validation work. We do not claim the benchmark is universally representative; we present it as a reproducible target for contextualized coding queries.
revision: partial
Referee: §4 (Rubric and Scoring): The manuscript provides insufficient detail on rubric construction, inter-rater reliability statistics, and the exact criteria for awarding partial credit. Given that the headline result rests on aggregate scores driven largely by partial credit, these omissions prevent full assessment of scoring objectivity.

Authors: We agree that the current description is insufficient. The revised manuscript will expand §4 with: (1) the iterative process by which the rubrics were constructed and refined by the author team, (2) inter-rater reliability statistics (Cohen’s kappa) computed on a held-out subset of responses scored independently by two annotators, and (3) the precise decision rules used for partial credit (e.g., awarding credit for correctly locating the relevant code region or identifying the core defect even when a complete fix is not supplied). These additions will make the scoring procedure fully auditable.
revision: yes
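For readers unfamiliar with the statistic named in this response, here is a minimal standard-library sketch of Cohen's kappa for two annotators. The label values in the usage note are hypothetical, and treating rubric outcomes as unordered categories is a simplifying assumption; for ordinal scores a weighted kappa or an intraclass correlation may be a better fit.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

For example, `cohens_kappa(["full", "partial", "none", "partial"], ["full", "partial", "partial", "partial"])` compares two hypothetical annotators who agree on three of four responses; kappa discounts that raw 75% agreement by the agreement expected from chance alone.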
Referee: §5 (Statistical Claims): The statement that the top three models show no pairwise significant superiority over the next nine requires explicit reporting of the test used, degrees of freedom, and adjusted p-values. The current presentation leaves the non-significance claim difficult to evaluate.

Authors: We will revise §5 to report the full statistical procedure: pairwise Wilcoxon rank-sum tests with Bonferroni correction for the 12 relevant comparisons, including test statistics, degrees of freedom, and adjusted p-values. The revised text will also include a brief justification for the choice of non-parametric test given the score distributions.
revision: yes
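The procedure named in this response can be sketched with the standard library alone. The version below uses the tie-corrected normal approximation to the Wilcoxon rank-sum test without a continuity correction, which is rough for small samples; a real analysis would more likely use an exact method from a statistics library such as SciPy's `scipy.stats.mannwhitneyu`. The function names and any score lists passed in are illustrative assumptions.

```python
import math


def rank_sum_p(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted(list(xs) + list(ys))
    n = n1 + n2
    rank_of = {}   # value -> midrank in the pooled sample
    tie_term = 0   # sum of t^3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        t = j - i + 1
        rank_of[pooled[i]] = (i + j) / 2 + 1
        tie_term += t ** 3 - t
        i = j + 1
    w = sum(rank_of[x] for x in xs)  # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:
        return 1.0  # every value is tied, so there is no evidence of a shift
    z = (w - mean_w) / math.sqrt(var_w)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With 12 comparisons, as in the response above, each raw p-value is multiplied by 12 before being compared with the significance threshold, so a pair of models needs a raw p-value below roughly 0.004 to count as significantly different at the 0.05 level.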

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation


The paper curates questions from GitHub PR comments and runs direct LLM evaluations with rubrics, reporting accuracy, partial credit, hallucination rates, and cost correlations. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation chains exist. All results follow from the test outcomes on the fixed dataset; the derivation chain is empty and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pull request comments form a representative sample of real programmer questions to AI assistants and that the hand-crafted rubrics provide a valid scoring standard.

axioms (1)

Domain assumption: Pull request comments on GitHub represent typical questions programmers ask AI coding assistants.
The benchmark construction begins from this premise to ensure ecological validity.


