pith. the verified trust layer for science. sign in

arxiv: 2601.16456 · v2 · submitted 2026-01-23 · 💻 cs.SE

RubberDuckBench: A Benchmark for AI Coding Assistants

Pith reviewed 2026-05-16 12:13 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding assistantsbenchmarklarge language modelscode question answeringhallucinationsGitHub pull requestsevaluation
0
0 comments X p. Extension

The pith

Even the best AI coding assistants fail to provide consistent correct answers on a benchmark of real GitHub pull request questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RubberDuckBench, a benchmark built from actual questions programmers asked in GitHub pull request comments, paired with detailed scoring rubrics. When twenty different large language models are tested on these questions, the strongest performers reach only about 69 percent overall, with most of their points coming from partial answers rather than fully correct ones. The top models fully solve at most two questions across repeated trials and produce outright false information in 58.3 percent of responses on average. Performance shows no relation to model size or API cost.

Core claim

RubberDuckBench is a multilingual collection of contextualized questions about code drawn from GitHub pull request comments, together with rubrics for evaluating answers. Evaluation of twenty LLMs finds that even the highest-scoring models (Grok 4 at 69.29 percent, Claude Opus 4 at 68.5 percent, GPT-5 at 67.8 percent) show no statistically significant advantage over the next nine models, earn most credit through partial answers, fully solve at most two questions per model across trials, and hallucinate incorrect statements in 58.3 percent of responses on average, with no correlation between performance and expense.

What carries the argument

RubberDuckBench, a benchmark of real-world code questions extracted from GitHub pull request comments together with detailed evaluation rubrics.

If this is right

  • Top models do not show pairwise statistically significant superiority over the next nine best models.
  • Even the strongest models fully answer at most two questions correctly across all trials.
  • Models produce hallucinations in 58.3 percent of responses on average.
  • No correlation exists between model performance and either API price or parameter count.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks focused on full correctness rather than partial credit may better expose reliability gaps.
  • High hallucination rates suggest current systems need built-in verification steps for code-related answers.
  • Developers relying on AI assistants for code questions should routinely check outputs against source material.

Load-bearing premise

The questions taken from GitHub pull request comments are representative of the kinds of questions programmers typically ask AI coding assistants.

What would settle it

An experiment in which a new model achieves scores above 80 percent and at least ten fully correct answers across repeated trials on an expanded set of similar pull-request questions would challenge the reported performance limits.

Figures

Figures reproduced from arXiv: 2601.16456 by Elizabeth Dinella, Fatma Ayad, Ferida Mohammed, Petros Maniatis, Satish Chandra.

Figure 1
Figure 1. Figure 1: PR comment exemplifying a contextualized question. string QtEventView :: getTimeSlicingType () const { return m_sliceTypeStrMap . at ( m_sliceType ) ; /* Question : Is there a difference between using m_sliceTypeStrMap .at( m_sliceType ) vs m_sliceTypeStrMap [ m_sliceType ]? */ } [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Rubric: Difference Between Contextualized [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: RQ1: Analysis of Model Performance significantly outperform other models (53.8%) on Performance ques￾tions, with all five models scoring above 65% and gpt-oss-20 leading with 73.4%. Anthropic models (71.2%) show an advantage on Li￾brary Behavior questions compared to all other models (61.8%), with Claude Opus 4.1 and Claude Opus 4 achieving 82.1% and 81.5% respectively. Finding 4. Models perform best on Li… view at source ↗
Figure 5
Figure 5. Figure 5: RQ2: Resource Usage of Proprietary and Open Source Models. [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Point Deduction Per Error Type. 6 Conclusion In this paper, we present RubberDuckBench: a benchmark of 15 contextualized questions for AI coding assistants. Grok 4 is the high￾est performing (69.29%) but does not exhibit pairwise significant superiority over the next 12 best performing models. We find that models rarely responded with completely correct answers, with the best models only answering at most … view at source ↗
read the original abstract

Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from Github pull request comments. Out of this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary & open-source) on answering these questions. We find that even state of the art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best performing models. Most models obtain points through partial credit, with the best performing models only answering at most 2 questions completely correctly across all trials. Furthermore, models often hallucinate with lies in 58.3\% of responses on average. Cost analysis reveals no correlation between expense (API pricing or parameter count) and performance. We intend this benchmark to be a target for future research in trustworthy and correct AI coding assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RubberDuckBench, a multilingual benchmark of real-world contextualized questions extracted from GitHub pull request comments, along with detailed rubrics for scoring. It evaluates 20 LLMs (proprietary and open-source) and reports that even top models achieve only 67-69% overall (Grok 4 at 69.29%, Claude Opus 4 at 68.5%, GPT-5 at 67.8%), with no pairwise significant superiority over the next nine models, most points earned via partial credit (at most two fully correct answers), an average hallucination rate of 58.3%, and no correlation between performance and cost or parameter count.

Significance. If the benchmark construction and scoring prove robust, the work supplies a practical, reproducible target for improving trustworthy AI coding assistants and documents concrete limitations in current systems on contextualized queries. The absence of a cost-performance link is a useful empirical observation for practitioners.

major comments (3)
  1. [§3] §3 (Benchmark Curation): The central claim that PR-comment questions are representative of typical programmer-AI interactions lacks supporting evidence such as distributional comparisons to IDE telemetry, Stack Overflow threads, or usage logs. This assumption is load-bearing for the reported scores and 58.3% hallucination rate; without it, generalizability remains unverified.
  2. [§4] §4 (Rubric and Scoring): The manuscript provides insufficient detail on rubric construction, inter-rater reliability statistics, and the exact criteria for awarding partial credit. Given that the headline result rests on aggregate scores driven largely by partial credit, these omissions prevent full assessment of scoring objectivity.
  3. [§5] §5 (Statistical Claims): The statement that the top three models show no pairwise significant superiority over the next nine requires explicit reporting of the test used, degrees of freedom, and adjusted p-values. The current presentation leaves the non-significance claim difficult to evaluate.
minor comments (2)
  1. [Abstract] Abstract and §5: The phrase 'hallucinate with lies' should be replaced by a precise operational definition of hallucination (e.g., factual fabrication vs. unsupported inference) to avoid ambiguity.
  2. [Tables] Tables: Ensure consistent formatting of model names, confidence intervals, and trial counts across all result tables and the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Benchmark Curation): The central claim that PR-comment questions are representative of typical programmer-AI interactions lacks supporting evidence such as distributional comparisons to IDE telemetry, Stack Overflow threads, or usage logs. This assumption is load-bearing for the reported scores and 58.3% hallucination rate; without it, generalizability remains unverified.

    Authors: We acknowledge that the manuscript does not contain explicit distributional comparisons against IDE telemetry, Stack Overflow, or usage logs. The choice of GitHub PR comments was driven by their status as naturally occurring, context-rich questions arising during code review, a setting that frequently involves programmer-AI interaction. In the revision we will add a new subsection under §3 that (a) articulates this rationale, (b) explicitly states the absence of comparative distributional data, and (c) discusses the resulting limitations on generalizability together with concrete suggestions for future validation work. We do not claim the benchmark is universally representative; we present it as a reproducible target for contextualized coding queries. revision: partial

  2. Referee: [§4] §4 (Rubric and Scoring): The manuscript provides insufficient detail on rubric construction, inter-rater reliability statistics, and the exact criteria for awarding partial credit. Given that the headline result rests on aggregate scores driven largely by partial credit, these omissions prevent full assessment of scoring objectivity.

    Authors: We agree that the current description is insufficient. The revised manuscript will expand §4 with: (1) the iterative process by which the rubrics were constructed and refined by the author team, (2) inter-rater reliability statistics (Cohen’s kappa) computed on a held-out subset of responses scored independently by two annotators, and (3) the precise decision rules used for partial credit (e.g., awarding credit for correctly locating the relevant code region or identifying the core defect even when a complete fix is not supplied). These additions will make the scoring procedure fully auditable. revision: yes

  3. Referee: [§5] §5 (Statistical Claims): The statement that the top three models show no pairwise significant superiority over the next nine requires explicit reporting of the test used, degrees of freedom, and adjusted p-values. The current presentation leaves the non-significance claim difficult to evaluate.

    Authors: We will revise §5 to report the full statistical procedure: pairwise Wilcoxon rank-sum tests with Bonferroni correction for the 12 relevant comparisons, including test statistics, degrees of freedom, and adjusted p-values. The revised text will also include a brief justification for the choice of non-parametric test given the score distributions. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark evaluation

full rationale

The paper curates questions from GitHub PR comments and runs direct LLM evaluations with rubrics, reporting accuracy, partial credit, hallucination rates, and cost correlations. No equations, fitted parameters, predictions, uniqueness theorems, or self-citation chains exist. All results follow from the test outcomes on the fixed dataset; the derivation chain is empty and self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that pull request comments form a representative sample of real programmer questions to AI assistants and that the hand-crafted rubrics provide a valid scoring standard.

axioms (1)
  • domain assumption Pull request comments on GitHub represent typical questions programmers ask AI coding assistants
    The benchmark construction begins from this premise to ensure ecological validity.

pith-pipeline@v0.9.0 · 5528 in / 1246 out tokens · 43827 ms · 2026-05-16T12:13:31.294578+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    AlphaCode

    2025. AlphaCode. https://alphacode.deepmind.com/. Accessed: 2025-10-10

  2. [2]

    Claude 4

    2025. Claude 4. https://www.anthropic.com/news/claude-4

  3. [3]

    Claude Opus 4.1

    2025. Claude Opus 4.1. https://www.anthropic.com/news/claude-opus-4-1

  4. [4]

    Claude Sonnet

    2025. Claude Sonnet. https://www.anthropic.com/news/claude-3-7-sonnet

  5. [5]

    2025. Cursor. https://cursor.com/. Accessed: 2025-10-10

  6. [6]

    Gemini 2.5 Flash

    2025. Gemini 2.5 Flash. https://deepmind.google/models/gemini/flash/

  7. [7]

    Gemini 2.5 Pro

    2025. Gemini 2.5 Pro. https://deepmind.google/models/gemini/pro/

  8. [8]

    GitHub Copilot

    2025. GitHub Copilot. https://github.com/features/copilot. Accessed: 2025-10-10

  9. [9]

    2025. gpt-4.1. https://platform.openai.com/docs/models/gpt-4.1

  10. [10]

    2025. gpt-5. https://openai.com/index/introducing-gpt-5. Accessed: 2025-10-10

  11. [11]

    gpt-oss-120b

    2025. gpt-oss-120b. https://platform.openai.com/docs/models/gpt-oss-120b

  12. [12]

    gpt-oss-20b

    2025. gpt-oss-20b. https://platform.openai.com/docs/models/gpt-oss-20b

  13. [13]

    2025. Grok 3. https://x.ai/news/grok-3. Accessed: 2025-10-10

  14. [14]

    2025. Grok 4. https://x.ai/news/grok-4. Accessed: 2025-10-10

  15. [15]

    2025. Kiro. https://kiro.dev/. Accessed: 2025-10-10

  16. [16]

    llama-3.3

    2025. llama-3.3. https://www.llama.com/models/llama-3/. Accessed: 2025-10-10

  17. [17]

    2025. llama-4. https://www.llama.com/models/llama-4/. Accessed: 2025-10-10

  18. [18]

    2025. Mistral. https://mistral.ai/news/mistral-large. Accessed: 2025-10-10

  19. [19]

    2025. o3. https://platform.openai.com/docs/models/o3. Accessed: 2025-10-10

  20. [20]

    2025. Tabnine. https://www.tabnine.com/. Accessed: 2025-10-10

  21. [21]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program Synthesis with Large Language Models.arXiv preprint arXiv:2108.07732(2021)

  22. [22]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavar...

  23. [23]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  24. [24]

    Ria Galanos, Timothy Gallagher, and Briana Morrison. 2019. An Afternoon with an AP Computer Science A Exam Reader. InProceedings of the 50th ACM Technical Symposium on Computer Science Education(Minneapolis, MN, USA) (SIGCSE ’19). Association for Computing Machinery, New York, NY, USA, 1242. doi:10.1145/3287324.3287549

  25. [25]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL]

  26. [26]

    Avishree Khare, Saikat Dutta, Ziyang Li, Alaia Solko-Breslin, Rajeev Alur, and Mayur Naik. 2024. Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. arXiv:2311.16169 [cs.CR] https://arxiv.org/ abs/2311.16169

  27. [27]

    Zhiyu Li, Shuai Lu, Daya Guo, Nan Duan, Shailesh Jannu, Grant Jenks, Deep Majumder, Jared Green, Alexey Svyatkovskiy, Shengyu Fu, and Neel Sundare- san. 2022. Automating Code Review Activities by Large-Scale Pre-training. arXiv:2203.09095 [cs.SE]

  28. [28]

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understan...

  29. [29]

    Elise Paradis, Kate Grey, Quinn Madison, Daye Nam, Andrew Macvean, Vahid Meimand, Nan Zhang, Ben Ferrari-Church, and Satish Chandra. 2024. How much does AI impact development speed? An enterprise-based randomized controlled trial. arXiv:2410.12944 [cs.SE] https://arxiv.org/abs/2410.12944

  30. [30]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2025. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv:2402.07927 [cs.AI] https: //arxiv.org/abs/2402.07927

  31. [31]

    Nidhish Shah, Zulkuf Genc, and Dogu Araci. 2024. StackEval: Benchmarking LLMs in Coding Assistance. arXiv:2412.05288 [cs.SE] https://arxiv.org/abs/2412. 05288

  32. [32]

    Patrick E Shrout and Joseph L Fleiss. 1979. Intraclass correlations: uses in assess- ing rater reliability.Psychological Bulletin86, 2 (1979), 420–428

  33. [33]

    André Silva and Martin Monperrus. 2024. RepairBench: Leaderboard of Frontier Models for Program Repair. arXiv:2409.18952 [cs.SE] https://arxiv.org/abs/2409. 18952

  34. [34]

    Stack Overflow. 2025. Stack Overflow Developer Survey 2025. https://survey. stackoverflow.co/2025/. Accessed: 10-06-2025

  35. [35]

    Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, and Martin Vechev. 2025. BaxBench: Can LLMs Generate Correct and Secure Backends? arXiv:2502.11844 [cs.CR] https://arxiv. org/abs/2502.11844

  36. [36]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. InAdvances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agar- wal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., 248...

  37. [37]

    Shihao Xia, Mengting He, Linhai Song, and Yiying Zhang. 2025. SC-Bench: A Large-Scale Dataset for Smart Contract Auditing. arXiv:2410.06176 [cs.CR] https://arxiv.org/abs/2410.06176

  38. [38]

    Kai Xu, YiWei Mao, XinYi Guan, and ZiLong Feng. 2025. Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks. arXiv:2505.07473 [cs.AI] https://arxiv.org/abs/2505.07473

  39. [39]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  40. [40]

    Quanjun Zhang, Chunrong Fang, Yang Xie, Yaxin Zhang, Yun Yang, Weisong Sun, Shengcheng Yu, and Zhenyu Chen. 2024. A Survey on Large Language Models for Software Engineering. arXiv:2312.15223 [cs.SE] https://arxiv.org/abs/2312.15223

  41. [41]

    Li Zhong and Zilong Wang. 2023. Can ChatGPT replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation. arXiv:2308.10335 [cs.CL]

  42. [42]

    Albert Ziegler, Eirini Kalliamvakou, Shawn Simister, Ganesh Sittampalam, Alice Li, Andrew Rice, Devon Rifkin, and Edward Aftandilian. 2022. Productivity Assessment of Neural Code Completion. arXiv:2205.06537 [cs.SE] https://arxiv. org/abs/2205.06537