pith. machine review for the scientific record.

arxiv: 2604.05203 · v1 · submitted 2026-04-06 · 💻 cs.HC

Recognition: no theorem link

Decision-Oriented Programming with Aporia

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:47 UTC · model grok-4.3

classification 💻 cs.HC
keywords decision-oriented programming · AI coding agents · design decisions · mental models · user study · test suites · human-AI collaboration · programming tools

The pith

Explicit decision tracking in AI-assisted programming keeps programmers engaged and helps them form more accurate mental models of their code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes decision-oriented programming as a paradigm where key design decisions serve as the shared, explicit medium between programmers and AI agents. Decisions are structured, interactively elicited by the agent, and each one is directly traceable to the generated code. Aporia implements this vision with a persistent Decision Bank for storing and editing decisions, proactive design questions to elicit them, and conversion of each decision into an executable test suite for validation. In a study of 14 programmers, the approach increased engagement with design, supported exploration and validation steps, and produced mental models five times less likely to disagree with the actual code than those formed using a standard coding agent.

Core claim

Decision-oriented programming treats decisions as the primary interface between the programmer and the AI agent. Decisions are made explicit and structured, co-authored through interactive elicitation, and each one is traceable back to the resulting code. Aporia realizes this by maintaining a Decision Bank of persistent, editable decisions; proactively asking programmers design questions; and encoding decisions as executable test suites that validate the implementation. In a user study, this approach increased engagement in the design process, supported exploration and validation, and led to significantly more accurate programmer understanding of the code.

What carries the argument

The Decision Bank, which stores persistent, editable decisions that the agent elicits through questions and encodes as executable test suites linking intent directly to implementation.
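
To make this concrete, here is a minimal TypeScript sketch (the paper's artifact is a VS Code extension, hence the language) of what a persistent, editable decision store with a test-suite encoding could look like. The paper does not publish Aporia's internals, so every name here (Decision, TestCase, DecisionBank, newRecordId) is a hypothetical illustration of the described design, not Aporia's actual API.

```typescript
// Hypothetical sketch of a Decision Bank: persistent, editable decisions,
// each traceable to code and encoded as executable checks.

interface TestCase {
  name: string;
  run: () => Promise<boolean>; // true if the implementation honors the decision
}

interface Decision {
  id: string;
  question: string;   // the design question the agent asked
  answer: string;     // the option the programmer chose
  rationale?: string; // optional free-text justification
  codeRefs: string[]; // files/symbols the decision traces to
  tests: TestCase[];  // executable encoding of the decision
}

class DecisionBank {
  private decisions = new Map<string, Decision>();

  record(decision: Decision): void {
    this.decisions.set(decision.id, decision); // persistent store
  }

  edit(id: string, update: Partial<Decision>): void {
    const existing = this.decisions.get(id); // decisions stay editable
    if (existing) this.decisions.set(id, { ...existing, ...update });
  }

  // Validate the current implementation against every recorded decision.
  async validate(): Promise<Map<string, boolean>> {
    const results = new Map<string, boolean>();
    for (const d of this.decisions.values()) {
      const outcomes = await Promise.all(d.tests.map((t) => t.run()));
      results.set(d.id, outcomes.every(Boolean));
    }
    return results;
  }
}

// Usage: one elicited decision with a self-contained check.
const newRecordId = (): string => crypto.randomUUID(); // stand-in for generated code

const bank = new DecisionBank();
bank.record({
  id: "ids-uuid",
  question: "Should record IDs be sequential integers or UUIDs?",
  answer: "UUIDs",
  codeRefs: ["src/records.ts#newRecordId"],
  tests: [
    {
      name: "generated IDs are not plain integers",
      run: async () => Number.isNaN(Number(newRecordId())),
    },
  ],
});
```

The property the sketch tries to capture is traceability: each decision carries both references into the code and executable checks, so editing a decision in the bank changes what the implementation is validated against.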

Load-bearing premise

The observed gains in engagement and mental model accuracy from the 14-participant study will generalize and stem specifically from explicit decision tracking rather than other differences in the interface or agent behavior.

What would settle it

A controlled follow-up study with a larger group of programmers in which a baseline agent asks similar design questions but omits the structured Decision Bank and test encoding, measuring whether the differences in mental model accuracy and engagement disappear.
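
As a concrete rendering of that ablation, a minimal TypeScript sketch of the two conditions follows; all field names are hypothetical, and the point is only that the arms differ in nothing but the structured decision tracking.

```typescript
// Hypothetical ablation design: two agent configurations matched on every
// dimension except the decision-tracking features under test.

interface AgentConfig {
  model: string;                // same underlying LLM in both arms
  asksDesignQuestions: boolean; // both arms proactively ask design questions
  decisionBank: boolean;        // persistent, editable decision store
  testEncoding: boolean;        // decisions compiled to executable test suites
}

const treatment: AgentConfig = {
  model: "same-llm-in-both-arms",
  asksDesignQuestions: true,
  decisionBank: true,
  testEncoding: true,
};

// The baseline keeps the questions but drops the structured tracking, so any
// remaining gap in mental-model accuracy or engagement is attributable to the
// Decision Bank and test encoding rather than to question-asking itself.
const baseline: AgentConfig = {
  ...treatment,
  decisionBank: false,
  testEncoding: false,
};
```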

Figures

Figures reproduced from arXiv: 2604.05203 by Benjamin C. Pierce, Harrison Goldstein, Hila Peleg, Nadia Polikarpova, Raven Rothkopf, Saketh Ram Kasibatla, Sorin Lerner.

Figure 1. We describe Decision-Oriented Programming, a paradigm for supporting human decision-making in AI-assisted …
Figure 2. An example paper details page in NotCRP.
Figure 3. Aporia is a VS Code extension that supports decision-oriented programming. Users enter their high-level goal into the Goal Field (1), which prompts Aporia to generate design questions as Question Bubbles (2). Clicking on a question opens the Question Detail View (3), which includes Aporia's argument for the question, as well as references to relevant code. Users can answer the question with a yes/no response …
Figure 4. Sample correctness analysis of post-task survey responses from P12 (Task A, left) and P3 (Task B, right). Free …
Figure 5. Distribution of participants' post-study survey Likert …
Figure 6. Participants' self-reported decision-making strategies.
Figure 7. Participants' self-reported validation strategies.
read the original abstract

AI agents allow developers to express computational intent abstractly, reducing cognitive effort and helping achieve flow during programming. Increased abstraction, however, comes at a cost: developers cede decision-making authority to agents, often without realizing that important design decisions are being made without them. We aim to bring these decisions to the foreground in a paradigm we dub decision-oriented programming. In DOP, (1) decisions are explicit and structured, serving as the shared medium between the programmer and the agent; (2) decisions are co-authored interactively, with the agent proactively eliciting them from the programmer; and (3) each decision is traceable to code. As a step towards this vision, we have built Aporia, a design probe that tracks decisions in a persistent, editable Decision Bank; elicits them by asking programmers design questions; and encodes each decision as an executable test suite that can be used to validate the implementation. In a user study of 14 programmers, Aporia increased engagement in the design process and scaffolded both exploration and validation. Participants also gained a more accurate understanding of their implementations, with their mental models 5x less likely to disagree with the code than a baseline coding agent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces decision-oriented programming (DOP) as a paradigm in which design decisions are explicit and structured as the shared medium between programmer and AI agent, co-authored interactively, and traceable to code. It presents Aporia, a design probe implementing DOP via a persistent editable Decision Bank, proactive elicitation through design questions, and encoding of each decision as an executable test suite for validation. A user study with 14 programmers reports that Aporia increases engagement in the design process, scaffolds exploration and validation, and produces mental models 5x less likely to disagree with the code than a baseline coding agent.

Significance. If the user-study results hold after methodological clarification, the work could meaningfully advance human-AI collaboration in programming by foregrounding decision authority rather than abstracting it away. The design-probe approach and quantitative outcome on mental-model accuracy provide a concrete, falsifiable starting point for tools that aim to preserve developer understanding and agency.

major comments (2)
  1. [Abstract and User Study] The central empirical claim of a '5x' reduction in mental-model disagreement with the code is load-bearing for the paper's contribution, yet the manuscript supplies no description of how mental models were elicited or measured, how disagreement was operationalized or scored, the exact baseline agent implementation, participant selection or demographics, task details, session length, or any statistical analysis supporting the 5x factor. These omissions prevent evaluation of the result's validity.
  2. [User Study] The comparison to the baseline coding agent must demonstrate that the baseline was matched on all interface and capability dimensions except explicit decision tracking; without this, it is impossible to isolate the effect of the Decision Bank and elicitation features from other differences that could drive the observed engagement and accuracy gains.
minor comments (2)
  1. [Introduction] The three DOP principles are listed in the abstract but would benefit from a concise table or enumerated list in the introduction to make their distinctions from related ideas (e.g., test-driven development) immediately clear.
  2. [Abstract] The phrase '5x less likely' should be accompanied by the raw rates or probabilities (e.g., baseline disagreement rate vs. Aporia rate) so readers can assess the practical magnitude.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. The comments correctly identify gaps in methodological reporting that limit evaluability of the user-study claims. We have revised the manuscript to supply the requested details and clarifications. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract and User Study] The central empirical claim of a '5x' reduction in mental-model disagreement with the code is load-bearing for the paper's contribution, yet the manuscript supplies no description of how mental models were elicited or measured, how disagreement was operationalized or scored, the exact baseline agent implementation, participant selection or demographics, task details, session length, or any statistical analysis supporting the 5x factor. These omissions prevent evaluation of the result's validity.

    Authors: We agree that the original manuscript omitted critical methodological details required to assess the 5x claim. This was a reporting oversight. The revised User Study section now provides: (1) the exact protocol for eliciting mental models (post-task structured interviews plus code-behavior prediction tasks), (2) the operationalization and scoring of disagreement (binary mismatch between participant prediction and observed execution trace, aggregated across tasks; see the sketch after these responses), (3) the baseline agent specification (identical underlying LLM and chat interface, differing only in the absence of the Decision Bank and proactive elicitation), (4) participant recruitment, demographics, and screening criteria, (5) task descriptions and session durations, and (6) the descriptive frequency analysis underlying the 5x ratio together with its limitations. These additions enable readers to evaluate internal validity. revision: yes

  2. Referee: [User Study] The comparison to the baseline coding agent must demonstrate that the baseline was matched on all interface and capability dimensions except explicit decision tracking; without this, it is impossible to isolate the effect of the Decision Bank and elicitation features from other differences that could drive the observed engagement and accuracy gains.

    Authors: We concur that isolating the contribution of explicit decision tracking requires the baseline to be matched on all other interface and capability dimensions. The revised manuscript now includes a dedicated subsection describing the baseline implementation and the controls used to equate it with Aporia on chat latency, code-generation quality, UI layout, and available commands. We also added a limitations paragraph discussing residual confounds and the rationale for the chosen between-subjects design. These changes make the attribution argument explicit and falsifiable. revision: yes
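
To make the scoring described in response 1 concrete, here is a minimal TypeScript sketch of the kind of descriptive frequency analysis that could yield a '5x' figure: binary prediction-vs-trace mismatches aggregated per condition. The record shape and the counts are invented for illustration; they are not the paper's data.

```typescript
// Hypothetical scoring sketch: disagreement rate per condition and the ratio.

interface PredictionProbe {
  condition: "aporia" | "baseline";
  mismatch: boolean; // participant prediction disagreed with the execution trace
}

function disagreementRate(
  probes: PredictionProbe[],
  condition: PredictionProbe["condition"],
): number {
  const subset = probes.filter((p) => p.condition === condition);
  const mismatches = subset.filter((p) => p.mismatch).length;
  return mismatches / subset.length;
}

// Made-up counts: 3/60 mismatched probes with Aporia, 15/60 with the baseline.
const probes: PredictionProbe[] = [
  ...Array.from({ length: 60 }, (_, i) => ({
    condition: "aporia" as const,
    mismatch: i < 3,
  })),
  ...Array.from({ length: 60 }, (_, i) => ({
    condition: "baseline" as const,
    mismatch: i < 15,
  })),
];

const aporiaRate = disagreementRate(probes, "aporia");     // 0.05
const baselineRate = disagreementRate(probes, "baseline"); // 0.25
console.log(`ratio: ${(baselineRate / aporiaRate).toFixed(1)}x`); // "ratio: 5.0x"
```

Reporting the raw rates alongside the ratio, as the referee's second minor comment asks, is what makes the practical magnitude of such a figure visible.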

Circularity Check

0 steps flagged

No circularity: empirical user study with no derivation chain

full rationale

This paper presents an empirical design-probe study of the Aporia system rather than any mathematical derivation, first-principles result, or fitted model. The central claims rest on a 14-participant user study measuring engagement, scaffolding, and mental-model accuracy; these outcomes are reported directly from participant data and do not reduce by construction to prior definitions, self-citations, or fitted parameters. No equations, ansatzes, uniqueness theorems, or self-referential predictions appear in the provided abstract or study description. The work is therefore self-contained against external benchmarks (the study itself) with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on domain assumptions about the value of explicit decisions in human-AI programming collaboration and on the effectiveness of encoding decisions as test suites; no free parameters are introduced, and the two invented entities (the Decision Bank and the Aporia probe) carry no independent evidence beyond the system itself.

axioms (2)
  • domain assumption Making design decisions explicit and interactive improves programmer engagement and mental model accuracy
    Core premise of the DOP paradigm stated in the abstract
  • ad hoc to paper Encoding each decision as an executable test suite provides effective validation
    Described as one of the three core properties of Aporia
invented entities (2)
  • Decision Bank no independent evidence
    purpose: Persistent, editable store for decisions that serves as shared medium between programmer and agent
    New construct introduced to track and surface decisions
  • Aporia design probe no independent evidence
    purpose: Implementation that elicits decisions via questions and encodes them as tests
    The concrete artifact built to explore the DOP vision

pith-pipeline@v0.9.0 · 5525 in / 1550 out tokens · 56234 ms · 2026-05-10T18:47:42.842125+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

59 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1] [n. d.]. Agent Context Protocol. https://zed.dev/acp
  2. [2] [n. d.]. Model Context Protocol. https://modelcontextprotocol.io/
  3. [3] [n. d.]. Visual Studio Code. https://code.visualstudio.com
  4. [4] Anthropic. 2026. Claude Code CLI. https://code.claude.com/docs/en/cli-reference
  5. [5] Anthropic. 2026. Introducing Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6
  6. [6] Shraddha Barke, Michael B. James, and Nadia Polikarpova. 2023. Grounded Copilot: How Programmers Interact with Code-Generating Models. Proceedings of the ACM on Programming Languages 7, OOPSLA1 (April 2023), 85–111. doi:10.1145/3586030
  7. [7] Kent Beck. 2002. Test Driven Development: By Example. Addison-Wesley Longman, Amsterdam.
  8. [8] Kent Beck. 2015. Test-Driven Development: By Example (20. printing ed.). Addison-Wesley, Boston.
  9. [9] Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, and Dave Farley. 2026. Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability. arXiv:2507.00788 [cs.SE]. https://arxiv.org/abs/2507.00788
  10. [10] Dibyendu Brinto Bose. 2025. From Prompts to Properties: Rethinking LLM Code Generation with Property-Based Testing. In Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering. 1660–1665.
  11. [11] Virginia Braun and Victoria Clarke. 2006. Using Thematic Analysis in Psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101.
  12. [12] Ruijia Cheng, Titus Barik, Alan Leung, Fred Hohman, and Jeffrey Nichols. 2024. BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks. In 2024 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 13–23. doi:10.1109/VL/HCC60511.2024.00012
  13. [13] Koen Claessen and John Hughes. 2000. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. In Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming (ICFP '00), Montreal, Canada, September 18–21, 2000, Martin Odersky and Philip Wadler (Eds.). ACM, 268–279. doi:10.1145/351240.351266
  14. [14] Coder. 2026. code-server: VS Code in the Browser. https://github.com/coder/code-server
  15. [15] Cursor. 2023. Cursor. https://cursor.com
  16. [16] Sarah Fakhoury, Aaditya Naik, Georgios Sakkas, Saikat Chakraborty, and Shuvendu K. Lahiri. 2024. LLM-Based Test-Driven Interactive Code Generation: User Study and Empirical Evaluation. IEEE Transactions on Software Engineering 50, 9 (2024), 2254–2268. doi:10.1109/TSE.2024.3428972
  17. [17] Muhammad Shoaib Farooq, Uzma Omer, Amna Ramzan, Mansoor Ahmad Rasheed, and Zabihullah Atal. 2023. Behavior Driven Development: A Systematic Literature Review. IEEE Access 11 (2023), 88008–88024. doi:10.1109/ACCESS.2023.3302356
  18. [18] Ahmed Fawzy, Amjed Tahir, and Kelly Blincoe. 2025. Vibe Coding in Practice: Motivations, Challenges, and a Future Outlook – a Grey Literature Review. arXiv:2510.00328 [cs.SE]. https://arxiv.org/abs/2510.00328
  19. [19] K. J. Kevin Feng, Kevin Pu, Matt Latzke, Tal August, Pao Siangliulue, Jonathan Bragg, Daniel S. Weld, Amy X. Zhang, and Joseph Chee Chang. 2026. Cocoa: Co-Planning and Co-Execution with AI Agents. arXiv:2412.10999 [cs.HC]. https://arxiv.org/abs/2412.10999
  20. [20] Katy Ilonka Gero, Chelse Swoopes, Ziwei Gu, Jonathan K. Kummerfeld, and Elena L. Glassman. 2024. Supporting Sensemaking of Large Language Model Outputs at Scale. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI 2024), Honolulu, HI, USA, May 11–16, 2024, Florian 'Floyd' Mueller, Penny Kyburz, Julie R. Williamson, Corina Sas, Ma…
  21. [21] Emmanuel Anaya González, Raven Rothkopf, Sorin Lerner, and Nadia Polikarpova. 2025. HiLDE: Intentional Code Generation via Human-in-the-Loop Decoding. In 2025 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). 222–233. doi:10.1109/VL-HCC65237.2025.00032
  22. [22] Kevin Han, Siddharth Maddikayala, Tim Knappe, Om Patel, Austen Liao, and Amir Barati Farimani. 2026. TDFlow: Agentic Workflows for Test Driven Development. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Vera Demberg, Kentaro Inui, and Lluís Marquez (Eds.). Association …
  23. [23] Sandra G. Hart and Lowell E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research. In Advances in Psychology, Vol. 52. Elsevier, 139–183.
  24. [24] Grace Murray Hopper. 1952. The Education of a Computer. In Proceedings of the 1952 ACM National Meeting (Pittsburgh) (ACM '52). Association for Computing Machinery, New York, NY, USA, 243–249. doi:10.1145/609784.609818
  25. [25] Grace Murray Hopper. 1969. Standardization of High-Level Languages. In Proceedings of the May 14–16, 1969, Spring Joint Computer Conference (Boston, Massachusetts) (AFIPS '69 (Spring)). Association for Computing Machinery, New York, NY, USA, 608–609. doi:10.1145/1476793.1476890
  26. [26] Yiran Hu, Nan Jiang, Shanchao Liang, Yi Wu, and Lin Tan. 2025. TENET: Leveraging Tests Beyond Validation for Code Generation. arXiv:2509.24148 [cs.SE]. https://arxiv.org/abs/2509.24148
  27. [27] Ruanqianqian Huang, Avery Reyna, Sorin Lerner, Haijun Xia, and Brian Hempel. …
  28. [28] Professional Software Developers Don't Vibe, They Control: AI Agent Use for Coding in 2025. arXiv:2512.14012 [cs.SE]. https://arxiv.org/abs/2512.14012
  29. [29] GitHub Inc. 2022. GitHub Copilot. https://github.com/features/copilot
  30. [30] Stack Exchange Inc. [n. d.]. 2025 Stack Overflow Developer Survey. https://survey.stackoverflow.co/2025
  31. [31] Windsurf Inc. 2024. Windsurf. https://windsurf.com
  32. [32] Eshin Jolly. 2018. Pymer4: Connecting R and Python for Linear Mixed Modeling. Journal of Open Source Software 3, 31 (2018), 862.
  33. [33] Majeed Kazemitabaar, Jack Williams, Ian Drosos, Tovi Grossman, Austin Z. Henley, Carina Negreanu, and Advait Sarkar. 2024. Improving Steering and Verification in AI-Assisted Data Analysis with Interactive Task Decomposition. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24). ACM. doi:10.1145/3654777.3…
  34. [34] Maurice George Kendall and Jean Dickinson Gibbons. 1962. Rank Correlation Methods. (1962).
  35. [35] Werner Kunz and Horst W. J. Rittel. 1970. Issues as Elements of Information Systems. Technical Report 131. Institute of Urban and Regional Development, University of California, Berkeley, California.
  36. [36] Thomas D. LaToza, Maryam Arab, Dastyni Loksa, and Amy J. Ko. 2020. Explicit Programming Strategies. Empirical Software Engineering 25, 4 (2020), 2416–2449. doi:10.1007/S10664-020-09810-1
  37. [37] Yunhao Liang, Ruixuan Ying, Shiwen Ni, and Zhe Cui. 2026. Scaling Test-Driven Code Generation from Functions to Classes: An Empirical Study. arXiv:2602.03557 [cs.SE]. https://arxiv.org/abs/2602.03557
  38. [38] Wendy E. Mackay and Joanna McGrenere. 2025. Comparative Structured Observation. ACM Transactions on Computer-Human Interaction 32, 2, Article 14 (April 2025), 27 pages. doi:10.1145/3711838
  39. [39] Allan MacLean, Richard M. Young, Victoria Bellotti, and Thomas P. Moran. 1991. Questions, Options, and Criteria: Elements of Design Space Analysis. Human-Computer Interaction 6, 3–4 (1991), 201–250. doi:10.1080/07370024.1991.9667168
  40. [40] Richard McElreath. 2018. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman and Hall/CRC.
  41. [41] Dirk Merkel. 2014. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal 2014, 239 (2014), 2.
  42. [42] Microsoft Corporation. 2026. Visual Studio Code, Version 1.109. https://visualstudio.com
  43. [43] Thomas P. Moran and John M. Carroll. 1996. Design Rationale: Concepts, Techniques, and Use. CRC Press.
  44. [44] Jakob Nielsen. 1994. Heuristic Evaluation. John Wiley & Sons, Inc., USA, 25–62.
  45. [45] Anthropic PBC. 2025. Claude Code. https://claude.com/product/claude-code
  46. [46] Veronica Pimenova, Sarah Fakhoury, Christian Bird, Margaret-Anne Storey, and Madeline Endres. 2025. Good Vibrations? A Qualitative Study of Co-Creation, Communication, Flow, and Trust in Vibe Coding. arXiv:2509.12491 [cs.SE]. https://arxiv.org/abs/2509.12491
  47. [47] Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, and Yan Chen. 2025. Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–21.
  48. [48] Advait Sarkar and Ian Drosos. 2025. Vibe Coding: Programming Through Conversation with Artificial Intelligence. arXiv:2506.23253 [cs.HC]. https://arxiv.org/abs/2506.23253
  49. [49] Margaret-Anne Storey. 2026. From Technical Debt to Cognitive and Intent Debt: Rethinking Software Health in the Age of AI. arXiv:2603.22106 [cs.SE]. https://arxiv.org/abs/2603.22106
  50. [50] Sangho Suh, Meng Chen, Bryan Min, Toby Jia-Jun Li, and Haijun Xia. 2024. Luminate: Structured Generation and Exploration of Design Space with Large Language Models for Human-AI Co-Creation. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI 2024), Honolulu, HI, USA, May 11–16, 2024, Florian 'Floyd' Mueller, Penny Kyburz, Julie …
  51. [51] Mojtaba Vaismoradi, Hannele Turunen, and Terese Bondas. 2013. Content Analysis and Thematic Analysis: Implications for Conducting a Qualitative Descriptive Study. Nursing & Health Sciences 15, 3 (2013), 398–405.
  52. [52] Priyan Vaithilingam, Elena L. Glassman, Jeevana Priya Inala, and Chenglong Wang. 2024. DynaVis: Dynamically Synthesized UI Widgets for Visualization Editing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 985, 17 pages. doi:10.1145/…
  53. [53] Priyan Vaithilingam, Munyeong Kim, Frida-Cecilia Acosta-Parenteau, Daniel Lee, Amine Mhedhbi, Elena L. Glassman, and Ian Arawjo. 2025. Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST 2025), Busan, Korea, 28 September 2025 – 1 O…
  54. [54] Priyan Vaithilingam, Tianyi Zhang, and Elena L. Glassman. 2022. Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts. ACM, New Orleans, LA, USA, 1–7. doi:10.1145/3491101.3519665
  55. [55] Vasudev Vikram, Caroline Lemieux, Joshua Sunshine, and Rohan Padhye. 2023. Can Large Language Models Write Good Property-Based Tests? arXiv preprint arXiv:2307.04346 (2023).
  56. [56] Frank Wilcoxon. 1992. Individual Comparisons by Ranking Methods. Springer New York, New York, NY, 196–202. doi:10.1007/978-1-4612-4380-9_16
  57. [57] Ryan Yen, Jiawen Stefanie Zhu, Sangho Suh, Haijun Xia, and Jian Zhao. 2024. CoLadder: Manipulating Code Generation via Multi-Level Blocks. In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST 2024), Pittsburgh, PA, USA, October 13–16, 2024, Lining Yao, Mayank Goel, Alexandra Ion, and Pedro Lopes (Eds.). ACM, 11:1–1…
  58. [58] J. D. Zamfirescu-Pereira, Eunice Jun, Michael Terry, Qian Yang, and Bjoern Hartmann. 2025. Beyond Code Generation: LLM-Supported Exploration of the Program Design Space. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI 2025), Yokohama, Japan, 26 April – 1 May 2025, Naomi Yamashita, Vanessa Evers, Koji Yatani, Sharon Xia…
  59. [59] Wenshuo Zhang, Leixian Shen, Shuchang Xu, Jindu Wang, Jian Zhao, Huamin Qu, and Linping Yuan. 2025. NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST 2025), Busan, Korea, 28 September 2025 – 1 October 2025, Andrea Bianc…