pith. machine review for the scientific record.

arxiv: 2604.10530 · v1 · submitted 2026-04-12 · 💻 cs.SE · cs.AI · cs.HC


Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering


Pith reviewed 2026-05-10 16:04 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC
keywords AI reliance · overreliance · underreliance · LLM in software engineering · developer interviews · control framework · appropriate AI use · software development tools

The pith

The level of control developers exercise over AI outputs can mark overreliance or underreliance on the technology.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Based on interviews with twenty-two software developers who use large language models for coding tasks, the paper builds a preliminary framework that connects the amount of human oversight applied to AI suggestions with patterns of overreliance and underreliance. Overreliance appears when developers accept outputs with little editing or verification, while underreliance shows when they heavily rework or discard AI contributions. This matters because sustained overreliance risks eroding critical thinking skills and underreliance risks forgoing productivity and quality gains from the tools. The framework supplies a concrete way to locate an appropriate middle level of reliance. It also calls for further examination of how current and emerging LLM tools enable different degrees of developer control.

Core claim

From twenty-two interviews with software developers about their LLM use in development work, the authors derive a reliance-control framework. In the framework, the degree of control developers retain over AI outputs functions as an indicator that distinguishes overreliance, underreliance, and appropriate reliance. Developers who make few changes to AI-generated code or text tend toward overreliance, those who extensively revise or ignore outputs tend toward underreliance, and the framework positions balanced reliance at intermediate control levels. The authors recommend future research mapping the control options present in existing tools and those emerging in new ones.

What carries the argument

The reliance-control framework, which classifies developer interactions with AI outputs according to the amount of human editing, verification, or rejection applied.
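As a rough illustration only: the paper's control levels are derived qualitatively from interviews, not from a metric, but the classification idea can be sketched as a crude proxy that measures how much of an AI suggestion a developer edits before committing it. All names and thresholds below are hypothetical, not taken from the paper.

```python
import difflib

# Hypothetical cutoffs -- the paper defines control levels
# qualitatively; these numbers are for illustration only.
OVERRELIANCE_MAX = 0.05   # almost no edits: low control
UNDERRELIANCE_MIN = 0.60  # mostly rewritten or discarded: high control

def edit_ratio(ai_suggestion: str, final_code: str) -> float:
    """Fraction of the AI suggestion changed before commit (0 = accepted verbatim)."""
    similarity = difflib.SequenceMatcher(None, ai_suggestion, final_code).ratio()
    return 1.0 - similarity

def classify_reliance(ai_suggestion: str, final_code: str) -> str:
    """Map an edit ratio onto the framework's three reliance categories."""
    r = edit_ratio(ai_suggestion, final_code)
    if r <= OVERRELIANCE_MAX:
        return "possible overreliance (low control)"
    if r >= UNDERRELIANCE_MIN:
        return "possible underreliance (high control)"
    return "appropriate reliance (intermediate control)"
```

A verbatim-accepted suggestion lands in the low-control band, a fully rewritten one in the high-control band; anything that settled on numeric thresholds like these would itself need the kind of validation the referee report asks for.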

Load-bearing premise

The level of control developers keep over AI outputs is a valid and generalizable proxy for distinguishing overreliance from underreliance.

What would settle it

A longitudinal study that assigns developers to overreliance, underreliance, or appropriate-reliance groups using the framework's control-level criteria and then measures actual skill retention and productivity changes over six months finds no consistent differences across groups.

Figures

Figures reproduced from arXiv: 2604.10530 by Christoph Treude, John Grundy, Rashina Hoda, Samuel Ferino.

Figure 1. Reliance-Control Framework on AI for SE. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]
Original abstract

How software developers interact with Artificial Intelligence (AI)-powered tools, including Large Language Models (LLMs), plays a vital role in how these AI-powered tools impact them. While overreliance on AI may lead to long-term negative consequences (e.g., atrophy of critical thinking skills), underreliance might deprive software developers of potential gains in productivity and quality. Based on twenty-two interviews with software developers on using LLMs for software development, we propose a preliminary reliance-control framework where the level of control can be used as a way to identify AI overreliance and underreliance. We also use it to recommend future research to further explore the different control levels supported by the current and emergent LLM-driven tools. Our paper contributes to the emerging discourse on AI overreliance and provides an understanding of the appropriate degree of reliance as essential to developers making the most of these powerful technologies. Our findings can help practitioners, educators, and policymakers promote responsible and effective use of AI tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a preliminary reliance-control framework for AI in software engineering, derived from 22 interviews with software developers using LLMs. It claims that the level of control developers exercise over AI outputs can identify overreliance (low control) versus underreliance (high control), and uses the framework to recommend future research on control levels supported by current and emerging LLM tools, while contributing to discourse on appropriate AI reliance.

Significance. If the control-level proxy can be validated against objective outcomes, the framework would offer a practical conceptual tool for balancing productivity gains from AI tools against risks such as skill atrophy in software development. It provides a novel lens for human-AI collaboration in SE that could inform tool design, developer training, and policy, building on existing work on overreliance. The preliminary framing appropriately signals limited generalizability, but the contribution remains primarily conceptual until the proxy is tested.

major comments (2)
  1. [Abstract / Framework section] Abstract and framework derivation section: The central claim that control level serves as a valid proxy for distinguishing overreliance (low control) from underreliance (high control) rests on interview-derived categories without reported validation against objective measures such as code correctness, task completion time, suggestion acceptance rates, or post-AI skill retention. This proxy assumption is load-bearing for the framework's utility but lacks cross-validation or falsification tests.
  2. [Methods / Study design] Methods / Study design section: No details are provided on participant selection criteria, interview protocol, data analysis method (e.g., thematic coding scheme, inter-rater reliability), or how the 22 interviews were used to construct the specific control levels in the framework. These omissions prevent assessment of the framework's empirical grounding and replicability.
minor comments (1)
  1. [Abstract] The abstract could explicitly note the absence of quantitative validation data to better set reader expectations for a preliminary framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where we agree revisions are warranted to improve clarity, transparency, and the manuscript's positioning as preliminary work.

Point-by-point responses
  1. Referee: [Abstract / Framework section] Abstract and framework derivation section: The central claim that control level serves as a valid proxy for distinguishing overreliance (low control) from underreliance (high control) rests on interview-derived categories without reported validation against objective measures such as code correctness, task completion time, suggestion acceptance rates, or post-AI skill retention. This proxy assumption is load-bearing for the framework's utility but lacks cross-validation or falsification tests.

    Authors: We agree that the control-level proxy is derived from qualitative interview data and has not been validated against objective outcomes. The manuscript already describes the work as preliminary to signal this exploratory scope. In revision we will update the abstract and framework section to more explicitly state the proxy's basis in developer perceptions, note the absence of quantitative validation, and outline specific directions for future empirical testing. This clarification strengthens the conceptual contribution without overstating current evidence. revision: partial

  2. Referee: [Methods / Study design] Methods / Study design section: No details are provided on participant selection criteria, interview protocol, data analysis method (e.g., thematic coding scheme, inter-rater reliability), or how the 22 interviews were used to construct the specific control levels in the framework. These omissions prevent assessment of the framework's empirical grounding and replicability.

    Authors: We acknowledge the methods section requires greater detail. In the revised manuscript we will expand it to describe participant recruitment and selection criteria, the semi-structured interview protocol, the thematic analysis procedure including the coding scheme, any inter-rater reliability checks performed, and the iterative process by which the control levels were derived from the 22 interviews. These additions will improve transparency and replicability. revision: yes

Circularity Check

0 steps flagged

No circularity: framework constructed from external interview data

full rationale

The paper derives its preliminary reliance-control framework directly from twenty-two interviews with software developers, treating the interview insights as the empirical foundation for identifying overreliance (low control) and underreliance (high control). No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claim rests on external qualitative data rather than reducing to its own inputs by construction, satisfying the criteria for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that qualitative interview data can reliably surface general patterns of AI reliance and that the newly introduced control-level construct is a meaningful proxy. No numerical parameters are fitted; the framework itself is the main invented conceptual entity.

axioms (1)
  • domain assumption Interviews with software developers yield valid and transferable insights into real-world AI reliance behaviors.
    The entire framework is built directly from the 22 interviews without external quantitative validation or triangulation mentioned.
invented entities (1)
  • Reliance-Control Framework no independent evidence
    purpose: To identify and distinguish appropriate levels of reliance on AI tools by examining developer control over outputs.
    Newly proposed conceptual structure without independent empirical validation outside this study.

pith-pipeline@v0.9.0 · 5482 in / 1363 out tokens · 55089 ms · 2026-05-10T16:04:59.806583+00:00 · methodology


Reference graph

Works this paper leans on

21 extracted references · 6 canonical work pages

  1. [1]

    Jessica Y Bo, Sophia Wan, and Ashton Anderson. 2025. To rely or not to rely? evaluating interventions for appropriate reliance on large language models. In Proceedings of the Conference on Human Factors in Computing Systems. 1–23

  2. [2]

    Katherine M Collins et al. 2024. Modulating language model experiences through frictions. arXiv preprint arXiv:2407.12804 (2024)

  3. [3]

    Derek DeBellis et al. 2025. 2025 DORA State of AI-Assisted Software Development. https://dora.dev/research/ai/#state-of-ai-assisted-software-development

  4. [4]

    Aaron Drapkin. 2025. AI Gone Wrong: AI Hallucinations & Errors. https://tech.co/news/list-ai-failures-mistakes-errors

  5. [5]

    Samuel Ferino et al. 2026. Towards an Appropriate Level of Reliance on AI: A Preliminary Reliance-Control Framework for AI in Software Engineering – Supplementary Information Package. https://doi.org/10.5281/zenodo.18616305

  6. [6]

    Samuel Ferino, Rashina Hoda, John Grundy, and Christoph Treude. 2025. Novice developers’ perspectives on adopting LLMs for software development: A systematic literature review. ACM Transactions on Software Engineering and Methodology (2025)

  7. [7]

    Gaole He et al. 2023. Knowing about knowing: An illusion of human competence can hinder appropriate reliance on AI systems. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–18

  8. [8]

    Rashina Hoda. 2024. Qualitative Research with Socio-Technical Grounded Theory. Springer

  9. [9]

    Xinyi Hou et al. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

  10. [10]

    Kori Inkpen et al. 2023. Advancing human-AI complementarity: The impact of user expertise and algorithmic tuning on joint decision making. ACM Transactions on Computer-Human Interaction 30, 5 (2023), 1–29

  11. [11]

    Ranim Khojah et al. 2024. Beyond code generation: An observational study of ChatGPT usage in software engineering practice. Proceedings of the ACM on Software Engineering 1, FSE (2024), 1819–1840

  12. [12]

    Sunnie SY Kim et al. 2025. Fostering appropriate reliance on large language models: The role of explanations, sources, and inconsistencies. In Proceedings of the Conference on Human Factors in Computing Systems. 1–19

  13. [13]

    Nataliya Kosmyna et al. 2025. Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv preprint arXiv:2506.08872 (2025)

  14. [14]

    Shuai Ma et al. 2023. Who Should I Trust: AI or Myself? Leveraging Human and AI Correctness Likelihood to Promote Appropriate Trust in AI-Assisted Decision-Making. In Proceedings of the Conference on Human Factors in Computing Systems. Article 759, 19 pages. doi:10.1145/3544548.3581058

  15. [15]

    Hannah Mayer, L Yee, M Chui, and R Roberts. 2025. Superagency in the workplace: Empowering people to unlock AI’s full potential. McKinsey Digital 28 (2025)

  16. [16]

    Fairuz Meem et al. 2025. Why Do Software Practitioners Use ChatGPT for Software Development Tasks? In Proceedings of the International Conference on the Foundations of Software Engineering. 1508–1514. doi:10.1145/3696630.3731667

  17. [17]

    Qodo AI. 2025. State of AI Code Quality 2025. https://www.qodo.ai/reports/state-of-ai-code-quality/. Accessed: 2025-10-15

  18. [18]

    Cornelia Sindermann et al. 2021. Assessing the attitude towards artificial intelligence: Introduction of a short measure in German, Chinese, and English language. KI-Künstliche Intelligenz 35, 1 (2021), 109–118

  19. [19]

    Christoph Treude and Marco Gerosa. 2025. How developers interact with AI: A taxonomy of human-AI collaboration in software engineering. In 2nd International Conference on AI Foundation Models and Software Engineering. IEEE, 236–240

  20. [20]

    Thomas Weber et al. 2024. Significant productivity gains through programming with large language models. Proceedings of the ACM on Human-Computer Interaction 8, EICS (2024), 1–29

  21. [21]

    Ziyao Zhang et al. 2025. LLM hallucinations in practical code generation: Phenomena, mechanism, and mitigation. Proceedings of the ACM on Software Engineering 2, ISSTA (2025), 481–503