pith. machine review for the scientific record.

arxiv: 2604.05166 · v1 · submitted 2026-04-06 · 💻 cs.HC · cs.AI

Recognition: 1 theorem link · Lean Theorem

From Use to Oversight: How Mental Models Influence User Behavior and Output in AI Writing Assistants

AJung Moon, Alexandra Olteanu, Q. Vera Liao, Shalaleh Rismani, Su Lin Blodgett

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:51 UTC · model grok-4.3

classification: 💻 cs.HC · cs.AI
keywords: mental models · AI writing assistants · user oversight · control behavior · usability · grammatical errors · human-AI interaction · trust in AI

The pith

Structural mental models of AI writing assistants improve understanding and usability ratings but lead to more grammatical errors in user outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how users form mental models of AI writing tools—focusing either on what the system does or on how it works internally—and how those models shape oversight during writing tasks. Researchers primed participants with different system descriptions, then had them compose cover letters using an assistant that inserted occasional ungrammatical suggestions. Those given structural descriptions understood the tool better and rated it more usable, yet left more errors uncorrected in their final letters. This pattern shows that deeper technical insight does not automatically produce more careful editing or higher-quality results when AI outputs can be flawed.

Core claim

Participants primed with structural descriptions of the AI writing assistant demonstrated a better understanding of the system, judged the system as more usable, and produced cover letters containing more grammatical errors than participants primed with functional descriptions, even though the AI occasionally supplied ungrammatical suggestions.

What carries the argument

Priming via functional versus structural system descriptions to induce distinct mental models that then guide control behaviors such as requesting, accepting, or editing AI suggestions.
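As a concrete sketch of those control behaviors (illustrative only; the event names, fields, and acceptance-ratio definition below are assumptions, not the authors' instrumentation), one could log each suggestion interaction and aggregate per participant:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    REQUESTED = "requested"  # participant explicitly asked for a suggestion
    ACCEPTED = "accepted"    # suggestion inserted into the letter verbatim
    EDITED = "edited"        # suggestion inserted, then modified
    REJECTED = "rejected"    # suggestion dismissed

@dataclass
class SuggestionEvent:
    participant_id: str
    action: Action
    ungrammatical: bool  # True for the preconfigured flawed suggestions

def control_measures(events: list[SuggestionEvent]) -> dict:
    """Aggregate one participant's events into control-behavior measures."""
    shown = [e for e in events if e.action is not Action.REQUESTED]
    accepted = [e for e in shown if e.action is Action.ACCEPTED]
    return {
        "requests": sum(e.action is Action.REQUESTED for e in events),
        "acceptance_ratio": len(accepted) / len(shown) if shown else 0.0,
        "edits": sum(e.action is Action.EDITED for e in events),
        # Oversight proxy: planted ungrammatical suggestions taken verbatim.
        "flawed_accepted": sum(e.ungrammatical for e in accepted),
    }
```

Under this framing, the backfiring effect would surface as a higher flawed-acceptance count in the structural condition even when requests and edits look comparable across conditions.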

If this is right

  • Structural understanding can raise usability perceptions while lowering actual error detection in AI-assisted writing.
  • Functional descriptions may support more active editing of AI suggestions and fewer uncorrected mistakes.
  • System explanations given to users can shape both trust and oversight behavior in error-prone AI tools.
  • Design choices about how much internal detail to reveal affect output quality beyond simple comprehension gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Designers may need to add explicit error-checking prompts when supplying structural details about an AI system.
  • The backfiring pattern could appear in other oversight-heavy domains such as AI code completion or report generation.
  • Repeated use of structural explanations might gradually lower user vigilance if people come to feel they fully 'know' the tool.

Load-bearing premise

The system descriptions successfully created different mental models, and the rise in grammatical errors reflects reduced oversight rather than differences in participants' writing skill or attention.

What would settle it

A follow-up study that measures each participant's actual mental model after priming and separately assesses their baseline writing proficiency to check whether error differences remain when skill is held constant.
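A minimal sketch of that follow-up analysis, assuming a hypothetical per-participant table (condition, a post-priming mental-model score, a baseline writing score, and the final error count; the file and column names are invented here):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, for illustration only. Expected columns:
# condition ("functional" / "structural"), mental_model_score,
# baseline_skill, error_count.
df = pd.read_csv("followup_participants.csv")

# Manipulation check: did priming actually shift measured mental models?
print(df.groupby("condition")["mental_model_score"].describe())

# ANCOVA-style model: condition effect on errors with skill held constant.
model = smf.ols("error_count ~ C(condition) + baseline_skill", data=df).fit()
print(model.summary())
```

If the condition coefficient remains positive and significant with baseline_skill in the model, the error gap cannot be attributed to differences in writing ability.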

Figures

Figures reproduced from arXiv: 2604.05166 by AJung Moon, Alexandra Olteanu, Q. Vera Liao, Shalaleh Rismani, Su Lin Blodgett.

Figure 1. This figure illustrates the experimental design and procedure. White boxes indicate stages where participants …
Figure 2. A screenshot of the AI-based writing assistant with the text editor where the main letter is written in the center, …
Figure 3. Bar graphs of mean values (with error bars) for five control behavior measures—suggestion requests, acceptance ratio, …
Figure 4. Bar graphs of mean values, with error bars, for calibrated corrections identified by Grammarly in final letters, error …
Figure 5. Bar graph of mean values (and error bars) for rubric-based cover letter quality ratings across dimensions of relevance, …
Figure 6. Bar graphs of mean values (and error bars) for self-reported ratings across conditions for usefulness, ease of use, …
Original abstract

AI-based writing assistants are ubiquitous, yet little is known about how users' mental models shape their use. We examine two types of mental models -- functional or related to what the system does, and structural or related to how the system works -- and how they affect control behavior -- how users request, accept, or edit AI suggestions as they write -- and writing outcomes. We primed participants ($N = 48$) with different system descriptions to induce these mental models before asking them to complete a cover letter writing task using a writing assistant that occasionally offered preconfigured ungrammatical suggestions to test whether the mental models affected participants' critical oversight. We find that while participants in the structural mental model condition demonstrate a better understanding of the system, this can have a backfiring effect: while these participants judged the system as more usable, they also produced letters with more grammatical errors, highlighting a complex relationship between system understanding, trust, and control in contexts that require user oversight of error-prone AI outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reports an empirical user study (N=48) in which participants were primed with functional or structural descriptions of an AI writing assistant before completing a cover-letter task that included occasional preconfigured ungrammatical suggestions. The central claim is that structural mental models produce better system understanding and higher usability ratings yet also yield letters containing more grammatical errors, interpreted as a backfiring effect on critical oversight of AI output.

Significance. If the causal link between mental-model induction, oversight behavior, and error rates can be isolated, the work would usefully extend HCI research on human-AI collaboration by showing that greater system transparency can sometimes reduce rather than increase user scrutiny. It offers a concrete, falsifiable prediction about the relationship between understanding, trust, and control that could inform the design of oversight-supporting interfaces.

major comments (3)
  1. The experimental design provides no pre-task measure of participants' baseline writing or editing skill. Because the key dependent variable is the number of grammatical errors in the final letter, any between-condition difference could be driven by uneven distribution of writing ability across the small N=48 sample rather than by changes in oversight behavior.
  2. No attention checks, manipulation checks for the mental-model induction, or correlational analyses linking error counts to observable control actions (e.g., acceptance rate of the deliberately ungrammatical suggestions) are described. Without such evidence, the interpretation that structural-model participants exercised less critical oversight remains unanchored.
  3. The manuscript reports no statistical details (effect sizes, confidence intervals, power analysis, or exact tests) for the claimed differences in error rates and usability judgments. With a modest sample and multiple dependent measures, these omissions make it impossible to assess whether the backfiring effect is robust or an artifact of the particular analysis choices.
minor comments (1)
  1. The abstract and methods would benefit from explicit operational definitions of 'functional' versus 'structural' mental models and of how grammatical errors were counted and verified.
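To make major comment 3 concrete, here is a minimal reporting sketch for one between-condition comparison, with a Holm correction across the study's multiple measures. All numbers below are placeholders, not the paper's data:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
functional = rng.poisson(2.0, 24).astype(float)  # placeholder error counts
structural = rng.poisson(3.5, 24).astype(float)  # (24 per condition, N = 48)

# Exact test: Welch's t-test for the error-count difference.
t, p = stats.ttest_ind(structural, functional, equal_var=False)

# Effect size: Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt((functional.var(ddof=1) + structural.var(ddof=1)) / 2)
d = (structural.mean() - functional.mean()) / pooled_sd

# 95% CI for the mean difference via the Welch standard error.
diff = structural.mean() - functional.mean()
se = np.sqrt(functional.var(ddof=1) / 24 + structural.var(ddof=1) / 24)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")

# Holm correction across the dependent measures (placeholder p-values).
pvals = [p, 0.03, 0.20, 0.01]
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(dict(zip(["errors", "usability", "requests", "understanding"], p_adj)))
```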

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating where we agree and the revisions we will make to improve the manuscript.

Point-by-point responses
  1. Referee: The experimental design provides no pre-task measure of participants' baseline writing or editing skill. Because the key dependent variable is the number of grammatical errors in the final letter, any between-condition difference could be driven by uneven distribution of writing ability across the small N=48 sample rather than by changes in oversight behavior.

    Authors: We agree this is a valid limitation given the sample size. Although participants were randomly assigned to conditions (which should balance baseline differences on average), we lack direct evidence of balance without pre-measures. In the revision we will add an explicit discussion of this limitation in the paper and note that future studies should include pre-task writing assessments. The reported differences in system understanding and usability ratings (which are less directly tied to writing skill) provide some convergent support for the mental-model effects. revision: partial

  2. Referee: No attention checks, manipulation checks for the mental-model induction, or correlational analyses linking error counts to observable control actions (e.g., acceptance rate of the deliberately ungrammatical suggestions) are described. Without such evidence, the interpretation that structural-model participants exercised less critical oversight remains unanchored.

    Authors: The manuscript already reports that structural-condition participants demonstrated better system understanding; we will revise the text to present this more explicitly as evidence of successful mental-model induction. We did not include formal attention checks in the original protocol and will note this as a limitation. To better anchor the oversight interpretation, we will add correlational analyses in the revision examining the relationship between acceptance rates of the preconfigured ungrammatical suggestions and final grammatical error counts. revision: yes

  3. Referee: The manuscript reports no statistical details (effect sizes, confidence intervals, power analysis, or exact tests) for the claimed differences in error rates and usability judgments. With a modest sample and multiple dependent measures, these omissions make it impossible to assess whether the backfiring effect is robust or an artifact of the particular analysis choices.

    Authors: We agree that fuller statistical reporting is necessary. In the revised manuscript we will add effect sizes, 95% confidence intervals, exact p-values, and a post-hoc power analysis for the key comparisons between conditions. We will also clarify the statistical tests used and address any multiple-comparison considerations. revision: yes
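A sketch of the two analyses promised above, under the same caveats: the file and column names are hypothetical, and the power routine is the standard statsmodels one, not anything taken from the paper.

```python
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.power import TTestIndPower

# Hypothetical per-participant table; column names are invented here:
# flawed_acceptance_rate = share of preconfigured ungrammatical suggestions
# taken verbatim; error_count = grammatical errors in the final letter.
df = pd.read_csv("participants.csv")

# Anchoring the oversight interpretation: errors should track acceptance
# of the planted flaws if reduced scrutiny is the mechanism.
rho, p = spearmanr(df["flawed_acceptance_rate"], df["error_count"])
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")

# Sensitivity analysis: smallest Cohen's d detectable at 80% power with
# n = 24 per condition and two-sided alpha = .05.
detectable_d = TTestIndPower().solve_power(
    nobs1=24, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Detectable effect size at 80% power: d = {detectable_d:.2f}")
```

A correlation between flawed-suggestion acceptance and final error counts would tie the error measure to observable control actions; the sensitivity figure clarifies what effect sizes the N = 48 design could plausibly detect.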

Circularity Check

0 steps flagged

No circularity: empirical user study with no derivations or self-referential reductions

Full rationale

The paper is a controlled user study (N=48) that primes participants with system descriptions to induce functional vs. structural mental models, then measures control behaviors and writing outcomes on a cover-letter task. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. All claims rest on direct experimental observations (e.g., error counts, usability ratings) rather than any definitional equivalence, self-citation chain, or renaming of inputs as outputs. The central finding—that structural priming yields better system understanding yet more grammatical errors—is presented as an empirical result, not a logical necessity derived from the study design itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that the priming manipulation created the intended mental models and that observed error differences reflect oversight behavior.

axioms (1)
  • domain assumption: Priming with different system descriptions induces distinct functional versus structural mental models
    The experimental design attributes behavioral differences to mental models induced by the descriptions.

pith-pipeline@v0.9.0 · 5490 in / 1178 out tokens · 37641 ms · 2026-05-10T18:51:24.871937+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
