Recognition: unknown
A paradox of AI fluency
Pith reviewed 2026-05-07 16:11 UTC · model grok-4.3
The pith
Fluent AI users accomplish more on complex tasks by actively iterating with the system, but that engagement also makes their failures more visible; novices' failures more often go undetected.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fluent users take on more complex tasks and adopt a collaborative interaction style, iterating on goals and critically assessing outputs. This engagement produces more visible failures, which often allow partial recovery, alongside greater success on those complex tasks. Novices, by contrast, adopt a passive stance that more often results in invisible failures: conversations that appear successful but miss the mark.
What carries the argument
The contrast in interactional modes between fluent users' active, collaborative iteration and novices' passive acceptance, which determines whether failures are visible and recoverable or invisible and undetected.
If this is right
- Users should prioritize active engagement, such as iterating on goals and critiquing responses, to maximize benefits from AI on difficult tasks.
- AI product builders need to design interfaces that support and encourage deep user involvement instead of minimizing friction at all costs.
- Overall, active users will experience more apparent setbacks but achieve better outcomes on ambitious projects.
- Passive users risk completing interactions that look good but deliver less value without noticing the gap.
Where Pith is reading between the lines
- Similar dynamics could appear in other AI-assisted tools like code generation or data analysis, where engagement style affects outcome quality.
- Over time, encouraging active use might help close skill gaps by turning visible failures into learning opportunities.
- Design choices that promote passive use could inadvertently reduce the effective capabilities of the AI for many users.
- Measuring true success requires tracking not just completion rates but also alignment between stated goals and actual results.
Load-bearing premise
The detailed annotations on the large collection of chat transcripts can reliably categorize users as fluent or novice without bias, determine task complexity accurately, and classify failures correctly as visible or invisible.
What would settle it
A study that tracks actual task outcomes for users with independently verified skill levels on the same set of problems, checking whether visible failure rates and recovery correlate with fluency as described.
Original abstract
How much does a user's skill with AI shape what AI actually delivers for them? This question is critical for users, AI product builders, and society at large, but it remains underexplored. Using a richly annotated sample of 27K transcripts from WildChat-4.8M, we show that fluent users take on more complex tasks than novices and adopt a fundamentally different interactional mode: they iterate collaboratively with the AI, refining goals and critically assessing outputs, whereas novices take a passive stance. These differences lead to a paradox of AI fluency: fluent users experience more failures than novices -- but their failures tend to be visible (a direct consequence of their engagement), they are more likely to lead to partial recovery, and they occur alongside greater success on complex tasks. Novices, by contrast, more often experience invisible failures: conversations that appear to end successfully but in fact miss the mark. Taken together, these results reframe what success with AI depends on. Individuals should adopt a stance of active engagement rather than passive acceptance. AI product builders should recognize that they are designing not just model behavior but user behavior; encouraging deep engagement, rather than friction-free experiences, will lead to more success overall. Our code and data are available at https://github.com/bigspinai/bigspin-fluency-outcomes
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes 27K annotated transcripts from the WildChat dataset to argue that fluent AI users undertake more complex tasks and engage in iterative, collaborative interactions with the model, whereas novices adopt a passive stance. This leads to a 'paradox of AI fluency': fluent users encounter more failures, but these are typically visible (a direct consequence of their engagement), more likely to lead to partial recovery, and accompanied by greater success on complex tasks; novices more often experience invisible failures, in which transcripts end apparently successfully but the outcome misses the mark. The authors conclude that success depends on active user engagement and recommend that AI systems be designed to encourage deep interaction rather than friction-free experiences. Code and data are released.
Significance. If the core classifications hold, the work would usefully reframe AI outcomes as jointly determined by model behavior and user interaction style, with practical implications for interface design and user training. The public release of code and data is a clear strength that supports verification and extension. The distinction between visible and invisible failures offers a novel lens on what counts as 'success' in conversational AI.
major comments (3)
- [§3] §3 (Data and Annotation): The central claim rests on annotations that classify users as fluent versus novice, grade task complexity, and label failures as visible versus invisible. No inter-annotator agreement statistics, annotation guidelines, number of annotators, or external validation procedure are reported. Because transcripts alone contain no ground truth for unstated user goals, the visible/invisible distinction and resulting paradox cannot be verified from the provided description.
- [§4] §4 (Results): The finding that fluent users experience more failures is presented without statistical controls for task complexity. Since the same section shows fluent users select more complex tasks, the elevated failure rate may be driven by task difficulty rather than fluency per se; this threatens the interpretation that fluency itself produces the reported pattern of visible failures and recoveries.
- [§4.3] §4.3 (Failure Recovery Analysis): Claims that visible failures 'are more likely to lead to partial recovery' require explicit operational definitions of recovery and quantitative comparisons (e.g., success rates post-failure for fluent vs. novice cohorts). Without these, the differential-recovery component of the paradox remains unsupported.
minor comments (2)
- The abstract states that the sample is 'richly annotated' but does not preview the annotation schema or reliability checks; adding one sentence on these points would improve readability.
- [Figures] Figure captions should include sample sizes and any statistical tests used for the visible/invisible failure comparisons.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and will incorporate revisions to strengthen the reporting and analysis. Our responses focus on clarifying the annotation process, adding statistical controls, and providing explicit definitions and comparisons as requested.
Point-by-point responses
- Referee: [§3] §3 (Data and Annotation): The central claim rests on annotations that classify users as fluent versus novice, grade task complexity, and label failures as visible versus invisible. No inter-annotator agreement statistics, annotation guidelines, number of annotators, or external validation procedure are reported. Because transcripts alone contain no ground truth for unstated user goals, the visible/invisible distinction and resulting paradox cannot be verified from the provided description.
Authors: We agree these details require expansion. The annotations were conducted by three researchers using guidelines that operationalize fluency via observable interaction patterns (e.g., iterative refinement and critical assessment), task complexity on a 1-5 scale derived from transcript content, and visible failures as those explicitly referenced or corrected in subsequent turns. We will add the full annotation guidelines as supplementary material, report the number of annotators, compute and include inter-annotator agreement statistics, and describe our internal validation procedure. For the ground-truth concern, the visible/invisible distinction is defined strictly from transcript evidence rather than inferred external goals; we will clarify this operationalization to allow verification from the data. revision: yes
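The agreement statistics the authors promise could be reported with a standard chance-corrected measure. Below is a minimal pure-Python sketch of Fleiss' kappa for three annotators; the transcript labels and counts are invented for illustration, not drawn from the paper's data.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category-count dicts.

    ratings[i][c] = number of annotators who assigned category c to item i;
    every item must receive the same total number of ratings.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())

    # Mean observed agreement: fraction of annotator pairs agreeing per item.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n_raters)
        / (n_raters * (n_raters - 1))
        for item in ratings
    ) / n_items

    # Chance agreement from the marginal category proportions.
    totals = Counter()
    for item in ratings:
        totals.update(item)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 annotators labeling 5 transcripts fluent/novice.
items = [
    {"fluent": 3},
    {"fluent": 2, "novice": 1},
    {"novice": 3},
    {"novice": 2, "fluent": 1},
    {"fluent": 3},
]
kappa = fleiss_kappa(items)
```

Values above roughly 0.6 are conventionally read as substantial agreement; reporting kappa per label dimension (fluency, complexity, failure type) would address the referee's concern directly.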
- Referee: [§4] §4 (Results): The finding that fluent users experience more failures is presented without statistical controls for task complexity. Since the same section shows fluent users select more complex tasks, the elevated failure rate may be driven by task difficulty rather than fluency per se; this threatens the interpretation that fluency itself produces the reported pattern of visible failures and recoveries.
Authors: We acknowledge that task complexity confounds the raw failure rate. Our core claim concerns the interaction mode (active iteration vs. passive acceptance) rather than fluency in isolation, but we will add a controlled analysis in the revision. This will include a logistic regression predicting failure type and recovery outcomes with user fluency as the predictor and task complexity as a covariate, plus stratified comparisons within complexity levels. These additions will test whether the visible-failure and recovery patterns persist beyond difficulty differences. revision: yes
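The stratified comparison the authors propose can be sketched as a Mantel–Haenszel common odds ratio, which pools the fluency-failure association across task-complexity strata. The counts below are invented for illustration; the paper's actual controlled analysis may instead use the logistic regression mentioned above.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio over 2x2 tables (a, b, c, d):
    a = fluent failures, b = fluent successes,
    c = novice failures, d = novice successes.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented counts per task-complexity stratum (low, mid, high complexity).
strata = [
    (10, 90, 20, 180),  # low: same 10% failure rate in both groups
    (30, 70, 25, 75),   # mid
    (60, 40, 45, 55),   # high
]
pooled_or = mantel_haenszel_or(strata)
```

A pooled odds ratio near 1 after stratification would indicate the raw failure gap is explained by task difficulty; a ratio above 1 would indicate fluency is associated with failure even within complexity levels.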
- Referee: [§4.3] §4.3 (Failure Recovery Analysis): Claims that visible failures 'are more likely to lead to partial recovery' require explicit operational definitions of recovery and quantitative comparisons (e.g., success rates post-failure for fluent vs. novice cohorts). Without these, the differential-recovery component of the paradox remains unsupported.
Authors: We will revise this section to supply the requested details. Visible failure will be defined as any turn where the user explicitly signals dissatisfaction, requests changes, or notes an error. Recovery will be operationalized as continuation to a subsequent turn where the user indicates satisfaction or the task reaches apparent completion. We will add quantitative results comparing post-failure success/recovery rates between fluent and novice cohorts, including effect sizes and statistical tests. This will directly support the differential-recovery element of the paradox. revision: yes
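The requested quantitative comparison could take the form of a two-proportion z-test on post-failure recovery rates. A self-contained sketch follows; the cohort counts are placeholders invented for the example, not the paper's numbers.

```python
import math

def two_proportion_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions.

    Returns (risk difference, z statistic, two-sided p-value),
    using the pooled-variance standard error under H0: p1 == p2.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal-approximation p-value via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p1 - p2, z, p_value

# Invented counts: among visible failures, fluent users recover in
# 120 of 200 conversations, novices in 40 of 100.
diff, z, p = two_proportion_test(120, 200, 40, 100)
```

Reporting the risk difference alongside the test (rather than the p-value alone) gives the effect size the referee asks for.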
Circularity Check
No circularity: empirical patterns observed in independently annotated transcripts
Full rationale
The paper reports observational findings from a sample of 27K WildChat transcripts that the authors annotate for user fluency (fluent vs. novice), task complexity, and failure visibility (visible vs. invisible). The claimed paradox is simply the co-occurrence of these labels in the data: fluent users show more visible failures alongside complex-task success, while novices show more invisible failures. No equations, fitted parameters, predictions, or self-citations are invoked to derive the result; the outcome is not equivalent to the input classifications by construction. The analysis is self-contained once the annotations are accepted as given, with no reduction of the central claim to a definitional or statistical tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Annotations of the 27K transcripts can reliably classify users as fluent versus novice and label task complexity and failure types.
- domain assumption The 27K sample is representative of broader user behaviors in AI chat interactions.
Reference graph
Works this paper leans on
- [1] Anthropic. Anthropic education report: The AI fluency index. https://www.anthropic.com/research/AI-fluency-index, February 2026. Accessed: 2026-03-11.
- [2] Dale J. Barr, Roger Levy, Christoph Scheepers, and Harry J. Tily. Random effects structure in mixed-effects models: Keep it maximal. Journal of Memory and Language, 68(3):255–278, August 2011.
- [3] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015. doi:10.18637/jss.v067.i01.
- [4] Erik Brynjolfsson, Bharat Chandar, and Ruyu Chen. Canaries in the coal mine? Six facts about the recent employment effects of artificial intelligence. Working paper, Stanford Digital Economy Lab, November 2025. URL https://digitaleconomy.stanford.edu/app/uploads/2025/11/CanariesintheCoalMine_Nov25.pdf. Accessed: 2026-03-11.
- [5] Erik Brynjolfsson, Danielle Li, and Lindsey Raymond. Generative AI at work. The Quarterly Journal of Economics, 140(2):889–942, February 2025. ISSN 0033-5533. doi:10.1093/qje/qjae044. URL https://doi.org/10.1093/qje/qjae044.
- [6] John A. Bullinaria and Joseph P. Levy. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526, 2007.
- [7] Aaron Chatterji, Thomas Cunningham, David J. Deming, Zoë Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. How people use ChatGPT. Working Paper 34255, National Bureau of Economic Research, September 2025. URL https://www.nber.org/papers/w34255.
- [8] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29, 1990. URL https://aclanthology.org/J90-1003/.
- [9] Herbert H. Clark, Robert Schreuder, and Samuel Buttrick. Common ground and the understanding of demonstrative reference. Journal of Verbal Learning and Verbal Behavior, 22(2):245–258, 1983.
- [10] Zheyuan Kevin Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers. Management Science, 2026. doi:10.1287/mnsc.2025.00535. URL https://doi.org/10.1287/mnsc.2025.00535. Articles in Advance, published February 2026.
- [11] Rick Dakan and Joseph Feller. Framework for AI fluency (practical summary document), version 1.1, 2025. URL https://ringling.libguides.com/ai/framework. Retrieved April 20, 2026.
- [12] Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: Labor market impact potential of LLMs. Science, 384(6702):1306–1308, 2024. doi:10.1126/science.adj0998. URL https://www.science.org/doi/abs/10.1126/science.adj0998.
- [13] Daniel Fried, Nicholas Tomlin, Jennifer Hu, Roma Patel, and Aida Nematzadeh. Pragmatics in language grounding: Phenomena, tasks, and modeling approaches. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 12619–12640, Singapore, December 2023. Association for Computational Linguistics.
- [14] H. Paul Grice. Logic and conversation. In Peter Cole and Jerry Morgan (eds.), Syntax and Semantics, volume 3: Speech Acts, pp. 43–58. Academic Press, New York, 1975.
- [15] Wes Gurnee and Max Tegmark. Language models represent space and time, 2024. URL https://arxiv.org/abs/2310.02207.
- [16] Sanchaita Hazra, Bodhisattwa Prasad Majumder, and Tuhin Chakrabarty. AI safety should prioritize the future of work, 2025. URL https://arxiv.org/abs/2504.13959.
- [17] Yueqiao Jin, Roberto Martinez-Maldonado, Dragan Gašević, and Lixiang Yan. GLAT: The generative AI literacy assessment test, 2024. URL https://arxiv.org/abs/2411.00283.
- [18] Yueqiao Jin, Kaixun Yang, Roberto Martinez-Maldonado, Dragan Gašević, and Lixiang Yan. The agency gap: How generative AI literacy shapes independent writing after AI support, 2025. URL https://arxiv.org/abs/2507.04398.
- [19] Jeffrey Jones. On the measurement of AI literacy among students in higher education: A scoping review. International Journal of AI in Pedagogy, Innovation, and Learning Futures, 2026(1), February 2026. doi:10.46787/ijaipil.v2026i1.6920. URL https://journals.calstate.edu/ijaipil/article/view/6920.
- [20] Hao-Ping (Hank) Lee, Advait Sarkar, Lev Tankelevitch, Ian Drosos, Sean Rintel, Richard Banks, and Nicholas Wilson. The impact of generative AI on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, CHI '25...
- [21] Kenneth Li, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Emergent world representations: Exploring a sequence model trained on a synthetic task, 2024. URL https://arxiv.org/abs/2210.13382.
- [22] Ning Li, Wenming Deng, and Jiatan Chen. From G-Factor to A-Factor: Establishing a psychometric framework for AI literacy, 2025. URL https://arxiv.org/abs/2503.16517.
- [23] Xin Liu, Longxin Zhang, and Xiaochong Wei. Generative artificial intelligence literacy: Scale development and its effect on job performance. Behavioral Sciences, 15(6), 2025. ISSN 2076-328X. doi:10.3390/bs15060811. URL https://www.mdpi.com/2076-328X/15/6/811.
- [24] Ryan Louie, Raj Sanjay Shah, Ifdita Hasan Orney, Juan Pablo Pacheco, Emma Brunskill, and Diyi Yang. Can LLM-simulated practice and feedback upskill human counselors? A randomized study with 90+ novice counselors. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2026. Association for Computing Machinery.
- [25] Christine Machovec, Michael J. Rieley, and Emily Rolen. Incorporating AI impacts in BLS employment projections: Occupational case studies. Monthly Labor Review, February 2025. doi:10.21916/mlr.2025.1. URL https://www.bls.gov/opub/mlr/2025/article/incorporating-ai-impacts-in-bls-employment-projections.htm. Accessed: 2026-03-11.
- [26] Marina Mancoridis, Bec Weeks, Keyon Vafa, and Sendhil Mullainathan. Potemkin understanding in large language models, 2025. URL https://arxiv.org/abs/2506.21521.
- [27] Christopher Potts and Moritz Sudhof. Invisible failures in human–AI interactions. arXiv:2603.15423, 2026. URL https://arxiv.org/abs/2603.15423.
- [28] James Prather, Brent Reeves, Juho Leinonen, Stephen MacNeil, Arisoa S. Randrianasolo, Brett Becker, Bailey Kimmel, Jared Wright, and Ben Briggs. The widening gap: The benefits and harms of generative AI for novice programmers, 2024. URL https://arxiv.org/abs/2405.17739.
- [29] Maria Rosala. How AI literacy shapes GenAI use. Nielsen Norman Group, February 2026. URL https://www.nngroup.com/articles/ai-literacy/. Retrieved April 22, 2026.
- [30] Adrian Ryser, Florian Allwein, and Tim Schlippe. Calibrated trust in dealing with LLM hallucinations: A qualitative study, 2025. URL https://arxiv.org/abs/2512.09088.
- [31] Yijia Shao, Vinay Samuel, Yucheng Jiang, John Yang, and Diyi Yang. Collaborative Gym: A framework for enabling and evaluating human-agent collaboration, 2025. URL https://arxiv.org/abs/2412.15701.
- [32] Yijia Shao, Humishka Zope, Yucheng Jiang, Jiaxin Pei, David Nguyen, Erik Brynjolfsson, and Diyi Yang. Future of work with AI agents: Auditing automation and augmentation potential across the U.S. workforce, 2026. URL https://arxiv.org/abs/2506.06576.
- [33] Polina Tsvilodub, Jan-Felix Klumpp, Amir Mohammadpour, Jennifer Hu, and Michael Franke. On emergent social world models: Evidence for functional integration of theory of mind and pragmatic reasoning in language models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026. URL https://arxiv.org/abs/2602.10298.
- [34] Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (eds.), Advances in Neural Information Processing Systems, volume 37, pp. 26941–26975. Curran Associates, Inc., 2024.
- [35] Keyon Vafa, Peter G. Chang, Ashesh Rambachan, and Sendhil Mullainathan. What has a foundation model found? Using inductive bias to probe for world models, 2025. URL https://arxiv.org/abs/2507.06952.
- [36] Views4You. 2025 AI tools usage statistics: ChatGPT, Claude, Grok, Perplexity, DeepSeek & Gemini, 2025. URL https://views4you.com/ai-tools-usage-statistics-report-2025/. Accessed: 2026-03-07.
- [37] Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, and Diyi Yang. How do AI agents do human work? Comparing AI and human workflows across diverse occupations. arXiv preprint arXiv:2510.22780, 2025.
- [38] Veith Weilnhammer, Kevin YC Hou, Lennart Luettgau, Christopher Summerfield, Raymond Dolan, and Matthew M Nour. Vulnerability-amplifying interaction loops: A systematic failure mode in AI chatbot mental-health interactions, 2026. URL https://arxiv.org/abs/2602.01347.
- [39] Wenjia Yan, Yu li Liu, Valeriia Mamaeva, Fang Dong, Guannan Tao, Rubing Li, and Heng Yang. Generative AI literacy: Scale development and its influence on privacy protection behaviors and information verification behaviors. Telecommunications Policy, 50(2):103117, 2026. ISSN 0308-5961. doi:10.1016/j.telpol.2025.103117.
- [40] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild, 2024. URL https://arxiv.org/abs/2405.01470.