pith. machine review for the scientific record.

arxiv: 2604.07167 · v1 · submitted 2026-04-08 · 💻 cs.HC

Recognition: 2 theorem links · Lean Theorem

Critical Inker: Scaffolding Critical Thinking in AI-Assisted Writing Through Socratic Questioning

Pattie Maes, Philipp Hugenroth, Valdemar Danry

Pith reviewed 2026-05-10 17:19 UTC · model grok-4.3

classification 💻 cs.HC
keywords AI-assisted writing · Socratic questioning · critical thinking · logical validity · argument extraction · writing tools · cognitive scaffolding

The pith

Critical Inker uses Socratic questions and visual highlights to help writers identify and correct logical errors in AI-assisted drafts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Critical Inker to counter the risk that users will offload critical thinking to LLMs when those models handle writing tasks. It describes two concrete methods: a Socratic chatbot that poses targeted questions to surface logical flaws and visual feedback that marks those flaws directly in the text. The authors report that the underlying argument extraction reaches 91.2 percent overlap with human ground-truth annotations and that validity judgments achieve 87 percent accuracy. A small pilot study supplies initial qualitative observations on how the feedback influences reflection during composition. A sympathetic reader would care because the approach offers a practical way to keep human reasoning active even as automation of text production increases.
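The 91.2 percent figure is an overlap between machine-extracted and human-annotated argument spans. The paper's exact matching rule is not restated here; the sketch below uses token-level Jaccard overlap as one common way such agreement could be scored. The matching rule and the example spans are illustrative assumptions, not the authors' implementation.

```python
def token_overlap(pred_span: str, gold_span: str) -> float:
    """Jaccard overlap between two argument spans, on lowercase tokens.

    Hypothetical metric: the paper reports 91.2% argument overlap, but
    this particular matching rule is an assumption, not the authors'.
    """
    pred = set(pred_span.lower().split())
    gold = set(gold_span.lower().split())
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def mean_overlap(pairs):
    """Average overlap across (predicted, ground-truth) span pairs."""
    scores = [token_overlap(p, g) for p, g in pairs]
    return sum(scores) / len(scores)


# Toy example: extraction recovers most, but not all, of each gold span.
pairs = [
    ("remote work raises productivity", "remote work raises productivity"),
    ("commutes waste time", "long commutes waste employee time"),
]
print(round(mean_overlap(pairs), 3))  # → 0.8
```

A score of 1.0 means the extracted span matches the annotation token for token; partial recovery, as in the second pair, pulls the average down.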

Core claim

Critical Inker implements a Socratic chatbot and a visual feedback layer that together analyze a writer's text for its underlying arguments and logical validity, then deliver either questions that prompt the user to recognize and fix errors or direct highlights that flag those errors without requiring conversation. The system is built on technical components for argument extraction and validity checking, which are evaluated against ground-truth annotations, and is further explored through a small-scale pilot that gathers early user reactions.
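The two delivery modes share one analysis backend: extract arguments, judge validity, then render the result either as a Socratic question or as a direct highlight. A minimal schematic of that flow, with the LLM-based validity judgment stubbed out as a trivial circularity rule; every name and the stub rule are hypothetical, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    claim: str        # the conclusion the writer asserts
    premise: str      # the stated support for it
    valid: bool       # did the validity check pass?


def check_validity(claim: str, premise: str) -> bool:
    """Stand-in for the LLM-based validity judgment (assumption).

    Flags the argument when the premise merely restates the claim
    (a circularity); the real system performs full logical analysis.
    """
    return premise.strip().lower() != claim.strip().lower()


def feedback(arg: Argument, mode: str) -> str:
    """Render one flagged argument in either feedback mode."""
    if arg.valid:
        return ""
    if mode == "socratic":  # question that prompts the writer to notice
        return (f"Does '{arg.premise}' give an independent reason "
                f"to accept '{arg.claim}'?")
    return f"HIGHLIGHT: '{arg.premise}' -> '{arg.claim}'"  # visual mode


arg = Argument(claim="The policy works", premise="the policy works", valid=False)
arg.valid = check_validity(arg.claim, arg.premise)
print(feedback(arg, "socratic"))
```

The design point the paper leans on is that only the last step differs between modes: the same flagged argument yields either a question (conversation required) or a highlight (no dialog).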

What carries the argument

Socratic chatbot paired with visual feedback that performs logical analysis on the text to surface errors through questions or highlights.

Load-bearing premise

The reported accuracy numbers and pilot observations will continue to support effective scaffolding when the tool is used by many different writers on varied, real-world documents.

What would settle it

A larger user study that measures whether participants using Critical Inker produce drafts with fewer logical errors or show greater independent detection of flaws compared with participants using standard AI writing assistance.
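If such a study were run, the headline comparison could be as simple as a permutation test on per-participant logical-error counts. A stdlib-only sketch on made-up data; the group sizes, the counts, and the one-sided direction are all assumptions, not results from the paper.

```python
import random


def perm_test_mean_diff(treatment, control, n_iter=10_000, seed=0):
    """One-sided permutation test: is mean(treatment) < mean(control)?

    Estimates the p-value of the observed difference in mean
    logical-error counts under random relabeling of participants.
    """
    rng = random.Random(seed)
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(treatment) + list(control)
    k = len(treatment)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff <= observed:
            hits += 1
    return hits / n_iter


# Hypothetical per-participant error counts, not data from the paper.
critical_inker = [1, 0, 2, 1, 1, 0, 2, 1]
plain_ai = [3, 2, 4, 2, 3, 5, 2, 3]
p = perm_test_mean_diff(critical_inker, plain_ai)
print(p < 0.05)
```

A permutation test avoids distributional assumptions that small error-count samples rarely satisfy, which matters at the pilot scales discussed here.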

Figures

Figures reproduced from arXiv: 2604.07167 by Pattie Maes, Philipp Hugenroth, Valdemar Danry.

Figure 1. Methods for critical reflection: Visual feedback (Left) shows the argument structure directly with validity checking.
Figure 2. Socratic Chatbot: On successful Socratic scaffolding, the system converts the user's verbalized intention into an …
Figure 3. Structural accuracy across four LLMs compared to …
Figure 4. Latency comparison: Claude Sonnet 4.5 shows sig…
read the original abstract

As Large Language Models (LLMs) increasingly automate writing tasks, there is a growing risk of cognitive deskilling where users offload critical thinking to the system. To address this, we introduce Critical Inker, a writing tool designed to scaffold critical reflection during writing through logical analysis and socratic feedback. We present two methods: (1) A Socratic chatbot using questions to help them realize and fix logical errors in their writing and (2) Visual Feedback, which highlights logical errors in the text without dialog. We detail the technical implementation of the system and evaluate its argument extraction and logical validity accuracy. Our evaluation shows a 91.2% argument overlap with ground truth argument annotations and 87% validity accuracy. Finally, we conducted a small-scale pilot and discuss early qualitative results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Critical Inker, a writing assistance tool that combines a Socratic chatbot for questioning logical errors and visual feedback highlighting such errors to scaffold critical thinking and mitigate cognitive deskilling from LLM use. It describes the system's technical implementation for argument extraction and validity checking, reports evaluation results showing 91.2% argument overlap with ground-truth annotations and 87% validity accuracy, and presents a small-scale pilot study with qualitative results on user experience.

Significance. If the scaffolding claim holds, the work addresses a timely HCI concern about over-reliance on AI writing tools by offering concrete mechanisms for prompting user reflection. The reported technical accuracies indicate a functional backend for argument analysis that could serve as a foundation for future systems, though the absence of direct outcome measures limits immediate impact.

major comments (2)
  1. [Abstract] Abstract and evaluation description: The central claim that Socratic questioning and visual feedback scaffold critical thinking is not supported by evidence linking the feedback mechanisms to user outcomes. The 91.2% argument overlap and 87% validity accuracy validate only the extraction and checking modules against annotations; no pre/post measures of argument quality, revision depth, or critical-thinking rubric scores are reported to establish the causal scaffolding effect.
  2. [Pilot Study] Pilot study description: The small-scale pilot supplies only qualitative anecdotes without quantitative metrics (e.g., blinded ratings of critical reflection or controlled comparison of writing with vs. without the feedback), leaving the generalization of the scaffolding benefit untested beyond the technical accuracies.
minor comments (1)
  1. [Evaluation] Evaluation details such as test-set size, inter-annotator agreement, error analysis, and statistical tests are not provided, which would improve assessment of the reported 91.2% and 87% figures.
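The missing test-set size matters because a point estimate like 87 percent is only as informative as its interval. A stdlib sketch of the 95% Wilson score interval at two hypothetical sizes; n = 100 and n = 1000 are assumptions for illustration, not figures from the paper.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half


# The same 87% validity accuracy at two hypothetical test-set sizes.
for n in (100, 1000):
    lo, hi = wilson_interval(0.87, n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}]")
```

At n = 100 the interval spans roughly 13 percentage points; at n = 1000 it narrows to about 4, which is why the referee's request for the test-set size is more than bookkeeping.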

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for highlighting the distinction between technical validation and evidence of user-level scaffolding effects. We address each point below and will revise the manuscript to better align claims with the reported results.

read point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation description: The central claim that Socratic questioning and visual feedback scaffold critical thinking is not supported by evidence linking the feedback mechanisms to user outcomes. The 91.2% argument overlap and 87% validity accuracy validate only the extraction and checking modules against annotations; no pre/post measures of argument quality, revision depth, or critical-thinking rubric scores are reported to establish the causal scaffolding effect.

    Authors: We agree that the reported 91.2% argument overlap and 87% validity accuracy evaluate only the backend modules for argument extraction and logical validity checking. The manuscript positions Critical Inker as a system that implements Socratic questioning and visual feedback to scaffold reflection, supported by the technical feasibility of these components and qualitative observations from the pilot. No pre/post or controlled outcome measures of critical thinking are present. We will revise the abstract and evaluation summary to state that the work demonstrates a functional implementation with technical accuracies and preliminary qualitative user feedback, while explicitly noting the absence of direct causal evidence on critical-thinking outcomes. revision: yes

  2. Referee: [Pilot Study] Pilot study description: The small-scale pilot supplies only qualitative anecdotes without quantitative metrics (e.g., blinded ratings of critical reflection or controlled comparison of writing with vs. without the feedback), leaving the generalization of the scaffolding benefit untested beyond the technical accuracies.

    Authors: The manuscript already describes the pilot as small-scale and limited to qualitative results. We did not perform blinded ratings, pre/post assessments, or controlled comparisons, which would require a separate experimental design. We will expand the pilot study and limitations sections to clarify its exploratory purpose, report the number of participants and session details more precisely, and outline the need for future controlled studies to quantify effects on revision depth and critical reflection. revision: yes

standing simulated objections not resolved
  • Quantitative pre/post measures or controlled comparisons demonstrating causal effects on critical thinking or argument quality, as these require new user studies outside the scope of the current system-description and pilot work.

Circularity Check

0 steps flagged

No circularity; evaluations use external ground truth and qualitative pilot

full rationale

The paper presents a system implementation for argument extraction, validity checking, Socratic chatbot, and visual feedback. It reports 91.2% argument overlap and 87% validity accuracy measured against separate ground-truth annotations on documents, plus qualitative observations from a small pilot study. No equations, parameter fitting, self-citations, or uniqueness theorems are used in any load-bearing derivation. The technical accuracies are independent benchmarks, and the scaffolding claim rests on descriptive system design plus pilot anecdotes rather than any reduction to fitted inputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on standard assumptions about argument extraction and logical validity being computable from text, plus the domain assumption that users will engage with Socratic prompts. No free parameters are introduced, and the only invented entity is the Critical Inker system itself.

axioms (1)
  • domain assumption Argument structures and logical validity can be reliably extracted and judged from natural-language writing drafts by current LLMs.
    Invoked when the system performs argument extraction and validity checking to generate feedback.
invented entities (1)
  • Critical Inker system (no independent evidence)
    purpose: Scaffolding tool that combines Socratic chatbot and visual feedback for logical error correction in writing.
    The paper introduces and names this specific integrated tool.

pith-pipeline@v0.9.0 · 5440 in / 1244 out tokens · 27870 ms · 2026-05-10T17:19:37.652364+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

51 extracted references · 11 canonical work pages

  1. [1] Writing is thinking. Nature Reviews Bioengineering 3, 6 (2025), 431. doi:10.1038/s44222-025-00323-4

  2. [2] Dhruv Agarwal, Mor Naaman, and Aditya Vashistha. 2025. AI suggestions homogenize writing toward western styles and diminish cultural nuances. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–21

  3. [3] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics

  4. [4] Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. doi:10.1191/1478088706qp063oa

  5. [5] Virginia Braun and Victoria Clarke. 2019. Reflecting on reflexive thematic analysis. Qualitative Research in Sport, Exercise and Health 11, 4 (2019), 589–597. doi:10.1080/2159676x.2019.1628806

  6. [6] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z Gajos. 2021. To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making. Proceedings of the ACM on Human-Computer Interaction 5, CSCW1 (2021), 1–21

  7. [7] Krzysztof Budzyń, Marcin Romańczyk, Diana Kitala, Paweł Kołodziej, Marek Bugajski, Hans O Adami, Johannes Blom, Marek Buszkiewicz, Natalie Halvorsen, Cesare Hassan, et al. 2025. Endoscopist deskilling risk after exposure to artificial intelligence in colonoscopy: a multicentre, observational study. The Lancet Gastroenterology & Hepatology 10, 10 (2025), 896–903

  8. [8] John M Carroll and Caroline Carrithers. 1984. Training wheels in a user interface. Commun. ACM 27, 8 (1984), 800–806

  9. [9] Michelene TH Chi, Miriam Bassok, Matthew W Lewis, Peter Reimann, and Robert Glaser. 1989. Self-explanations: How students study and use examples in learning to solve problems. Cognitive Science 13, 2 (1989), 145–182

  10. [10] Michelene TH Chi, Nicholas De Leeuw, Mei-Hung Chiu, and Christian LaVancher. 1994. Eliciting self-explanations improves understanding. Cognitive Science 18, 3 (1994), 439–477

  11. [11] Valdemar Danry, Pat Pataranutaporn, Yaoli Mao, and Pattie Maes. 2023. Don't just tell me, ask me: AI systems that intelligently frame explanations as questions improve human logical discernment accuracy over causal AI explanations. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–13

  12. [12] Yizhou Fan, Luzhen Tang, Huixiao Le, Kejie Shen, Shufang Tan, Yueying Zhao, Yuan Shen, Xinyu Li, and Dragan Gašević. 2025. Beware of metacognitive laziness: Effects of generative artificial intelligence on learning motivation, processes, and performance. British Journal of Educational Technology 56, 2 (2025), 489–530

  13. [13] Radha Firaina and Dwi Sulisworo. 2023. Exploring the usage of ChatGPT in higher education: Frequency and impact on productivity. Buletin Edukasi Indonesia 2, 01 (2023), 39–46

  14. [14] Linda Flower and John R Hayes. 1981. A cognitive process theory of writing. College Composition & Communication 32, 4 (1981), 365–387

  15. [15] Emily R. Fyfe and Bethany Rittle-Johnson. 2015. Feedback both helps and hinders learning: The causal role of prior knowledge. Journal of Educational Psychology 108, 1 (2015), 82–97. doi:10.1037/edu0000053

  16. [16] Maurice Jakesch, Advait Bhat, Daniel Buschek, Lior Zalmanson, and Mor Naaman. 2023. Co-writing with opinionated language models affects users' views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–15

  17. [17] Jiwon Jeong, Hyeju Jang, and Hogun Park. 2025. Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation. arXiv preprint arXiv:2503.23363 (2025)

  18. [18] Nataliya Kosmyna, Eugene Hauptmann, Ye Tong Yuan, Jessica Situ, Xian-Hao Liao, Ashly Vivian Beresnitzky, Iris Braunstein, and Pattie Maes. 2025. Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv preprint arXiv:2506.08872 (2025)

  19. [19] Nguyen-Thinh Le. 2019. How do technology-enhanced learning tools support critical thinking? In Frontiers in Education, Vol. 4. Frontiers Media SA, 126

  20. [20] Mina Lee, Percy Liang, and Qian Yang. 2022. CoAuthor: Designing a human-AI collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–19

  21. [21] Kaveh Eskandari Miandoab, Katharine Kowalyshyn, Kabir Pamnani, Anesu Gavhera, Vasanth Sarathy, and Matthias Scheutz. 2025. IntelliProof: An Argumentation Network-based Conversational Helper for Organized Reflection. arXiv preprint arXiv:2511.04528 (2025)

  22. [22] Roxana Moreno. 2004. Decreasing Cognitive Load for Novice Students: Effects of Explanatory versus Corrective Feedback in Discovery-Based Multimedia. Instructional Science 32, 1-2 (2004), 99–113. doi:10.1023/b:truc.0000021811.66966.1d

  23. [23] Seyed Parsa Neshaei, Antonia Tolzin, Yvonne Berkle, Miriam Leuchter, Jan Marco Leimeister, Andreas Janson, and Thiemo Wambsganss. 2025. Leveraging Learner Errors in Digital Argumentation Learning: How ALure Helps Students Learn from their Mistakes and Write Better Arguments. Proc. ACM Hum.-Comput. Interact. 9, 2, Article CSCW125 (May 2025), 32 pages. doi:10...

  24. [24] Jakob Nielsen. 2006. Progressive disclosure. Nielsen Norman Group 1 (2006)

  25. [25] Richard Paul and Linda Elder. 2008. Critical thinking: The art of Socratic questioning. Journal of Developmental Education 31, 3 (2008), 34–35

  26. [26] Roy D Pea and D Midian Kurland. 1987. Cognitive technologies for writing. Review of Research in Education 14 (1987), 277–326

  27. [27] Colin Porlezza. 2019. Accuracy in journalism. Oxford Research Encyclopedia of Communication (2019)

  28. [28] Anku Rani, Valdemar Danry, Paul Pu Liang, Andrew B Lippman, and Pattie Maes. 2025. Dialogues with AI Reduce Beliefs in Misinformation but Build No Lasting Discernment Skills. arXiv preprint arXiv:2510.01537 (2025)

  29. [29] Kirk B Redwine. 2003. The importance of the police report. Criminal Justice Institute School of Law Enforcement Supervision, Session XXII. Retrieved from http://www.cji.edu/site/assets/files/1921/importance_of_police_reports.pdf (last accessed July 2014)

  30. [30] Ramon Ruiz-Dolz and John Lawrence. 2023. Detecting argumentative fallacies in the wild: Problems and limitations of large language models. In Proceedings of the 10th Workshop on Argument Mining. Association for Computational Linguistics

  31. [31] Frank Spillers. 2010. Progressive disclosure

  32. [32] Christian Stab and Iryna Gurevych. 2017. Parsing Argumentation Structures in Persuasive Essays. Comput. Linguist. 43, 3 (Sep 2017), 619–659. doi:10.1162/COLI_a_00295

  33. [33] Christian Stab and Iryna Gurevych. 2017. Recognizing insufficiently supported arguments in argumentative essays. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. 980–990

  34. [34] John Sweller. 1988. Cognitive load during problem solving: Effects on learning. Cognitive Science 12, 2 (1988), 257–285

  35. [35] Peter N Swisher. 1987. An Introduction to Legal Reasoning, Writing, and Research Techniques; and Trial Preparation and Appellate Advocacy

  36. [36] Lev Tankelevitch, Viktor Kewenig, Auste Simkute, Ava Elizabeth Scott, Advait Sarkar, Abigail Sellen, and Sean Rintel. 2024. The metacognitive demands and opportunities of generative AI. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–24

  37. [37] Thiemo Wambsganss, Andreas Janson, Matthias Söllner, Ken Koedinger, and Jan Marco Leimeister. 2024. Improving Students' Argumentation Skills Using Dynamic Machine-Learning-Based Modeling. Information Systems Research 36, 1 (2024), 474–507. doi:10.1287/isre.2021.0615

  38. [38] Florian Weber, Thiemo Wambsganss, Seyed Parsa Neshaei, and Matthias Soellner. 2024. LegalWriter: An Intelligent Writing Support System for Structured and Persuasive Legal Case Writing for Novice Law Students. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 1052, 23 pages. doi:10.1145/3613904.3642743

  39. [39] Linjin Xi, Yi Zhang, and Qiyun Wang. 2025. Investigating the effects of an LLM-based Socratic conversational agent on students' academic performance and reflective thinking in higher education. Computers & Education (2025), 105494

  40. [40] Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. 2022. Wordcraft: story writing with large language models. In Proceedings of the 27th International Conference on Intelligent User Interfaces. 841–852

  41. [41] Chao Zhang, Kexin Ju, Peter Bidoshi, Yu-Chun Grace Yen, and Jeffrey M Rzeszotarski. 2025. Friction: Deciphering Writing Feedback into Writing Revisions through LLM-Assisted Reflection. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–27

  42. [42] Zheng Zhang, Jie Gao, Ranjodh Singh Dhaliwal, and Toby Jia-Jun Li. 2023. VISAR: A human-AI argumentative writing assistant with visual programming and rapid draft prototyping. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 1–30

    Continue tracing support chains until reaching axioms ( unsupported base claims ) JSON Format : { claim : { " content ": " author's core position in your words " , " claim_quote ": " exact quote of the main thesis " , " s up po rt _r ela ti on s ": { " quotes ": { "1": " exact quote atomic reason in this case for claim_quote " , "2": " exact quote atomic ...