Counterargument for Critical Thinking as Judged by AI and Humans
Pith reviewed 2026-05-08 16:23 UTC · model grok-4.3
The pith
Students' counterarguments to AI-generated theses exhibit logic, a key component of critical thinking, and LLMs can assess such writing with moderate alignment to human judges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish two claims: (1) students' self-written counterarguments to AI-generated content exhibit logic, among other qualities, which is a key component of critical thinking; and (2) GenAI can be used successfully at scale to assess students' written work against clear rubrics, with assessments generally aligning with human judgments, as shown by Gwet's AC2 inter-rater reliability values of 0.33 for all models except one.
What carries the argument
Six established rubrics (focus, logic, content, style, correctness, reference) applied uniformly to student counterarguments against AI-generated thesis statements, enabling direct quantitative comparison between human and LLM judges on the same 5-point scale.
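The headline reliability statistic behind the human-versus-LLM comparison is Gwet's AC2 on this 5-point scale. As a minimal sketch of how such a coefficient is computed, assuming quadratic ordinal weights (the weighting scheme is not stated in the excerpt) and a submissions-by-raters score matrix; `gwet_ac2` is an illustrative helper, not the paper's code:

```python
import numpy as np

def gwet_ac2(ratings, categories, weights="quadratic"):
    """Gwet's AC2 chance-corrected agreement for multiple raters.

    ratings: (n_items, n_raters) array of scores; NaN marks a missing rating.
    categories: sorted scale values, e.g. [1, 2, 3, 4, 5].
    """
    cats = np.asarray(categories, dtype=float)
    q = len(cats)
    # Quadratic (ordinal) weights give partial credit to near-misses;
    # identity weights would reduce this to Gwet's AC1.
    if weights == "quadratic":
        d = (cats[:, None] - cats[None, :]) / (cats[-1] - cats[0])
        w = 1.0 - d**2
    else:
        w = np.eye(q)

    R = np.asarray(ratings, dtype=float)
    n = R.shape[0]
    # r[i, k] = number of raters placing item i in category k
    r = np.array([[np.sum(R[i] == c) for c in cats] for i in range(n)])
    ri = r.sum(axis=1)  # raters per item

    # Observed weighted agreement over items rated by at least 2 raters
    rstar = r @ w.T
    m = ri >= 2
    pa = np.mean((r[m] * (rstar[m] - 1)).sum(axis=1) / (ri[m] * (ri[m] - 1)))

    # Chance agreement from overall category prevalences
    pi = (r / ri[:, None]).mean(axis=0)
    pe = w.sum() / (q * (q - 1)) * np.sum(pi * (1 - pi))
    return float((pa - pe) / (1 - pe))
```

Perfect within-item agreement returns 1.0; the paper's reported 0.33 sits in the range usually read as fair-to-moderate for weighted agreement coefficients.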
Load-bearing premise
The six rubrics, especially the logic rubric, validly measure critical thinking, and the single-course sample of 35 submissions generalizes beyond this university setting and topic set.
What would settle it
A replication study with a larger multi-institution sample where either the logic rubric scores fail to correlate with an independent validated critical thinking test or where the LLMs show substantially lower agreement with humans than the reported AC2 of 0.33.
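The first half of that settling test, checking whether logic-rubric scores track an independent validated critical thinking measure, is a plain rank correlation. A minimal sketch with a hand-rolled Spearman coefficient; the paired-score layout and the `spearman_rho` helper are illustrative assumptions, not the paper's analysis:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of ranks,
    with average ranks assigned within tied groups."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):  # average ranks for ties
            m = v == val
            r[m] = r[m].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Hypothetical paired scores per student: logic rubric (1-5) vs. an
# external CT instrument score; a strong rho would support validity.
logic_scores = [2, 4, 3, 5, 1, 4, 3]
ct_test_scores = [38, 61, 49, 70, 30, 58, 52]
rho = spearman_rho(logic_scores, ct_test_scores)
```

A failure of this correlation in a larger multi-institution sample would undermine claim (1) as stated.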
read the original abstract
This intervention study investigates the use of counterarguments in writing for critical thinking by students in the context of Generative AI (GenAI). This is especially relevant as risks of cheating and cognitive offloading exist with the use of GenAI. We presented 36 students in a particular university course with 4 carefully selected thesis statements (from a set of popular debates) to write about any one of them. We used six established rubrics (focus, logic, content, style, correctness and reference) to conduct three human assessments (two student peer-reviews and one experienced teacher) per writeup on a 5-point Likert scale for all the qualified samples (n) of 35 submissions (after disqualifying one for irregularity). Using the same rubrics and guidelines, we also assessed the submissions using six frontier LLMs as judges. Our mixed-method design included qualitative open-ended feedback per assessment and quantitative methods. The results reveal that (1) the students' self-written counterarguments to AI-generated content contain logic, among other things, which is a key component of critical thinking, and (2) GenAI can be successfully used at scale to assess students' written work, based on clear rubrics, and these assessments generally align with human assessments as shown with Gwet's AC2 inter-rater reliability values of 0.33 for all the models except one.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an intervention study with 35 students writing counterarguments to AI-generated theses on four debate topics. Six rubrics (focus, logic, content, style, correctness, reference) were applied on a 5-point scale by three human raters (two peers, one teacher) and six frontier LLMs per submission. Mixed-methods analysis of quantitative scores and qualitative feedback leads to two claims: (1) student counterarguments demonstrate logic (among other elements) as a key component of critical thinking, and (2) LLMs can assess student writing at scale using clear rubrics, with assessments generally aligning with humans (Gwet's AC2 = 0.33 for five of six models).
Significance. If the central claims hold after addressing sample and validation limitations, the work provides empirical support for using counterargument tasks to promote critical thinking in GenAI contexts and for deploying LLMs as scalable graders when rubrics are explicit. The mixed-methods design with multiple human raters and six LLMs offers direct comparative evidence, and the use of established rubrics plus open-ended feedback is a strength. However, the moderate AC2 value and single-course sample constrain claims about broad applicability in education.
major comments (3)
- [Methods (rubrics and assessment design)] Methods section on rubrics and participant selection: The first central claim rests on the 'logic' rubric scores indicating critical thinking, yet no validation data, citations to established CT instruments (e.g., Watson-Glaser or Delphi Report), or correlation with external CT measures are provided. This is load-bearing because the rubric is treated as directly measuring a 'key component' without supporting evidence.
- [Results (quantitative analysis) and Discussion] Results and Discussion on inter-rater reliability: Gwet's AC2 = 0.33 is presented as evidence of general alignment supporting scalable AI assessment, but the value is moderate, one model is an exception, and no error bars, confidence intervals, or baseline comparisons (e.g., to random or majority-class agreement) are reported. This directly affects the strength of claim (2) on successful use at scale.
- [Methods (participants) and Limitations] Methods and Limitations: The sample is restricted to 35 submissions from one university course on four specific topics. No power analysis, multi-site replication plan, or discussion of how rubric scores or AI-human agreement might vary by discipline, institution, or prompt set is included, limiting support for generalization in the abstract and conclusions.
minor comments (2)
- [Abstract and Methods] The abstract states 'n of 35 submissions (after disqualifying one)' but the introduction mentions 36 students; clarify the exact number of valid submissions and disqualification criteria.
- [Results] A table or figure presenting per-model AC2 values and per-rubric breakdowns would improve clarity; currently the single aggregate value of 0.33 is hard to interpret without visibility into variation across rubrics or models.
Simulated Author's Rebuttal
We thank the referee for these constructive comments on our manuscript. We address each major point below with clarifications and indicate where revisions will be made to strengthen the paper.
read point-by-point responses
- Referee: [Methods (rubrics and assessment design)] Methods section on rubrics and participant selection: The first central claim rests on the 'logic' rubric scores indicating critical thinking, yet no validation data, citations to established CT instruments (e.g., Watson-Glaser or Delphi Report), or correlation with external CT measures are provided. This is load-bearing because the rubric is treated as directly measuring a 'key component' without supporting evidence.
Authors: We agree that explicit links to critical thinking frameworks would strengthen the first claim. Logic is a core component in established definitions such as the Delphi Report (Facione, 1990), which we will cite in the revised Introduction and Methods when describing the rubrics. The six rubrics were selected from established writing assessment practices for their applicability to argumentative counterarguments. However, this study did not include external validation against instruments like the Watson-Glaser or correlations with other CT measures, as the design focused on an intervention with rubric-based scoring. We will expand the Methods to justify the rubric choice and more explicitly note this as a limitation while qualifying the claim language. revision: partial
- Referee: [Results (quantitative analysis) and Discussion] Results and Discussion on inter-rater reliability: Gwet's AC2 = 0.33 is presented as evidence of general alignment supporting scalable AI assessment, but the value is moderate, one model is an exception, and no error bars, confidence intervals, or baseline comparisons (e.g., to random or majority-class agreement) are reported. This directly affects the strength of claim (2) on successful use at scale.
Authors: We will revise the Results and Discussion sections to report confidence intervals for the Gwet's AC2 values and include baseline comparisons to chance-level agreement. We acknowledge that 0.33 indicates moderate agreement, which is typical for subjective writing evaluations but not strong evidence of equivalence; we will temper the language of claim (2) to describe 'moderate alignment' and discuss the exception for one model. These additions will be incorporated in the next version. revision: yes
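The promised revisions (confidence intervals and chance baselines) follow a generic resampling pattern. A minimal sketch; the row-resampling scheme, the toy exact-agreement statistic, and the helper names are assumptions for illustration, and the same pattern applies to Gwet's AC2 by swapping in that statistic:

```python
import numpy as np

def bootstrap_ci(ratings, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for an agreement
    statistic, resampling submissions (rows) with replacement."""
    R = np.asarray(ratings)
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    boots = [stat(R[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def exact_agreement(R):
    """Toy statistic: fraction of items where both raters give the same score."""
    return float(np.mean(R[:, 0] == R[:, 1]))

def chance_baseline(R, n_sim=2000, seed=1):
    """Permutation baseline: expected agreement when one rater's scores
    are shuffled across items, destroying item-level alignment."""
    rng = np.random.default_rng(seed)
    sims = [exact_agreement(np.column_stack((R[:, 0],
                                             rng.permutation(R[:, 1]))))
            for _ in range(n_sim)]
    return float(np.mean(sims))

# Hypothetical 35-submission sample: an LLM rater mostly matching a human
human = np.tile([1, 2, 3, 4, 5], 7)
llm = human.copy()
llm[:6] += 1                     # a few deliberate one-point disagreements
llm = np.clip(llm, 1, 5)
scores = np.column_stack((human, llm))
```

On hypothetical data like this, reporting the observed statistic alongside its bootstrap interval and the permutation baseline makes the n = 35 uncertainty explicit, which is what the referee's point calls for.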
- Referee: [Methods (participants) and Limitations] Methods and Limitations: The sample is restricted to 35 submissions from one university course on four specific topics. No power analysis, multi-site replication plan, or discussion of how rubric scores or AI-human agreement might vary by discipline, institution, or prompt set is included, limiting support for generalization in the abstract and conclusions.
Authors: We will expand the Limitations section to address the single-course sample of 35 submissions, note that no a priori power analysis was conducted (as the study was exploratory), and qualify statements in the abstract and conclusions to indicate preliminary findings. We will also add discussion of potential variation by topic based on our existing data and recommend future multi-site studies for broader generalization, since the sample cannot be expanded retrospectively. revision: partial
Circularity Check
No significant circularity; empirical study relies on external rubrics and ratings
full rationale
The paper reports an empirical intervention study that applies six established rubrics to 35 student submissions, obtains human ratings (peer and teacher), runs the same rubrics on six LLMs, and computes Gwet's AC2 inter-rater agreement. No equations, fitted parameters, or derived quantities appear in the provided text. The rubrics are described as established rather than defined inside the paper; the central claims (presence of logic in counterarguments and general alignment of LLM assessments) are direct empirical observations, not reductions of one quantity to another by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The analysis therefore contains no self-definitional, fitted-input, or self-citation circularity steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Established rubrics (focus, logic, content, style, correctness, reference) validly capture components of critical thinking
- standard math Gwet's AC2 is an appropriate measure for inter-rater agreement on ordinal rubric scores
Reference graph
Works this paper leans on
- [1] Aladro, R., Martín, S., Riquelme, D., et al., 2015, , 579, A101.
- [2] Adewumi, T., Alkhaled, L., Buck, C., Hernandez, S., Brilioth, S., Kekung, M., Ragimov, Y., Barney, E., 2025a. ProCoT: Stimulating critical thinking and writing of students through engagement with large language models (LLMs). Journal of Pedagogical Sociology and Psycholog...
- [3] Adewumi, T., Alkhaled, L., Gurung, N., van Boven, G., Pagliai, I., 2024. Fairness and bias in multimodal AI: A survey. arXiv preprint arXiv:2406.19097.
- [4] Adewumi, T., Alkhaled, L., Imbert, F., Han, H., Habib, N., Löwenmark, K., 2025b. AI must not be fully autonomous. arXiv preprint arXiv:2507.23330.
- [5] Adewumi, T., Habib, N., Alkhaled, L., Barney, E., 2025c. On the limitations of large language models (LLMs): False attribution. In: Angelova, G., Kunilovskaya, M., Escribe, M., Mitkov, R. (Eds.), Proceedings of the 15th International Conference on Recent Advances in Na...
- [6] Adewumi, T., Liwicki, F.S., Liwicki, M., Gardelli, V., Alkhaled, L., Mokayed, H., 2025d. Findings of MEGA: Math explanation with LLMs using the Socratic method for active learning. IEEE Signal Processing Magazine 42, 77–94. doi:10.1109/MSP.2025.3590807.
- [7] Alkharusi, H., 2022. A descriptive analysis and interpretation of data from Likert scales in educational and psychological research. Indian Journal of Psychology and Education 12, 13–16.
- [8] Anderson, L.W., Krathwohl, D.R., 2001. A taxonomy for learning, teaching, and assessing: A revision of Bloom's taxonomy of educational objectives, complete edition. Addison Wesley Longman, Inc.
- [9] Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H., Krathwohl, D.R., et al., 1956. Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain. Longman, New York.
- [10] Brookhart, S.M., 2013. How to create and use rubrics for formative assessment and grading. ASCD.
- [11] Chi, M.T., Wylie, R., 2014. The ICAP framework: Linking cognitive engagement to active learning outcomes. Educational Psychologist 49, 219–243.
- [12] Duron, R., Limbach, B., Waugh, W., 2006. Critical thinking framework for any discipline. International Journal of Teaching and Learning in Higher Education 17, 160–166.
- [13] Dwyer, C.P., Hogan, M.J., Stewart, I., 2014. An integrated critical thinking framework for the 21st century. Thinking Skills and Creativity 12, 43–52.
- [14] Dwyer, C.P., Hogan, M.J., Stewart, I., 2015. The promotion of critical thinking skills through argument mapping.
- [15] Ennis, R.H., Weir, E.E., 1985. The Ennis-Weir critical thinking essay test: An instrument for teaching and testing. Midwest Publications.
- [16] Facione, P.A., 1990. The Delphi report: Committee on pre-college philosophy. In: American Philosophical Association.
- [17] Fulan, L., Mengchen, Z., Wenyun, L., 2025. Corpus-assisted counterargumentation instruction: Cultivating critical thinking via argumentative writing. Thinking Skills and Creativity, 102120.
- [18] Gerlich, M., 2025a. AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies 15, 6.
- [19] Gerlich, M., 2025b. AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies 15, 6. doi:10.3390/soc15010006.
- [20] González, I., Rapanta, C., Larrain, A., 2026. Promoting argumentation skills among university students: A scoping review. Higher Education Quarterly 80, e70080.
- [21] Halpern, R., 2006. Halpern critical thinking assessment using everyday situations: Background and scoring standards. Claremont, CA: Claremont McKenna College.
- [22] Helal, M.Y., Elgendy, I.A., Albashrawi, M.A., Dwivedi, Y.K., Al-Ahmadi, M.S., Jeon, I., 2025. The impact of generative AI on critical thinking skills: A systematic review, conceptual framework and future research directions. Information Discovery and Delivery.
- [23] Howard, R.D., McLaughlin, G.W., Knight, W.E., 2012. The handbook of institutional research. John Wiley & Sons.
- [24] Jonsson, A., Svingby, G., 2007. The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review 2, 130–144.
- [25] Joshi, A., Kale, S., Chandel, S., Pal, D.K., 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology 7, 396.
- [26] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al., 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Lear...
- [27] Kocmi, T., Federmann, C., 2023. Large language models are state-of-the-art evaluators of translation quality. In: Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pp. 193–203.
- [28] Kosmyna, N., Hauptmann, E., Yuan, Y.T., Situ, J., Liao, X.H., Beresnitzky, A.V., Braunstein, I., Maes, P., 2025a. Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. arXiv preprint arXiv:2506.08872 4.
- [29] Kosmyna, N., Hauptmann, E., Yuan, Y.T., Situ, J., Liao, X.H., Beresnitzky, A.V., Braunstein, I., Maes, P., 2025b. Your brain on ChatGPT: Accumulation of cognitive debt when using an AI assistant for essay writing task. doi:10.48550/ARXIV.2506.08872. Version 2.
- [30] Ku, K.Y., 2009. Assessing students' critical thinking performance: Urging for measurements using multi-response format. Thinking Skills and Creativity 4, 70–76.
- [31] Kuhn, D., 1991. The skills of argument. Cambridge University Press.
- [32] Kuhn, D., 2018. A role for reasoning in a dialogic approach to critical thinking. Topoi 37, 121–128.
- [33] Lai, E.R., 2011. Critical thinking: A literature review. Pearson's Research Reports 6, 40–41.
- [34] Li, X., Jiang, Q., Jiang, L., Zhang, S., Hu, S., 2026. The landscape of AI alignment: A comprehensive review of theories and methods. International Journal of Pattern Recognition and Artificial Intelligence 40, 2539001.
- [35] Ling, J.H., 2025. A review of rubrics in education: Potential and challenges. Indonesian Journal of Innovative Teaching and Learning 2, 1–14.
- [36] Liu, F., Stapleton, P., 2020. Counterargumentation at the primary level: An intervention study investigating the argumentative writing of second language learners. System 89, 102198.
- [37] Liu, J., 2025. The role of generative AI in the process of autonomous learning of college students. Journal of Education, Humanities and Social Sciences 53, 38–42. doi:10.54097/brzv3w55.
- [38] Manurung, M.R., Masitoh, S., Arianto, F., 2022. How thinking routines enhance critical thinking of elementary students. IJORER: International Journal of Recent Educational Research 3, 640–650.
- [39] Mulaudzi, L.V., Hamilton, J., 2025. Lecturer's perspective on the role of AI in personalized learning: Benefits, challenges, and ethical considerations in higher education. Journal of Academic Ethics 23, 1571–1591.
- [40] Nussbaum, E.M., Sinatra, G.M., 2003. Argument and conceptual engagement. Contemporary Educational Psychology 28, 384–395.
- [41] Pinedo, R., García, N., Cañas, M., 2018. Thinking routines across different subjects and educational levels. In: INTED2018 Proceedings, IATED, pp. 5577–5580.
- [42] Rahman, M.M., Watanobe, Y., 2023. ChatGPT for education and research: Opportunities, threats, and strategies. Applied Sciences 13, 5783. doi:10.3390/app13095783.
- [43] Ritchhart, R., Church, M., Morrison, K., 2011. Making thinking visible: How to promote engagement, understanding, and independence for all learners. John Wiley & Sons.
- [44] Romiszowski, A.J., 2016. Designing instructional systems: Decision making in course planning and curriculum design. Routledge.
- [45] Sinfield, S., Burns, T., 2023. Design thinking in education: Adding collaboration, uncertainty, phronesis and fairydust to curriculum design. International Journal of Management and Applied Research 10, 263–269.
- [46] Toulmin, S.E., 2003. The uses of argument. Cambridge University Press.
- [47] Tripathi, S., Alkhulaifat, D., Lyo, S., Sukumaran, R., Li, B., Acharya, V., McBeth, R., Cook, T.S., 2025. A hitchhiker's guide to good prompting practices for large language models in radiology. Journal of the American College of Radiology 22, 841–847.
- [48] Watson, G., 1980. Watson-Glaser critical thinking appraisal. Volume 3. Psychological Corporation, San Antonio, TX.
- [49] Yavuz, F., Çelik, Ö., Yavaş Çelik, G., 2025. Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments. British Journal of Educational Technology 56, 150–166.
- [50] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al., 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, 46595–46623.
- [51] Zou, D., Xie, H., Kohnke, L., 2025. Navigating the future: Establishing a framework for educators' pedagogic artificial intelligence competence. European Journal of Education 60, e70117.