Unreflected Acceptance -- Investigating the Negative Consequences of ChatGPT-Assisted Problem Solving in Physics Education
Pith reviewed 2026-05-06 19:38 UTC · model claude-opus-4-7
The pith
Physics students using ChatGPT judged nearly half of the chatbot-assisted solutions correct when they were in fact wrong, and pasted exercise text into it ten times more often than into a search engine.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a between-subjects study of physics students solving exercises, nearly half of the ChatGPT-assisted solutions were incorrect yet rated correct by the students, who also pasted the exercise text verbatim into the chatbot in 42% of queries, versus 4% of search-engine queries. The authors read this as direct evidence that conversational LLMs invite a passive, low-reflection mode of use even from students working inside their own field of expertise, and that this over-trust is a behavioural effect of the tool, not just a knowledge gap.
What carries the argument
A controlled between-subjects comparison on a fixed set of physics exercises, with two outcome layers: solution correctness paired with student self-assessment of correctness (yielding the over-trust gap), and a query-behaviour coding scheme that flags copy-paste of the exercise text as a marker of unreflected interaction. The contrast between the two arms is what carries the argument.
If this is right
- Instruction in higher physics education needs to treat LLM use as a skill to be taught, not a tool students are left to discover on their own, including explicit prompts to reformulate problems before querying.
- Assessment design that allows take-home problem solving must assume that copy-paste-and-accept is the default student behaviour with LLMs, and adjust either the task format or the verification step accordingly.
- Tool designers who care about learning outcomes have a target to hit: reduce the rate at which confident-sounding wrong answers are accepted by domain-trained users, e.g. via calibrated uncertainty or required justification steps.
- Comparisons between LLMs and search engines should track interaction behaviour (query reformulation, cross-checking) and not only final answer accuracy, because the tools shape the user's reasoning loop differently.
Where Pith is reading between the lines
- The 42% vs 4% copy-paste gap is probably the more durable finding than the over-trust percentage, because it measures behaviour directly rather than through self-assessment, and it likely replicates across domains beyond physics.
- Over-trust in one's own field is the sharper version of the worry: if domain-trained users still accept wrong outputs at high rates, naive users are unlikely to do better, so the effect size in general populations may be larger.
- A natural follow-up is to test whether forcing users to write their own attempt before querying, or to mark which steps they verified, closes the over-trust gap without removing the productivity benefits.
- The result is consistent with a broader pattern in which fluent, conversational interfaces suppress the meta-cognitive 'does this answer make sense' check that ranked search results, by their fragmented form, tend to preserve.
Load-bearing premise
That a small, unevenly split sample (27 vs 12) on one exercise set, using students' own self-ratings as the measure of misplaced trust, is enough to attribute the gap to the tool itself rather than to ability differences, exercise difficulty, or which version of the chatbot was being used.
What would settle it
Repeat the study with balanced arms, blind expert grading of correctness, pre-tested matched groups on prior physics ability, and a fixed declared LLM version. If the over-trust rate drops to chance-level disagreement with experts and the copy-paste gap shrinks once interface habits are controlled for, the behavioural-effect claim does not survive.
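As a rough sense of scale for 'balanced arms', the sketch below runs a standard two-proportion power calculation with statsmodels; the 50% and 25% miscalibration rates are illustrative placeholders rather than values from the paper, and the calculation treats each participant as contributing a single graded solution.

```python
# Illustrative power calculation for a balanced two-arm replication.
# The rates below are placeholders, not numbers reported in the study.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_chatgpt = 0.50  # hypothetical wrong-but-rated-correct rate in the ChatGPT arm
p_search = 0.25   # hypothetical rate in the search-engine arm

effect = proportion_effectsize(p_chatgpt, p_search)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, ratio=1.0, alternative="two-sided"
)
print(f"Cohen's h = {effect:.2f}, about {n_per_arm:.0f} participants per arm")
```

With these placeholder rates the requirement comes out to roughly 55 to 60 participants per arm, several times the size of the original search arm, before any correction for multiple exercises per student.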
Original abstract
Large language models (LLMs) have recently gained popularity. However, the impact of their general availability through ChatGPT on sensitive areas of everyday life, such as education, remains unclear. Nevertheless, the societal impact on established educational methods is already being experienced by both students and educators. Our work focuses on higher physics education and examines problem solving strategies. In a study, students with a background in physics were assigned to solve physics exercises, with one group having access to an internet search engine (N=12) and the other group being allowed to use ChatGPT (N=27). We evaluated their performance, strategies, and interaction with the provided tools. Our results showed that nearly half of the solutions provided with the support of ChatGPT were mistakenly assumed to be correct by the students, indicating that they overly trusted ChatGPT even in their field of expertise. Likewise, in 42% of cases, students used copy & paste to query ChatGPT -- an approach only used in 4% of search engine queries -- highlighting the stark differences in interaction behavior between the groups and indicating limited reflection when using ChatGPT. In our work, we demonstrated a need to (1) guide students on how to interact with LLMs and (2) create awareness of potential shortcomings for users.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a between-subjects study of physics students solving physics exercises with access either to an internet search engine (N=12) or to ChatGPT (N=27). The authors evaluate task performance, problem-solving strategies, and interaction behavior with the assigned tool. Two headline findings are emphasized: (i) approximately half of the ChatGPT-supported solutions were incorrect yet self-rated as correct by the students, which the authors interpret as over-trust in ChatGPT even within the students' own field of expertise; and (ii) 42% of ChatGPT queries were copy-paste of the exercise text, versus 4% in the search-engine condition, which the authors interpret as evidence of limited reflection. From these results the authors argue for guidance on LLM interaction and for raising user awareness of LLM shortcomings.
Significance. The topic is timely and relevant to the physics-education-research community: empirical data on how physics students actually interact with ChatGPT during problem solving are still scarce, and a head-to-head comparison with a traditional search-engine condition is a sensible design choice. The behavioral observation that copy-paste querying is roughly an order of magnitude more frequent with ChatGPT than with a search engine is, on its face, a useful and reportable empirical fact, and the self-rating-vs-correctness mismatch is a plausible operationalization of over-trust. If the effects survive the methodological scrutiny below — in particular if the search-engine arm shows a meaningfully lower self-rating miscalibration — the paper would contribute concrete motivation for instructional interventions around LLM use in physics. As an exploratory, hypothesis-generating study with modest N, the contribution is appropriately scoped for physics.ed-ph.
major comments (5)
- [Results — '~50% over-trust' headline] The central interpretive claim (over-trust in ChatGPT) requires the analogous self-rating-vs-ground-truth disagreement rate in the search-engine arm as a baseline. Physics students are known to be imperfect self-graders generally; without the search arm's miscalibration rate reported alongside the ChatGPT arm's ~50%, the headline is consistent with generic self-assessment overconfidence rather than an LLM-specific effect. Please report the search-arm miscalibration rate explicitly, with a statistical comparison (e.g., a proportion test or mixed model with item as a random effect), and qualify the 'over-trust' framing accordingly if the gap is small. A minimal sketch of this comparison appears after these major comments.
- [Methods — sample and allocation] The N=12 vs N=27 allocation is unequal and the search arm is small; the abstract does not indicate whether assignment was randomized, whether prior ability (e.g., course grade, year of study) was measured and balanced across arms, or whether exercise difficulty and order were controlled. Because the headline statistics are proportions on small denominators, the conclusions are sensitive to these choices. Please document randomization, pre-assignment covariates, and any adjustment for them; report confidence intervals on the headline percentages.
- [Methods — outcome measurement] Using student self-rating as the 'correctness' signal that defines over-trust conflates two constructs (objective correctness, and the student's belief about correctness). The over-trust claim requires blind expert grading of the same solutions against an answer key, with inter-rater agreement reported, so that 'wrong but rated correct' is a well-defined event. If grading was done by the authors non-blind to condition, this should be acknowledged as a limitation.
- [Results — 42% vs 4% copy-paste contrast] Interpreting the copy-paste asymmetry as 'limited reflection' is not the only available reading: ChatGPT accepts natural-language prompts whereas search engines reward keyword queries, so part of the gap is affordance-driven. To support the reflection interpretation, please show within-ChatGPT-arm evidence that copy-paste queries are associated with worse outcomes and/or larger self-rating miscalibration than reformulated prompts. Without this, the 42%/4% number documents an interaction-style difference but does not by itself license the cognitive claim.
- [Methods — model version and prompt logging] The abstract does not state which ChatGPT version/model was used, the date window of data collection, the system prompt (if any), or whether conversations were logged verbatim. Because LLM correctness on physics problems varies substantially across versions, this information is load-bearing for any claim about 'ChatGPT' as a category and for reproducibility. Please specify the model, access mode, and provide (anonymized) transcripts or at minimum query/response statistics in supplementary material.
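To make the baseline comparison requested in the first major comment concrete, here is a minimal sketch using standard scipy/statsmodels routines; the miscalibration counts are invented placeholders for the per-arm 'wrong but self-rated correct' tallies, which the abstract does not report, and the denominators simply reuse the reported group sizes as if each participant contributed one graded solution.

```python
# Sketch: comparing wrong-but-self-rated-correct rates across the two arms.
# Counts are hypothetical; the abstract only gives the ChatGPT-arm rate (~50%).
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

miscal = [13, 3]   # hypothetical miscalibrated solutions: [ChatGPT arm, search arm]
totals = [27, 12]  # placeholder denominators (one graded solution per participant)

for label, k, n in zip(["ChatGPT", "search"], miscal, totals):
    lo, hi = proportion_confint(k, n, alpha=0.05, method="wilson")
    print(f"{label}: {k}/{n} = {k/n:.0%}, 95% Wilson CI [{lo:.0%}, {hi:.0%}]")

z, p_z = proportions_ztest(miscal, totals)  # two-proportion z-test
table = [[miscal[0], totals[0] - miscal[0]],
         [miscal[1], totals[1] - miscal[1]]]
_, p_fisher = fisher_exact(table)           # exact test, safer with small cells
print(f"z = {z:.2f} (p = {p_z:.3f}); Fisher's exact p = {p_fisher:.3f}")
```

A per-solution test like this ignores clustering of solutions within students; when each participant solves several exercises, the mixed-effects logistic model mentioned in the comment above is the cleaner route.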
minor comments (5)
- [Abstract] The phrase 'students with a background in physics' is vague — please specify program, year, and the courses already completed, since 'their field of expertise' is doing interpretive work in the conclusion.
- [Abstract] Clarify whether the 42% / 4% figures are computed per query, per session, or per student; the denominator matters for how the contrast should be read.
- [Exercises] Please list the exercises used (or include them in an appendix), with their topic, difficulty level, and source, so readers can judge generalizability and whether ChatGPT's known weaknesses on specific problem types drive the correctness gap.
- [Framing] The phrasing 'mistakenly assumed to be correct' presupposes an objective grader; tighten the wording to make the grading procedure explicit at first mention.
- [Recommendations] The two recommendations (guide students; raise awareness) are reasonable but generic. Linking them to specific observed failure modes in the data (e.g., particular kinds of physics errors that went unnoticed) would strengthen the paper's practical contribution.
Simulated Author's Rebuttal
We thank the referee for an engaged and constructive report. The five major comments converge on a coherent critique: the abstract advances two cognitive claims (over-trust; limited reflection) whose support requires (a) a within-study baseline from the search arm, (b) a clean separation of objective correctness from self-rating, (c) within-ChatGPT-arm evidence linking copy-paste prompting to worse outcomes, and (d) full reporting of model version, allocation, and covariates. We accept this critique substantially in full. The revision will: report search-arm miscalibration alongside the ChatGPT figure with proportion tests and Wilson CIs; document the (non-individually-randomized) allocation, balance on prior-ability covariates, and fixed exercise order; describe the expert-graded answer key, add a blind second rater on a subsample with Cohen's κ, and flag non-blind grading where applicable; add a within-ChatGPT analysis tying query type to correctness and miscalibration, with the affordance confound stated explicitly; and specify the model version, access mode, and collection window, and release anonymized transcripts/aggregate query statistics as supplementary material. Where evidence does not survive these tests, we will downgrade the corresponding claim (e.g., 'limited reflection' → descriptive interaction-style difference; 'over-trust' → condition-dependent miscalibration of the magnitude actually observed). We believe the revised manuscript will retain its core empirical contribution.
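The rebuttal's plan for a blind second rater reduces to an ordinary chance-corrected agreement check; the sketch below assumes two hypothetical vectors of per-solution correctness labels and uses scikit-learn's cohen_kappa_score, so everything beyond that function is an assumption about how the grading data would be laid out.

```python
# Sketch: inter-rater agreement on expert correctness grading (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

# 1 = solution graded correct, 0 = graded incorrect; one entry per graded solution
primary_rater = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
second_rater  = [1, 0, 1, 1, 1, 0, 1, 0, 0, 0]  # second rater, blind to condition

kappa = cohen_kappa_score(primary_rater, second_rater)
raw = sum(a == b for a, b in zip(primary_rater, second_rater)) / len(primary_rater)
print(f"raw agreement = {raw:.0%}, Cohen's kappa = {kappa:.2f}")
```

Reporting raw agreement alongside κ matters here because correctness rates far from 50% can make κ look low even when the raters rarely disagree.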
Point-by-point responses
-
Referee: Headline ~50% over-trust claim requires the analogous self-rating-vs-ground-truth miscalibration rate in the search-engine arm as a baseline, with a statistical comparison; otherwise the effect could be generic student overconfidence rather than LLM-specific.
Authors: We agree that the over-trust framing is only defensible against a within-study baseline. The miscalibration rate in the search-engine arm was computed but, as the referee correctly infers from the abstract, was not foregrounded with the same prominence as the ChatGPT figure. In the revision we will (i) report the search-arm 'wrong-but-self-rated-correct' rate side-by-side with the ChatGPT arm, (ii) add a Fisher's exact / two-proportion z-test together with 95% Wilson confidence intervals, and (iii) where appropriate, fit a mixed-effects logistic model with exercise as a random effect and condition as a fixed effect. We will soften the 'over-trust' wording to 'condition-dependent miscalibration' and reserve the stronger framing for the magnitude of the gap that survives this analysis. We thank the referee for forcing this comparison into the headline. revision: yes
-
Referee: N=12 vs N=27 is unequal and small; the abstract does not indicate randomization, pre-assignment covariates (prior ability, year of study), control of exercise difficulty/order, or confidence intervals on the headline percentages.
Authors: We will document this much more carefully. Assignment to condition was made at the session level rather than by individually randomized allocation, which is the proximate cause of the unequal cell sizes; we will state this explicitly and treat it as a limitation rather than as randomization. Background covariates (semester of study, prior physics coursework) were collected and will be reported by arm with balance statistics. Exercise set and order were held fixed across participants, which we will state. All headline proportions will be reported with Wilson 95% CIs, and we will add a sensitivity analysis adjusting for the available covariates. We agree the small search-arm denominator limits power and will temper the conclusions accordingly. revision: yes
-
Referee: Using student self-rating as the correctness signal conflates objective correctness with belief about correctness. Over-trust requires blind expert grading against an answer key with inter-rater agreement; non-blind author grading should be acknowledged.
Authors: To be clear, the operationalization in the manuscript is two-channel: an expert-graded correctness label against a worked answer key, and the student's self-rating; the 'wrong-but-rated-correct' event is the disagreement between these two channels, not self-rating alone. We acknowledge, however, that the abstract elides this and that the grading procedure is under-described. In the revision we will: (i) describe the answer key and rubric, (ii) have a second rater independently grade a substantial subsample blind to condition and report Cohen's κ, and (iii) state explicitly whether the primary grader was blind to condition. Where blinding was not possible (e.g., transcripts contain ChatGPT artifacts), we will flag this as a residual limitation. revision: yes
-
Referee: The 42% vs 4% copy-paste contrast may be affordance-driven (natural-language vs keyword interfaces) rather than reflective of cognition. Within-ChatGPT-arm evidence is needed that copy-paste queries are associated with worse outcomes or greater miscalibration.
Authors: This is a fair and important point. The affordance asymmetry is real and we will state it up front in the Discussion: search engines structurally penalize verbatim prompts, so the across-condition contrast cannot by itself carry the 'limited reflection' interpretation. To support the cognitive reading we will add a within-ChatGPT-arm analysis correlating query type (verbatim copy-paste vs. reformulated/decomposed prompt) with (a) objective correctness of the final solution and (b) self-rating miscalibration, with appropriate clustering by participant. If that within-arm association is weak, we will downgrade the claim to a descriptive interaction-style finding, as the referee suggests; a minimal sketch of this within-arm analysis appears after these responses. revision: yes
-
Referee: The model version, date window, system prompt, and whether conversations were logged verbatim are not specified; this is load-bearing for any 'ChatGPT' claim and for reproducibility.
Authors: Agreed. The full text specifies the access mode and collection window, but the abstract does not, and the supplementary information is thinner than it should be. In the revision we will state in both abstract and methods: the specific model/version used, the access route (web interface, default settings, no custom system prompt), the calendar window of data collection, and the fact that full conversations were logged. We will release anonymized transcripts and per-query metadata (length, copy-paste flag, follow-up depth) as supplementary material, subject to participant consent, and where individual transcripts cannot be released we will provide aggregated query/response statistics. We will also explicitly scope all claims to the model version tested rather than to 'ChatGPT' as a category. revision: yes
- The unequal allocation (N=12 search vs N=27 ChatGPT) cannot be repaired post hoc; the search arm will remain underpowered and we will state this as a limitation rather than claim it away.
- Because ChatGPT transcripts contain identifying artifacts of the condition, fully blinded expert grading of the ChatGPT arm is only partially achievable; we can blind a second rater to condition for the search arm and to authorship within the ChatGPT arm, but not to the tool used.
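The within-arm analysis promised in response to the copy-paste comment could take roughly the following form: a logistic regression of expert-graded correctness on query type with participant-clustered standard errors. The data frame, column names, and values are hypothetical; only the general statsmodels pattern is meant to carry over.

```python
# Sketch: within the ChatGPT arm, is verbatim copy-paste querying associated with
# worse solution correctness? Data layout and values are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "participant": ["p01", "p01", "p02", "p02", "p03", "p03", "p04", "p04"],
    "copy_paste":  [1, 1, 1, 0, 0, 1, 0, 0],  # 1 = exercise text pasted verbatim
    "correct":     [0, 1, 0, 1, 0, 1, 1, 0],  # expert-graded correctness
})

# Cluster-robust standard errors by participant (integer group codes for safety).
groups = pd.factorize(df["participant"])[0]
model = smf.logit("correct ~ copy_paste", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": groups}
)
print(model.summary())
```

Swapping the outcome for a wrong-but-self-rated-correct indicator covers the miscalibration half of the promised analysis; with realistically few participants per arm, a cluster-robust variance is itself only an approximation, which is worth flagging.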
Circularity Check
No meaningful circularity: an empirical between-subjects study whose weaknesses are statistical/design issues, not definitional loops.
full rationale
This is an empirical education study comparing two groups of students (search engine vs. ChatGPT) on physics problem solving. The headline claims — ~50% of ChatGPT-assisted solutions wrong but self-rated correct, and 42% vs 4% copy-paste rates — are measurements against external referents (ground-truth correctness of physics solutions, observed query behavior). Nothing in the abstract reduces a "prediction" to a fitted input, renames a known result, or imports a uniqueness theorem from the authors' prior work. The skeptic's concerns (missing baseline self-rating accuracy in the N=12 search arm, affordance-driven copy-paste behavior, small/unequal N, unspecified model version) are legitimate but they are correctness/generalizability concerns, not circularity. The interpretive leap from "wrong-but-self-rated-correct" to "over-trust in ChatGPT" is an inferential gap (needs the search-arm baseline), not a definitional loop where the conclusion is the input. Score 1 reflects only that, on the abstract alone, the "over-trust" label is partially constituted by the same self-rating data it purports to explain — but this is mild interpretive slippage, not load-bearing circularity.
Forward citations
Cited by 1 Pith paper
-
Generative AI Use in Entrepreneurship: An Integrative Review and an Empowerment-Entrapment Framework
The paper proposes the Empowerment-Entrapment Framework showing that generative AI both empowers and entraps entrepreneurs at each stage of the entrepreneurial process.