Performance and failure modes of AI chatbots on a novel concept inventory on relativity in classical mechanics
Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3
The pith
AI chatbots score above students on a new relativity concept test yet fail entirely on a few items due to visual misreadings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On a previously unpublished 21-item Classical Relativity Concept Inventory, GPT-5.2, Gemini 3 Pro, and Gemini 3 Flash achieve 73 percent, 89 percent, and 97 percent accuracy respectively against a student baseline of 62 percent, yet all three models fail completely on a small subset of items. These failures trace overwhelmingly to incorrect visual interpretation of the provided diagrams, and the models' errors converge on a single distractor far more consistently than the distributed mistakes observed among students.
What carries the argument
The Classical Relativity Concept Inventory (CRCI), a validated 21-item test on Galilean relativity, combined with repeated administration (30 trials per item) and qualitative coding of every response along visual interpretation, physics reasoning, and coordination dimensions.
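The repeated-administration arithmetic (21 items × 30 trials × 3 models = 1890 coded responses) and the per-item accuracy it yields can be sketched as follows; the response data below is invented for illustration, not the study's actual log:

```python
from collections import Counter

# Study design: 21 CRCI items, each administered 30 times to each of 3 models.
N_ITEMS, N_TRIALS, N_MODELS = 21, 30, 3
assert N_ITEMS * N_TRIALS * N_MODELS == 1890  # total responses coded in the paper

def item_accuracy(responses, correct):
    """Fraction of trials on one item that chose the keyed answer."""
    return sum(r == correct for r in responses) / len(responses)

# Invented examples: an item a model nearly always gets right,
# and an item it fails on every trial (the "complete failure" pattern).
easy = ["B"] * 28 + ["C", "D"]   # 28/30 correct
hard = ["D"] * 30                # 0/30 correct, all errors on one distractor
print(item_accuracy(easy, "B"))  # 0.933...
print(item_accuracy(hard, "B"))  # 0.0
print(Counter(hard))             # every error lands on distractor D
```

Repeating each item 30 times is what lets the study distinguish stable, item-specific failure from response noise: an item at 0/30 is qualitatively different from one at 20/30.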
If this is right
- Chatbot reliability for physics concept questions is item-specific and cannot be assumed uniform.
- Visual content in tests creates distinct failure modes for current language models that are separate from physics content.
- LLM error patterns are narrower and more deterministic than the broader distribution of student errors.
- Concept inventories must be withheld from public view to serve as uncontaminated measures of model capability.
Where Pith is reading between the lines
- Future model updates that incorporate the published inventory may eliminate the visual-interpretation failures observed here.
- Instructors could deliberately include diagram-heavy items to create detectable signatures of AI-generated answers.
- Hybrid assessment designs that combine chatbots with human review of diagram-based questions may be needed for reliable use.
Load-bearing premise
The inventory was truly absent from every model's training data, and the qualitative coding of responses into visual, physics, and coordination categories is reproducible across coders.
What would settle it
Re-administer the same inventory items to the identical models after the inventory has been publicly released for several months and check whether the previously zero-scoring items now receive correct answers.
Original abstract
AI chatbots are increasingly used by students as study tools in physics, raising practical questions about their reliability on conceptual tasks. Existing evaluations of large language models (LLMs) on physics concept inventories rely almost exclusively on instruments that have been publicly available for years and likely appear in model training data, making it difficult to disentangle physics competence from familiarity with the test items themselves. We address this issue by evaluating three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a recently developed and validated 21-item instrument on Galilean relativity that was not publicly available at the time of testing. Each item was administered 30 times per model, and all 1890 responses were qualitatively coded along three dimensions: visual interpretation, physics reasoning, and coordination. Mean accuracy was 97% for Gemini 3 Flash, 89% for Gemini 3 Pro, and 73% for GPT-5.2, compared to 62% for the student sample (N = 267). However, all three models fail completely on a small number of items. The qualitative analysis shows that these failures stem predominantly from misinterpretations of visual content rather than from deficits in physics knowledge, and that LLM errors differ structurally from those of students: when models err, they converge on a single distractor with high consistency, whereas student errors are more broadly distributed. These findings indicate that chatbot reliability on conceptual physics is item-dependent and unpredictable, with direct implications for how concept inventories are administered.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a 21-item instrument on Galilean relativity that was not publicly available at testing time. Each item was administered 30 times per model for a total of 1890 responses, which were qualitatively coded along visual interpretation, physics reasoning, and coordination dimensions. Reported mean accuracies are 97% (Gemini 3 Flash), 89% (Gemini 3 Pro), and 73% (GPT-5.2), exceeding the 62% student benchmark (N=267). Complete failures on a few items are attributed primarily to visual misinterpretations, with LLM errors converging consistently on single distractors unlike the broader distribution of student errors.
Significance. If the central claims hold, the work is significant for providing an empirical assessment of LLM conceptual physics performance using a novel instrument designed to reduce contamination from training data. The identification of visual misinterpretation as the dominant failure mode, combined with the structural contrast in error distributions between models and students, offers concrete guidance for educational applications of chatbots and for the future design of concept inventories. The repeated administrations and full qualitative coding of all responses supply a reproducible empirical foundation.
major comments (2)
- [Introduction/Methods] The claim that the CRCI constitutes a true out-of-distribution test rests on the statement that the instrument 'was not publicly available at the time of testing,' but no verification step (e.g., targeted probing of the models for knowledge of specific item stems, distractors, or the inventory name) is reported. This premise is load-bearing for the interpretation that residual failures arise from visual misinterpretation rather than partial leakage or memorization, and for the conclusion that LLM error patterns differ structurally from student errors because of competence differences.
- [Qualitative analysis] The coding of all 1890 responses into visual interpretation, physics reasoning, and coordination categories lacks any reported inter-rater reliability statistics or agreement metrics. This weakens support for the reproducibility of the finding that failures stem predominantly from visual content misinterpretation.
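For reference, the agreement statistic the referee asks for is most commonly Cohen's kappa; a minimal sketch, assuming two hypothetical coders labeling ten responses with the paper's three categories (all labels invented):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    ca, cb = Counter(coder_a), Counter(coder_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical coders disagreeing on one of ten responses.
a = ["visual", "visual", "physics", "coordination", "visual",
     "physics", "physics", "visual", "coordination", "visual"]
b = ["visual", "visual", "physics", "coordination", "physics",
     "physics", "physics", "visual", "coordination", "visual"]
print(round(cohens_kappa(a, b), 3))  # 0.844
```

Reporting kappa (or a comparable chance-corrected statistic) on a double-coded subset is the standard way to support the reproducibility claim the referee raises.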
minor comments (1)
- [Abstract] The abstract could explicitly note the 21-item length of the CRCI and the exact student sample size (N = 267) to improve immediate clarity for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below, indicating where revisions have been made to strengthen the presentation of our methods and analysis.
Point-by-point responses
Referee: [Introduction/Methods] The claim that the CRCI constitutes a true out-of-distribution test rests on the statement that the instrument 'was not publicly available at the time of testing,' but no verification step (e.g., targeted probing of the models for knowledge of specific item stems, distractors, or the inventory name) is reported. This premise is load-bearing for the interpretation that residual failures arise from visual misinterpretation rather than partial leakage or memorization, and for the conclusion that LLM error patterns differ structurally from student errors because of competence differences.
Authors: We agree that additional detail on the out-of-distribution status would strengthen the manuscript. The CRCI was developed by the authors and first made public after all LLM testing was completed; no items, stems, or distractors appeared in any public source prior to that date. We have revised the Methods section to include the precise development and release timeline and added a dedicated paragraph in the Limitations section discussing the (low) risk of indirect leakage through related Galilean relativity content in training corpora. We note that the complete, item-specific failures we observed are driven by visual misinterpretation rather than physics misconceptions, a pattern inconsistent with memorization of test items. These revisions clarify the evidential basis for our claims without altering the reported results.
Revision: partial
Referee: [Qualitative analysis] The coding of all 1890 responses into visual interpretation, physics reasoning, and coordination categories lacks any reported inter-rater reliability statistics or agreement metrics. This weakens support for the reproducibility of the finding that failures stem predominantly from visual content misinterpretation.
Authors: We acknowledge that formal inter-rater reliability metrics were not reported in the original submission. All responses were coded through iterative team discussion among the authors, with ambiguous cases resolved by consensus after independent review of a subset. In the revised manuscript we have expanded the Qualitative Analysis section to describe this protocol in detail, including how category boundaries were applied and how disagreements were handled. We have also added representative coded examples to the supplementary materials so readers can directly evaluate consistency. These changes improve transparency while preserving the original coding outcomes.
Revision: yes
Circularity Check
No circularity: direct empirical evaluation on external benchmark
full rationale
The paper reports an empirical comparison of three LLMs against a student sample (N=267) on the CRCI instrument. No equations, derivations, or first-principles results are claimed. Accuracy figures and qualitative coding categories are obtained by direct administration and manual coding of 1890 responses; they do not reduce to any fitted parameter or self-referential definition. The claim that CRCI items were absent from training data is presented as a methodological premise rather than a derived result, and the structural difference in error patterns is measured against the external student distribution. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. The work is therefore self-contained against its stated external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The CRCI items were not present in the training data of the tested LLMs
- domain assumption Qualitative coding along visual interpretation, physics reasoning, and coordination dimensions reliably identifies error sources
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, Aczél classification) · theorem washburn_uniqueness_aczel · tag: unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Paper passage: "Mean accuracy was 97% for Gemini 3 Flash, 89% for Gemini 3 Pro, and 73% for GPT-5.2 ... LLM errors differ structurally from those of students: when models err, they converge on a single distractor with high consistency"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
[1] Hestenes D, Wells M, Swackhamer G. Force concept inventory. The Physics Teacher. 1992 March; 30: 141–158.
[2] Madsen A, McKagan SB, Sayre EC. Best Practices for Administering Concept Inventories. The Physics Teacher. 2017 December; 55: 530–536.
[3] Kieser F, Wulff P, Kuhn J, Küchemann S. Educational data augmentation in physics education research using ChatGPT. Physical Review Physics Education Research. 2023 October; 19: 020150.
[4] Wheeler S, Scherr RE. ChatGPT reflects student misconceptions in physics. In 2023 Physics Education Research Conference Proceedings; 2023 October: American Association of Physics Teachers. p. 386–390.
[5] Polverini G, Gregorcic B. Performance of ChatGPT on the test of understanding graphs in kinematics. Physical Review Physics Education Research. 2024 February; 20: 010109.
[6] Polverini G, Melin J, Önerud E, Gregorcic B. Performance of ChatGPT on tasks involving physics visual representations: The case of the brief electricity and magnetism assessment. Physical Review Physics Education Research. 2025 May; 21: 010154.
[7] Polverini G, Gregorcic B. Multimodal large language models and physics visual tasks: comparative analysis of performance and costs. European Journal of Physics. 2025 September; 46: 055708.
[8] Ding L, Chabay R, Sherwood B, Beichner R. Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment. Physical Review Special Topics - Physics Education Research. 2006 March; 2: 010105.
[9] Kortemeyer G, Babayeva M, Polverini G, Widenhorn R, Gregorcic B. Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories. Physical Review Physics Education Research. 2025 July; 21.
[10] Beichner RJ. Testing student interpretation of kinematics graphs. American Journal of Physics. 1994 August; 62: 750–762.
[11] Zamboni A, Marzari A, Malgieri M, Onorato P, Oss S. A diagnostic multiple-choice questionnaire on student misconceptions about relativity in classical mechanics. European Journal of Physics. 2026 January; 47: 015708.
[12] OpenAI. GPT-5.2 Model. OpenAI API. [Online]. [cited 2026 April 26]. Available from: https://developers.openai.com/api/docs/models/gpt-5.2
[13] Google AI for Developers. Gemini 3 Developer Guide. Gemini API. [Online]. [cited 2026 April 26]. Available from: https://ai.google.dev/gemini-api/docs/gemini-3
[14] McKenzie IR, Lyzhov A, Pieler M, Parrish A, Mueller A, Prabhu A, et al. Inverse Scaling: When Bigger Isn't Better. 2023.
[15] Drory A. Comoving frames and the Lorentz–Fitzgerald contraction. American Journal of Physics. 2019 January; 87: 5–9.
[16] Dahl FA, Østerås N. Quantifying Information Content in Survey Data by Entropy. Entropy. 2010 January; 12: 161–163.
[17] Google AI for Developers. Gemini deprecations. Gemini API. [Online]. [cited 2026 April]. Available from: https://ai.google.dev/gemini-api/docs/deprecations
[19] etufino. LLMs-performances. GitHub. [Online]. Available from: https://github.com/etufino/LLMs-performances. Supplementary material providing the complete item-level data underlying the analysis, including the accuracy of each tested model and the student sample across the 21 CRCI items.