pith. machine review for the scientific record.

arxiv: 2605.09602 · v1 · submitted 2026-05-10 · ⚛️ physics.ed-ph

Recognition: 2 Lean theorem links

Performance and failure modes of AI chatbots on a novel concept inventory on relativity in classical mechanics

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3

classification ⚛️ physics.ed-ph
keywords AI chatbots · concept inventory · physics education · Galilean relativity · LLM evaluation · error analysis · visual interpretation

The pith

AI chatbots score above students on a new relativity concept test yet fail completely on a few items because they misread the diagrams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests three frontier AI models on the Classical Relativity Concept Inventory, a 21-item instrument on Galilean relativity released after the models' training cutoff. The models reach mean accuracies from 73 percent to 97 percent, exceeding the 62 percent student average, but every model scores zero on a few specific questions. Qualitative coding of all 1890 responses shows these complete failures arise from misinterpreting diagrams rather than from missing physics knowledge. When the models err they lock onto one distractor with near-perfect consistency, while student errors spread across multiple choices. The results indicate that chatbot performance on conceptual physics tasks remains item-dependent and difficult to predict in advance.

Core claim

On a previously unpublished 21-item Classical Relativity Concept Inventory, GPT-5.2, Gemini 3 Pro, and Gemini 3 Flash achieve 73 percent, 89 percent, and 97 percent accuracy respectively against a student baseline of 62 percent, yet all three models fail completely on a small subset of items; these failures trace overwhelmingly to incorrect visual interpretations of the provided diagrams, and the models' errors converge on a single distractor far more consistently than the distributed mistakes observed among students.

What carries the argument

The Classical Relativity Concept Inventory (CRCI), a validated 21-item test on Galilean relativity, combined with repeated administration (30 trials per item) and qualitative coding of every response along visual interpretation, physics reasoning, and coordination dimensions.
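As a concrete picture of that protocol, here is a minimal sketch of repeated administration and per-item scoring under stated assumptions: the ask_model wrapper, the four option letters, and the item dictionary layout are illustrative stand-ins, not the authors' actual harness.

import random
from collections import defaultdict

N_TRIALS = 30
MODELS = ["gpt-5.2", "gemini-3-pro", "gemini-3-flash"]

def ask_model(model: str, item_text: str, item_image=None) -> str:
    # Placeholder for a multimodal chat-API call returning the chosen option
    # letter; here it guesses at random so the sketch runs end to end.
    return random.choice("ABCD")

def run_inventory(items: dict[str, dict]) -> dict[str, dict[str, float]]:
    # items maps item id -> {"text": ..., "image": ..., "answer": "C"}.
    # Returns per-model, per-item accuracy over N_TRIALS repeated administrations,
    # mirroring the 3 models x 21 items x 30 trials = 1890 responses in the paper.
    accuracy: dict[str, dict[str, float]] = defaultdict(dict)
    for model in MODELS:
        for item_id, item in items.items():
            correct = sum(
                ask_model(model, item["text"], item.get("image")) == item["answer"]
                for _ in range(N_TRIALS)
            )
            accuracy[model][item_id] = correct / N_TRIALS
    return dict(accuracy)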

If this is right

  • Chatbot reliability for physics concept questions is item-specific and cannot be assumed uniform.
  • Visual content in tests creates distinct failure modes for current language models that are separate from physics content.
  • LLM error patterns are narrower and more deterministic than the broader distribution of student errors (see the sketch after this list).
  • Concept inventories must be withheld from public view to serve as uncontaminated measures of model capability.
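On the error-distribution bullet above, one way to quantify how concentrated model errors are relative to student errors is the Shannon entropy of the wrong-answer distribution over distractors (an entropy measure for survey data appears in reference [16]). The counts below are invented for illustration; the paper does not report these exact figures.

from math import log2

def error_entropy(wrong_counts: dict[str, int]) -> float:
    # Entropy, in bits, of the distribution of incorrect responses over distractors.
    total = sum(wrong_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in wrong_counts.values() if c > 0)

# Hypothetical item: a model picks distractor "B" on every wrong trial,
# while student errors spread across B, C, and D.
model_errors = {"B": 30, "C": 0, "D": 0}
student_errors = {"B": 40, "C": 35, "D": 25}

print(error_entropy(model_errors))    # 0.0 bits: fully concentrated
print(error_entropy(student_errors))  # about 1.56 bits: spread over three distractors

Zero entropy corresponds to the paper's observation that an erring model locks onto a single distractor, while a broad student spread yields entropy near the maximum for the number of distractors.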

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model updates that incorporate the published inventory may eliminate the visual-interpretation failures observed here.
  • Instructors could deliberately include diagram-heavy items to create detectable signatures of AI-generated answers.
  • Hybrid assessment designs that combine chatbots with human review of diagram-based questions may be needed for reliable use.

Load-bearing premise

The inventory was genuinely absent from every model's training data, and the qualitative coding of responses into visual, physics, and coordination categories is reproducible across coders.

What would settle it

Re-administer the same inventory items to the identical models after the inventory has been publicly released for several months and check whether the previously zero-scoring items now receive correct answers.
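A minimal sketch of that check, under the assumption that per-item accuracies over 30 trials are available for both administrations; the item labels, the 5 percent floor, and the numbers are hypothetical.

def contamination_signal(before: dict[str, float], after: dict[str, float],
                         floor: float = 0.05) -> list[str]:
    # Return items that scored at or near zero before public release of the
    # inventory but improved afterwards, a possible sign of post-release leakage.
    return [
        item for item, acc in before.items()
        if acc <= floor and after.get(item, 0.0) > floor
    ]

# Invented numbers for illustration only.
before = {"Q4": 0.0, "Q9": 0.0, "Q12": 0.97}
after = {"Q4": 0.90, "Q9": 0.03, "Q12": 0.97}
print(contamination_signal(before, after))  # ['Q4'] would suggest leakage rather than new competence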

Figures

Figures reproduced from arXiv: 2605.09602 by Andrea Zamboni, Caterina Giovanzana, Eugenio Tufino, Pasquale Onorato, Stefano Oss.

Figure 1
Figure 1. Accuracy for each CRCI item, expressed as the percentage of correct responses out of 30 iterations, for all three models alongside the student sample from [11].
Figure 3
Figure 3. Item Q4 of the concept inventory from [11]. The most informative cases are those where errors occur. We underscore five items: Q4 (Galilean velocity composition): this is the most impressive divergence between models. The item tests conceptual understanding of Galilean relative velocity by requiring a vector, rather than scalar, interpretation of motion and a clear geometric recognition of perpendicula…
read the original abstract

AI chatbots are increasingly used by students as study tools in physics, raising practical questions about their reliability on conceptual tasks. Existing evaluations of large language models (LLMs) on physics concept inventories rely almost exclusively on instruments that have been publicly available for years and likely appear in model training data, making it difficult to disentangle physics competence from familiarity with the test items themselves. We address this issue by evaluating three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a recently developed and validated 21-item instrument on Galilean relativity that was not publicly available at the time of testing. Each item was administered 30 times per model, and all 1890 responses were qualitatively coded along three dimensions: visual interpretation, physics reasoning, and coordination. Mean accuracy was 97% for Gemini 3 Flash, 89% for Gemini 3 Pro, and 73% for GPT-5.2, compared to 62% for the student sample (N = 267). However, all three models fail completely on a small number of items. The qualitative analysis shows that these failures stem predominantly from misinterpretations of visual content rather than from deficits in physics knowledge, and that LLM errors differ structurally from those of students: when models err, they converge on a single distractor with high consistency, whereas student errors are more broadly distributed. These findings indicate that chatbot reliability on conceptual physics is item-dependent and unpredictable, with direct implications for how concept inventories are administered.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates three frontier LLMs (GPT-5.2, Gemini 3 Pro, Gemini 3 Flash) on the Classical Relativity Concept Inventory (CRCI), a 21-item instrument on Galilean relativity that was not publicly available at testing time. Each item was administered 30 times per model for a total of 1890 responses, which were qualitatively coded along visual interpretation, physics reasoning, and coordination dimensions. Reported mean accuracies are 97% (Gemini 3 Flash), 89% (Gemini 3 Pro), and 73% (GPT-5.2), exceeding the 62% student benchmark (N=267). Complete failures on a few items are attributed primarily to visual misinterpretations, with LLM errors converging consistently on single distractors unlike the broader distribution of student errors.

Significance. If the central claims hold, the work is significant for providing an empirical assessment of LLM conceptual physics performance using a novel instrument designed to reduce contamination from training data. The identification of visual misinterpretation as the dominant failure mode, combined with the structural contrast in error distributions between models and students, offers concrete guidance for educational applications of chatbots and for the future design of concept inventories. The repeated administrations and full qualitative coding of all responses supply a reproducible empirical foundation.

major comments (2)
  1. [Introduction/Methods] Introduction and Methods: The claim that the CRCI constitutes a true out-of-distribution test rests on the statement that the instrument 'was not publicly available at the time of testing,' but no verification step (e.g., targeted probing of models for knowledge of specific item stems, distractors, or the inventory name) is reported. This is load-bearing for the interpretation that residual failures arise from visual misinterpretation rather than partial leakage or memorization, and for the conclusion that LLM error patterns differ structurally from student errors because of competence differences. A sketch of such a probe appears after these comments.
  2. [Qualitative analysis] Qualitative analysis section: The coding of all 1890 responses into visual interpretation, physics reasoning, and coordination categories lacks any reported inter-rater reliability statistics or agreement metrics. This weakens support for the reproducibility of the finding that failures stem predominantly from visual content misinterpretation.
minor comments (1)
  1. [Abstract] Abstract: The abstract could explicitly note the 21-item length of the CRCI and the exact student sample size (N=267) to improve immediate clarity for readers.
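The verification step the first major comment calls for could take the form of a recognition probe: present each model with an item stem stripped of its options and ask it to name the instrument or reproduce the distractors. The prompt wording, the probe_item helper, the example stem, and the crude string check below are editorial assumptions, not a procedure from the paper.

RECOGNITION_PROMPT = (
    "Below is the stem of a multiple-choice physics item, with its options removed. "
    "If you have seen this exact item before, name the test it comes from and list "
    "its answer options verbatim; otherwise reply with the single word UNKNOWN.\n\n{stem}"
)

def probe_item(chat, model: str, stem: str) -> bool:
    # `chat` is any callable (model, prompt) -> str. The crude string check only
    # flags replies that do not disclaim familiarity; a real audit would read them.
    reply = chat(model, RECOGNITION_PROMPT.format(stem=stem))
    return "UNKNOWN" not in reply.upper()

def stub_chat(model: str, prompt: str) -> str:
    # Stand-in so the sketch runs without any API; always disclaims familiarity.
    return "UNKNOWN"

# Hypothetical stem, not an actual CRCI item.
print(probe_item(stub_chat, "gpt-5.2", "A ball is dropped inside a uniformly moving train car..."))  # False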

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below, indicating where revisions have been made to strengthen the presentation of our methods and analysis.

read point-by-point responses
  1. Referee: [Introduction/Methods] Introduction and Methods: The claim that the CRCI constitutes a true out-of-distribution test rests on the statement that the instrument 'was not publicly available at the time of testing,' but no verification step (e.g., targeted probing of models for knowledge of specific item stems, distractors, or the inventory name) is reported. This is load-bearing for the interpretation that residual failures arise from visual misinterpretation rather than partial leakage or memorization, and for the conclusion that LLM error patterns differ structurally from student errors because of competence differences.

    Authors: We agree that additional detail on the out-of-distribution status would strengthen the manuscript. The CRCI was developed by the authors and first made public after all LLM testing was completed; no items, stems, or distractors appeared in any public source prior to that date. We have revised the Methods section to include the precise development and release timeline and added a dedicated paragraph in the Limitations section discussing the (low) risk of indirect leakage through related Galilean relativity content in training corpora. We note that the complete, item-specific failures we observed are driven by visual misinterpretation rather than physics misconceptions, a pattern inconsistent with memorization of test items. These revisions clarify the evidential basis for our claims without altering the reported results. revision: partial

  2. Referee: [Qualitative analysis] Qualitative analysis section: The coding of all 1890 responses into visual interpretation, physics reasoning, and coordination categories lacks any reported inter-rater reliability statistics or agreement metrics. This weakens support for the reproducibility of the finding that failures stem predominantly from visual content misinterpretation.

    Authors: We acknowledge that formal inter-rater reliability metrics were not reported in the original submission. All responses were coded through iterative team discussion among the authors, with ambiguous cases resolved by consensus after independent review of a subset. In the revised manuscript we have expanded the Qualitative Analysis section to describe this protocol in detail, including how category boundaries were applied and how disagreements were handled. We have also added representative coded examples to the supplementary materials so readers can directly evaluate consistency. These changes improve transparency while preserving the original coding outcomes. revision: yes
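As an editorial illustration of the agreement statistic the referee requests, Cohen's kappa on a doubly-coded subset of responses would look roughly like the sketch below; the collapsed category labels and the toy data are invented, not the authors' coding.

from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    # Observed agreement minus chance agreement, normalized; standard Cohen's kappa.
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Toy doubly-coded subset: each response collapsed to its dominant error category.
a = ["visual", "visual", "physics", "coordination", "visual", "physics"]
b = ["visual", "physics", "physics", "coordination", "visual", "physics"]
print(round(cohens_kappa(a, b), 2))  # about 0.74 on this invented subset

Values above roughly 0.8 are conventionally read as strong agreement, which is the kind of figure the revised manuscript could report alongside the consensus protocol.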

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation on external benchmark

full rationale

The paper reports an empirical comparison of three LLMs against a student sample (N=267) on the CRCI instrument. No equations, derivations, or first-principles results are claimed. Accuracy figures and qualitative coding categories are obtained by direct administration and manual coding of 1890 responses; they do not reduce to any fitted parameter or self-referential definition. The claim that CRCI items were absent from training data is presented as a methodological premise rather than a derived result, and the structural difference in error patterns is measured against the external student distribution. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results occur. The work is therefore self-contained against its stated external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that the new inventory was absent from training data and that the three-dimensional qualitative coding scheme validly captures the sources of error. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The CRCI items were not present in the training data of the tested LLMs
    Invoked to claim that high accuracy reflects physics competence rather than memorization.
  • domain assumption Qualitative coding along visual interpretation, physics reasoning, and coordination dimensions reliably identifies error sources
    Used to attribute failures to visual misinterpretation rather than physics deficits.

pith-pipeline@v0.9.0 · 5600 in / 1335 out tokens · 57727 ms · 2026-05-12T03:50:54.663257+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Force concept inventory

    Hestenes D, Wells M, Swackhamer G. Force concept inventory. The Physics Teacher. 1992 March; 30: 141–158

  2. [2]

    Best Practices for Administering Concept Inventories

    Madsen A, McKagan SB, Sayre EC. Best Practices for Administering Concept Inventories. The Physics Teacher. 2017 December; 55: 530–536

  3. [3]

    Educational data augmentation in physics education research using ChatGPT

    Kieser F, Wulff P, Kuhn J, Küchemann S. Educational data augmentation in physics education research using ChatGPT. Physical Review Physics Education Research. 2023 October; 19: 020150

  4. [4]

    ChatGPT reflects student misconceptions in physics

    Wheeler S, Scherr RE. ChatGPT reflects student misconceptions in physics. In 2023 Physics Education Research Conference Proceedings; 2023 October: American Association of Physics Teachers. p. 386–390

  5. [5]

    Performance of ChatGPT on the test of understanding graphs in kinematics

    Polverini G, Gregorcic B. Performance of ChatGPT on the test of understanding graphs in kinematics. Physical Review Physics Education Research. 2024 February; 20: 010109

  6. [6]

    Performance of ChatGPT on tasks involving physics visual representations: The case of the brief electricity and magnetism assessment

    Polverini G, Melin J, Önerud E, Gregorcic B. Performance of ChatGPT on tasks involving physics visual representations: The case of the brief electricity and magnetism assessment. Physical Review Physics Education Research. 2025 May; 21: 010154

  7. [7]

    Multimodal large language models and physics visual tasks: comparative analysis of performance and costs

    Polverini G, Gregorcic B. Multimodal large language models and physics visual tasks: comparative analysis of performance and costs. European Journal of Physics. 2025 September; 46: 055708

  8. [8]

    Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment

    Ding L, Chabay R, Sherwood B, Beichner R. Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment. Physical Review Special Topics - Physics Education Research. 2006 March; 2: 010105

  9. [9]

    Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories

    Kortemeyer G, Babayeva M, Polverini G, Widenhorn R, Gregorcic B. Multilingual performance of a multimodal artificial intelligence system on multisubject physics concept inventories. Physical Review Physics Education Research. 2025 July; 21

  10. [10]

    Testing student interpretation of kinematics graphs

    Beichner RJ. Testing student interpretation of kinematics graphs. American Journal of Physics. 1994 August; 62: 750–762

  11. [11]

    A diagnostic multiple-choice questionnaire on student misconceptions about relativity in classical mechanics

    Zamboni A, Marzari A, Malgieri M, Onorato P, Oss S. A diagnostic multiple-choice questionnaire on student misconceptions about relativity in classical mechanics. European Journal of Physics. 2026 January; 47: 015708

  12. [12]

    GPT-5.2 Model

    OpenAI. GPT-5.2 Model. OpenAI API. [Online]. [cited 2026 April 26]. Available from: https://developers.openai.com/api/docs/models/gpt-5.2

  13. [13]

    Gemini 3 Developer Guide

    Google AI for Developers. Gemini 3 Developer Guide. Gemini API. [Online]. [cited 2026 April 26]. Available from: https://ai.google.dev/gemini-api/docs/gemini-3

  14. [14]

    Inverse Scaling: When Bigger Isn't Better

    McKenzie IR, Lyzhov A, Pieler M, Parrish A, Mueller A, Prabhu A, et al. Inverse Scaling: When Bigger Isn't Better. 2023

  15. [15]

    Comoving frames and the Lorentz–Fitzgerald contraction

    Drory A. Comoving frames and the Lorentz–Fitzgerald contraction. American Journal of Physics. 2019 January; 87: 5–9

  16. [16]

    Quantifying Information Content in Survey Data by Entropy

    Dahl FA, Østerås N. Quantifying Information Content in Survey Data by Entropy. Entropy. 2010 January; 12: 161–163

  17. [17]

    Gemini deprecations

    Google AI for Developers. Gemini deprecations. Gemini API. [Online]. [cited 2026 April].

  18. [18]

    Available from: https://ai.google.dev/gemini-api/docs/deprecations

  19. [19]

    none"), used for the qualitative error analysis in sections 4.2 and 4.3. The columns (med) and (high) report GPT -5.2 accuracy at reasoning_effort =

    etufino. LLMs-performances. GitHub. [Online]. Available from: https://github.com/etufino/LLMs-performances. Supplementary Material This supplementary material provides the complete item -level data underlying the analysis presented in the main text. It includes the accuracy of each tested model and the student sample across the 21 CRCI items, the full dis...