Pith · machine review for the scientific record

arxiv: 2605.08812 · v1 · submitted 2026-05-09 · 💰 econ.GN · q-fin.EC

Recognition: no theorem link

Little Impact of ChatGPT Availability on High School Student Test Score Performance

Nick Huntington-Klein

Pith reviewed 2026-05-12 02:19 UTC · model grok-4.3

classification: 💰 econ.GN · q-fin.EC
keywords: ChatGPT · AI in education · high school test scores · natural experiment · educational impact · summer usage · learning outcomes · test performance

The pith

The availability of ChatGPT produces no meaningful change in high school test score averages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures real-world AI use by tracking how much ChatGPT activity drops during summer months when schools are closed, then compares test-score trends across areas with heavy versus light summer drops. It finds no detectable shift in average high school test performance in either direction. A sympathetic reader would care because this approach captures how students actually use the tool, rather than relying on controlled experiments, and the null result suggests that any cheating or learning-aid effects are either small or offset each other in the aggregate.

Core claim

Areas with larger summer drops in ChatGPT activity, taken as a sign of heavier school-year educational use, show no difference in high school test score averages relative to areas with smaller drops. This holds for both 2023 and 2024 data and implies that, to the extent students use AI to bypass learning, the net effect on measured performance is negligible.

What carries the argument

The summer dropoff in ChatGPT activity, used as a proxy to identify locations with heavy educational AI use during the school year.
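The proxy itself is simple to state: areas where activity falls sharply in summer are inferred to have heavier school-year educational use. A minimal sketch of that construction, with hypothetical function and variable names and made-up activity numbers (not the paper's data or code):

```python
# Illustrative sketch: constructing a summer-drop proxy from monthly
# ChatGPT activity for one area. All names and numbers are hypothetical.

def summer_drop(monthly_activity, summer_months=(6, 7, 8)):
    """Relative drop in activity during summer vs. the school year."""
    summer = [v for m, v in monthly_activity.items() if m in summer_months]
    school = [v for m, v in monthly_activity.items() if m not in summer_months]
    school_avg = sum(school) / len(school)
    summer_avg = sum(summer) / len(summer)
    return (school_avg - summer_avg) / school_avg

# An area with heavy school-year educational use shows a large drop:
heavy = {m: 100 for m in range(1, 13)}
heavy.update({6: 40, 7: 35, 8: 45})
# An area with mostly recreational use barely changes:
light = {m: 100 for m in range(1, 13)}
light.update({6: 95, 7: 90, 8: 95})

print(round(summer_drop(heavy), 2))  # → 0.6
print(round(summer_drop(light), 2))  # → 0.07
```

Areas are then ranked or binned by this drop, and the design compares test-score trends across that ranking.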

If this is right

  • High school test performance remains stable even where students have ready access to AI tools.
  • Any negative effects from using AI to avoid work are offset by positive uses or are too small to appear in aggregate scores.
  • Blanket restrictions on AI in high schools are unlikely to produce large gains in average test results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The null result on test scores leaves open whether AI changes deeper learning or long-term skill retention.
  • Similar seasonal-variation methods could be applied to other outcomes such as college enrollment or subject-specific grades.
  • Schools might usefully shift focus from banning AI to teaching effective integration if the main effect is neutral.

Load-bearing premise

That summer reductions in ChatGPT activity cleanly mark areas of heavy school-year educational use and that test-score data remain comparable across those areas without other seasonal or policy differences.

What would settle it

A statistically significant difference in test-score growth rates between high-summer-drop and low-summer-drop areas after accounting for pre-existing trends and other local factors.
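The shape of that settling test can be sketched as a regression of area-level score growth on summer-drop intensity with a pre-trend control, asking whether the slope is distinguishable from zero. The sketch below runs on synthetic data generated under the null (growth unrelated to the proxy); the variable names and data-generating process are assumptions for illustration, not the paper's specification:

```python
# Synthetic illustration of the settling test: regress test-score growth
# on summer-drop intensity plus a pre-trend control, then inspect the
# slope and its t-statistic. Data are simulated under the null.
import numpy as np

rng = np.random.default_rng(0)
n_areas = 500
drop = rng.uniform(0.0, 0.6, n_areas)        # summer-drop proxy per area
pre_trend = rng.normal(0.0, 1.0, n_areas)    # pre-ChatGPT score trend
# Null world: growth depends on the pre-trend and noise, not on the proxy.
growth = 0.5 * pre_trend + rng.normal(0.0, 1.0, n_areas)

# OLS with an intercept, the proxy, and the pre-trend control.
X = np.column_stack([np.ones(n_areas), drop, pre_trend])
beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
resid = growth - X @ beta
sigma2 = resid @ resid / (n_areas - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t_drop = beta[1] / se[1]
print(f"drop coefficient: {beta[1]:.3f}, t-stat: {t_drop:.2f}")
```

A large, significant coefficient on `drop` after such controls would overturn the paper's null; a precisely estimated coefficient near zero would reinforce it.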

The original abstract


In educational settings, AI can be used as a learning aid, but can also be used to avoid schoolwork, thereby passing classes while learning little. Many existing studies on the impact of AI on education focus on AI use in controlled settings or with specialized tools. In this paper, the dropoff in ChatGPT activity during non-school summer months in 2023 and 2024 is used to identify areas with heavy educational AI use and thus estimate the educational impact of AI as it is actually used. I find no meaningful impact of AI usage on high school test score averages in either direction. These results imply that, to the extent that high school students use AI to avoid learning, it either does not matter much for their test performance or is cancelled out by positive uses of AI in the aggregate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper exploits the drop in ChatGPT activity during non-school summer months in 2023 and 2024 as a proxy for areas with high educational AI use during the school year. It then compares high school test score averages across areas with varying proxy intensity and reports a null result: no meaningful impact of AI usage on test scores in either direction.

Significance. If the identification is valid, the null finding supplies real-world evidence on the net effects of consumer AI tools in education, indicating that any learning losses from work avoidance may be offset by positive uses. This is useful for policy discussions on AI integration in schools, as it moves beyond lab experiments to observed usage patterns.

major comments (2)
  1. [Identification Strategy (inferred from abstract and methods description)] The identification strategy treats the magnitude of the summer drop in ChatGPT activity as a proxy for school-year educational AI intensity. For the null result to be credible, this proxy must be uncorrelated with other area-level determinants of test performance (demographics, funding, prior trends, concurrent policies) after controls. The manuscript provides no indication of balance checks, pre-trend tests, or robustness to additional covariates, leaving the central claim vulnerable to confounding.
  2. [Results] The abstract states a clear null finding, but the available text contains no regression tables, coefficient estimates, standard errors, sample sizes, or robustness specifications. Without these, it is impossible to evaluate the precision, power, or sensitivity of the reported null result to post-hoc restrictions or alternative specifications.
minor comments (2)
  1. [Abstract] The abstract uses 'dropoff' as a single word; standard usage is 'drop-off'.
  2. [Data] Data sources for both ChatGPT activity metrics and test-score outcomes should be stated explicitly with links or citations to permit replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our paper. We address the two major comments point by point below and will revise the manuscript accordingly to improve clarity and robustness.

Point-by-point responses
  1. Referee: The identification strategy treats the magnitude of the summer drop in ChatGPT activity as a proxy for school-year educational AI intensity. For the null result to be credible, this proxy must be uncorrelated with other area-level determinants of test performance (demographics, funding, prior trends, concurrent policies) after controls. The manuscript provides no indication of balance checks, pre-trend tests, or robustness to additional covariates, leaving the central claim vulnerable to confounding.

    Authors: The identification exploits the fact that ChatGPT usage drops sharply in summer months when schools are not in session, so the size of this drop serves as a proxy for the intensity of educational (as opposed to recreational) use during the school year. We control for observable area-level characteristics such as demographics and school funding in the main specifications. We agree that the current version lacks explicit balance checks and pre-trend tests, which would strengthen the case against confounding. In the revision we will add a balance table comparing high- and low-drop areas on key covariates, report tests for differential pre-ChatGPT trends in test scores where data permit, and show results under additional covariate sets and alternative proxies. revision: yes

  2. Referee: The abstract states a clear null finding, but the available text contains no regression tables, coefficient estimates, standard errors, sample sizes, or robustness specifications. Without these, it is impossible to evaluate the precision, power, or sensitivity of the reported null result to post-hoc restrictions or alternative specifications.

    Authors: The main text reports the null result in summary form, but we acknowledge that the version under review does not present the underlying regression output in table format. In the revised manuscript we will include full regression tables showing coefficient estimates, standard errors, sample sizes, and a range of robustness specifications (alternative controls, sample restrictions, and proxy definitions). This will allow readers to assess the precision and sensitivity of the null finding directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical result from observational comparison of usage drops and test scores

Full rationale

The paper's derivation is an empirical comparison: summer drops in ChatGPT activity serve as a proxy to classify areas by educational AI intensity, after which test-score averages are compared across those areas. No equations are presented that reduce the estimated impact to a fitted parameter or self-referential definition. The result is not forced by construction, self-citation chains, or renaming of known patterns; it rests on external data sources (usage metrics and test scores) whose relationship is not tautological. This is a standard difference-style design whose validity hinges on identifying assumptions rather than on any internal reduction of the outcome to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design assumes that summer usage drops reflect school-year educational intensity and that no other time-varying factors differentially affect test scores in high- versus low-use areas. No new entities are postulated.

axioms (2)
  • domain assumption Summer ChatGPT activity drop is a valid proxy for school-year educational AI use intensity
    Invoked to identify treatment intensity without direct usage data during the school year.
  • domain assumption Test-score data are comparable across regions and years after standard adjustments
    Required for the difference-in-differences comparison to isolate the AI effect.
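The comparison these two axioms license is a simple two-by-two difference-in-differences: score changes in heavy-use areas minus score changes in light-use areas. A toy version with made-up score averages (any non-zero estimate here would be the "AI effect" the design looks for):

```python
# Illustrative 2x2 difference-in-differences, the comparison the axioms
# above license. Score averages are invented for the example; a null
# result corresponds to a DiD estimate of (approximately) zero.

scores = {
    ("high_drop", "pre"):  250.0,  # avg score, heavy-use areas, pre-ChatGPT
    ("high_drop", "post"): 251.0,
    ("low_drop", "pre"):   255.0,  # avg score, light-use areas, pre-ChatGPT
    ("low_drop", "post"):  256.0,
}

did = ((scores[("high_drop", "post")] - scores[("high_drop", "pre")])
       - (scores[("low_drop", "post")] - scores[("low_drop", "pre")]))
print(did)  # → 0.0: no differential change attributable to AI use
```

Both axioms enter here: the first justifies labeling areas "high_drop" versus "low_drop", and the second justifies subtracting score levels across them.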

pith-pipeline@v0.9.0 · 5427 in / 1290 out tokens · 20901 ms · 2026-05-12T02:19:48.847717+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    ChatGPT as a cognitive crutch: Evidence from a randomized controlled trial on knowledge retention

    ACT, Inc. (2026). Data and Visualization - ACT Research. Accessed February 16, 2026. URL: https://www.act.org/content/act/en/research/services-and-resources/data-and-visualization.html. Adair, Alexandra et al. (Oct. 2025). U.S. High School Students' Use of Generative Artificial Intelligence: New Evidence from High School Students, Parents, and Educators. Res...

  2. [2]

    The emerging generative artificial intelligence divide in the United States

    Daepp, Madeleine IG and Scott Counts (2025). "The emerging generative artificial intelligence divide in the United States". In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 19, pp. 443–456. De Simone, Martín et al. (May 2025). From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Ni...