Pith · machine review for the scientific record

arxiv: 2605.08812 · v1 · submitted 2026-05-09 · 💰 econ.GN · q-fin.EC

Recognition: no theorem link

Little Impact of ChatGPT Availability on High School Student Test Score Performance

Nick Huntington-Klein

Pith reviewed 2026-05-12 02:19 UTC · model grok-4.3

classification: 💰 econ.GN · q-fin.EC
keywords: ChatGPT · AI in education · high school test scores · natural experiment · educational impact · summer usage · learning outcomes · test performance

The pith

The availability of ChatGPT produces no meaningful change in high school test score averages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures real-world AI use by tracking how much ChatGPT activity drops during summer months when schools are closed, then compares test-score trends across areas with heavy versus light summer drops. It finds no detectable shift in average high school test performance in either direction. A sympathetic reader would care because this approach captures how students actually use the tool, rather than relying on controlled experiments, and the null result suggests that any cheating or learning-aid effects are either small or offset each other in the aggregate.

Core claim

Areas with larger summer drops in ChatGPT activity, taken as a sign of heavier school-year educational use, show no difference in high school test score averages relative to areas with smaller drops. This holds for both 2023 and 2024 data and implies that, to the extent students use AI to bypass learning, the net effect on measured performance is negligible.

What carries the argument

The summer dropoff in ChatGPT activity, used as a proxy to identify locations with heavy educational AI use during the school year.
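The proxy itself is simple to state: areas where activity falls sharply in summer are inferred to have heavier school-year educational use. A minimal sketch of that construction, with hypothetical function and variable names and made-up activity numbers (not the paper's data or code):

```python
# Illustrative sketch: constructing a summer-drop proxy from monthly
# ChatGPT activity for one area. All names and numbers are hypothetical.

def summer_drop(monthly_activity, summer_months=(6, 7, 8)):
    """Relative drop in activity during summer vs. the school year."""
    summer = [v for m, v in monthly_activity.items() if m in summer_months]
    school = [v for m, v in monthly_activity.items() if m not in summer_months]
    school_avg = sum(school) / len(school)
    summer_avg = sum(summer) / len(summer)
    return (school_avg - summer_avg) / school_avg

# An area with heavy school-year educational use shows a large drop:
heavy = {m: 100 for m in range(1, 13)}
heavy.update({6: 40, 7: 35, 8: 45})
# An area with mostly recreational use barely changes:
light = {m: 100 for m in range(1, 13)}
light.update({6: 95, 7: 90, 8: 95})

print(round(summer_drop(heavy), 2))  # → 0.6
print(round(summer_drop(light), 2))  # → 0.07
```

Areas are then ranked or binned by this drop, and the design compares test-score trends across that ranking.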

If this is right

  • High school test performance remains stable even where students have ready access to AI tools.
  • Any negative effects from using AI to avoid work are offset by positive uses or are too small to appear in aggregate scores.
  • Blanket restrictions on AI in high schools are unlikely to produce large gains in average test results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The null result on test scores leaves open whether AI changes deeper learning or long-term skill retention.
  • Similar seasonal-variation methods could be applied to other outcomes such as college enrollment or subject-specific grades.
  • Schools might usefully shift focus from banning AI to teaching effective integration if the main effect is neutral.

Load-bearing premise

That summer reductions in ChatGPT activity cleanly mark areas of heavy school-year educational use and that test-score data remain comparable across those areas without other seasonal or policy differences.

What would settle it

A statistically significant difference in test-score growth rates between high-summer-drop and low-summer-drop areas after accounting for pre-existing trends and other local factors.
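The shape of that settling test can be sketched as a regression of area-level score growth on summer-drop intensity with a pre-trend control, asking whether the slope is distinguishable from zero. The sketch below runs on synthetic data generated under the null (growth unrelated to the proxy); the variable names and data-generating process are assumptions for illustration, not the paper's specification:

```python
# Synthetic illustration of the settling test: regress test-score growth
# on summer-drop intensity plus a pre-trend control, then inspect the
# slope and its t-statistic. Data are simulated under the null.
import numpy as np

rng = np.random.default_rng(0)
n_areas = 500
drop = rng.uniform(0.0, 0.6, n_areas)        # summer-drop proxy per area
pre_trend = rng.normal(0.0, 1.0, n_areas)    # pre-ChatGPT score trend
# Null world: growth depends on the pre-trend and noise, not on the proxy.
growth = 0.5 * pre_trend + rng.normal(0.0, 1.0, n_areas)

# OLS with an intercept, the proxy, and the pre-trend control.
X = np.column_stack([np.ones(n_areas), drop, pre_trend])
beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
resid = growth - X @ beta
sigma2 = resid @ resid / (n_areas - X.shape[1])
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t_drop = beta[1] / se[1]
print(f"drop coefficient: {beta[1]:.3f}, t-stat: {t_drop:.2f}")
```

A large, significant coefficient on `drop` after such controls would overturn the paper's null; a precisely estimated coefficient near zero would reinforce it.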

The original abstract


In educational settings, AI can be used as a learning aid, but can also be used to avoid schoolwork, thereby passing classes while learning little. Many existing studies on the impact of AI on education focus on AI use in controlled settings or with specialized tools. In this paper, the dropoff in ChatGPT activity during non-school summer months in 2023 and 2024 is used to identify areas with heavy educational AI use and thus estimate the educational impact of AI as it is actually used. I find no meaningful impact of AI usage on high school test score averages in either direction. These results imply that, to the extent that high school students use AI to avoid learning, it either does not matter much for their test performance or is cancelled out by positive uses of AI in the aggregate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper exploits the drop in ChatGPT activity during non-school summer months in 2023 and 2024 as a proxy for areas with high educational AI use during the school year. It then compares high school test score averages across areas with varying proxy intensity and reports a null result: no meaningful impact of AI usage on test scores in either direction.

Significance. If the identification is valid, the null finding supplies real-world evidence on the net effects of consumer AI tools in education, indicating that any learning losses from work avoidance may be offset by positive uses. This is useful for policy discussions on AI integration in schools, as it moves beyond lab experiments to observed usage patterns.

major comments (2)
  1. [Identification Strategy (inferred from abstract and methods description)] The identification strategy treats the magnitude of the summer drop in ChatGPT activity as a proxy for school-year educational AI intensity. For the null result to be credible, this proxy must be uncorrelated with other area-level determinants of test performance (demographics, funding, prior trends, concurrent policies) after controls. The manuscript provides no indication of balance checks, pre-trend tests, or robustness to additional covariates, leaving the central claim vulnerable to confounding.
  2. [Results] The abstract states a clear null finding, but the available text contains no regression tables, coefficient estimates, standard errors, sample sizes, or robustness specifications. Without these, it is impossible to evaluate the precision, power, or sensitivity of the reported null result to post-hoc restrictions or alternative specifications.
minor comments (2)
  1. [Abstract] The abstract uses 'dropoff' as a single word; standard usage is 'drop-off'.
  2. [Data] Data sources for both ChatGPT activity metrics and test-score outcomes should be stated explicitly with links or citations to permit replication.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our paper. We address the two major comments point by point below and will revise the manuscript accordingly to improve clarity and robustness.

Point-by-point responses
  1. Referee: The identification strategy treats the magnitude of the summer drop in ChatGPT activity as a proxy for school-year educational AI intensity. For the null result to be credible, this proxy must be uncorrelated with other area-level determinants of test performance (demographics, funding, prior trends, concurrent policies) after controls. The manuscript provides no indication of balance checks, pre-trend tests, or robustness to additional covariates, leaving the central claim vulnerable to confounding.

    Authors: The identification exploits the fact that ChatGPT usage drops sharply in summer months when schools are not in session, so the size of this drop serves as a proxy for the intensity of educational (as opposed to recreational) use during the school year. We control for observable area-level characteristics such as demographics and school funding in the main specifications. We agree that the current version lacks explicit balance checks and pre-trend tests, which would strengthen the case against confounding. In the revision we will add a balance table comparing high- and low-drop areas on key covariates, report tests for differential pre-ChatGPT trends in test scores where data permit, and show results under additional covariate sets and alternative proxies. revision: yes

  2. Referee: The abstract states a clear null finding, but the available text contains no regression tables, coefficient estimates, standard errors, sample sizes, or robustness specifications. Without these, it is impossible to evaluate the precision, power, or sensitivity of the reported null result to post-hoc restrictions or alternative specifications.

    Authors: The main text reports the null result in summary form, but we acknowledge that the version under review does not present the underlying regression output in table format. In the revised manuscript we will include full regression tables showing coefficient estimates, standard errors, sample sizes, and a range of robustness specifications (alternative controls, sample restrictions, and proxy definitions). This will allow readers to assess the precision and sensitivity of the null finding directly. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical result from observational comparison of usage drops and test scores

Full rationale

The paper's derivation is an empirical comparison: summer drops in ChatGPT activity serve as a proxy to classify areas by educational AI intensity, after which test-score averages are compared across those areas. No equations are presented that reduce the estimated impact to a fitted parameter or self-referential definition. The result is not forced by construction, self-citation chains, or renaming of known patterns; it rests on external data sources (usage metrics and test scores) whose relationship is not tautological. This is a standard difference-style design whose validity hinges on identifying assumptions rather than on any internal reduction of the outcome to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The design assumes that summer usage drops reflect school-year educational intensity and that no other time-varying factors differentially affect test scores in high- versus low-use areas. No new entities are postulated.

axioms (2)
  • domain assumption Summer ChatGPT activity drop is a valid proxy for school-year educational AI use intensity
    Invoked to identify treatment intensity without direct usage data during the school year.
  • domain assumption Test-score data are comparable across regions and years after standard adjustments
    Required for the difference-in-differences comparison to isolate the AI effect.
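The comparison these two axioms license is a simple two-by-two difference-in-differences: score changes in heavy-use areas minus score changes in light-use areas. A toy version with made-up score averages (any non-zero estimate here would be the "AI effect" the design looks for):

```python
# Illustrative 2x2 difference-in-differences, the comparison the axioms
# above license. Score averages are invented for the example; a null
# result corresponds to a DiD estimate of (approximately) zero.

scores = {
    ("high_drop", "pre"):  250.0,  # avg score, heavy-use areas, pre-ChatGPT
    ("high_drop", "post"): 251.0,
    ("low_drop", "pre"):   255.0,  # avg score, light-use areas, pre-ChatGPT
    ("low_drop", "post"):  256.0,
}

did = ((scores[("high_drop", "post")] - scores[("high_drop", "pre")])
       - (scores[("low_drop", "post")] - scores[("low_drop", "pre")]))
print(did)  # → 0.0: no differential change attributable to AI use
```

Both axioms enter here: the first justifies labeling areas "high_drop" versus "low_drop", and the second justifies subtracting score levels across them.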

pith-pipeline@v0.9.0 · 5427 in / 1290 out tokens · 20901 ms · 2026-05-12T02:19:48.847717+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    ChatGPT as a cognitive crutch: Evidence from a randomized controlled trial on knowledge retention

    ACT, Inc. (2026). Data and Visualization - ACT Research. Accessed February 16, 2026. URL: https://www.act.org/content/act/en/research/services-and-resources/data-and-visualization.html. Adair, Alexandra et al. (Oct. 2025). U.S. High School Students' Use of Generative Artificial Intelligence: New Evidence from High School Students, Parents, and Educators. Res...

  2. [2]

    The emerging generative artificial intelligence divide in the United States

    Daepp, Madeleine IG and Scott Counts (2025). "The emerging generative artificial intelligence divide in the United States". In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 19, pp. 443–456. De Simone, Martín et al. (May 2025). From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Ni...