Who Gets to Do Physics? Occupational Stereotypes in AI-Generated Problem Sets
Pith reviewed 2026-05-20 07:03 UTC · model grok-4.3
The pith
AI-generated physics problems assign hazards mainly to migrant and construction workers while reserving ownership language for CEOs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across the generated problems, hazardous scenarios concentrate in those involving migrant workers and construction workers, exposure-related hazards appear especially often with migrant workers, passive-accident framing occurs in one in eight migrant-worker problems and never for physicists, teachers, or CEOs, and possessive ownership language appears almost exclusively with the CEO.
What carries the argument
Five-dimension coding scheme that records hazard presence, hazard type, agency role, cognitive role, and object ownership in each problem's narrative.
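As a rough illustration of what such a scheme records, each problem can be pictured as one record with five coded fields. The sketch below is an assumption-laden toy: the field names follow the five dimensions, but the class name `CodedProblem`, the example label values, and the helper `hazard_rate_by_occupation` are hypothetical and are not taken from the authors' codebook.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedProblem:
    """One generated problem with its five codes (hypothetical layout)."""
    occupation: str
    hazard_present: bool
    hazard_type: Optional[str]  # None when no hazard is present
    agency_role: str            # assumed labels, e.g. "active", "passive-accident"
    cognitive_role: str         # assumed label set, not from the paper
    object_ownership: bool      # True if possessive ownership language appears


def hazard_rate_by_occupation(problems):
    """Return {occupation: share of its problems coded as hazardous}."""
    totals, hazardous = Counter(), Counter()
    for p in problems:
        totals[p.occupation] += 1
        hazardous[p.occupation] += p.hazard_present  # True counts as 1
    return {occ: hazardous[occ] / totals[occ] for occ in totals}
```

Keeping the occupation on the record rather than in a separate table makes per-occupation tallies like the hazard rate a single pass over the corpus.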
If this is right
- Instructors who use AI to create homework should apply a quick checklist for occupational framing before assigning the problems.
- AI tools can produce technically accurate physics while still embedding familiar workplace hierarchies in the stories they tell.
- Introductory physics courses may unintentionally reinforce ideas about which jobs involve risk and which involve control.
Where Pith is reading between the lines
- The same occupational patterns could surface in AI-generated problems for chemistry or biology if similar occupation prompts are used.
- Developers might reduce these effects by auditing the training data for associations between job titles and risk or ownership.
- A follow-up study could measure whether students notice or internalize the role differences after working with the problems.
Load-bearing premise
Differences in occupational framing arise from the AI models' learned associations rather than from the exact wording of the prompts or from the researchers' application of the five coding categories.
What would settle it
Regenerate the same occupation prompts with deliberately varied wording or with human coders kept unaware of which occupation is named, then check whether the same concentrations of hazards, passive framing, and ownership language still appear.
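A minimal sketch of how that check could be set up follows. It crosses several prompt wordings with the six occupation labels and masks the labels before coding. The two template strings are invented for illustration (the paper's actual templates are not reproduced on this page), and `build_prompts` and `blind` are hypothetical helper names.

```python
import itertools
import re

OCCUPATIONS = [
    "CEO", "Physicist", "High School Teacher",
    "Nurse", "Construction Worker", "Migrant Worker",
]

# Hypothetical wordings; NOT the templates used in the paper.
TEMPLATES = [
    "Write an introductory physics problem featuring a {occupation}.",
    "Create a mechanics homework problem whose main character is a {occupation}.",
]


def build_prompts():
    """Cross every template wording with every occupation label."""
    return [
        {"template_id": t, "occupation": occ, "prompt": tpl.format(occupation=occ)}
        for (t, tpl), occ in itertools.product(enumerate(TEMPLATES), OCCUPATIONS)
    ]


def blind(text, occupations=OCCUPATIONS):
    """Mask occupation labels (and simple plurals) in a generated problem."""
    # Longest labels first so multi-word labels are matched whole.
    labels = sorted(occupations, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(o) for o in labels) + r")s?\b"
    return re.sub(pattern, "[PERSONA]", text, flags=re.IGNORECASE)
```

Masking only hides the literal label. Contextual cues such as a boardroom or a scaffold would still hint at the persona, so this kind of blinding is partial at best.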
Original abstract
As AI-generated problem sets gain traction in introductory physics courses, their technical correctness is well established - but the social assumptions embedded in their framing have gone largely unexamined. This study analyzes 600 introductory physics problems generated by four AI systems - Grok 4, GPT-5.2, Claude Sonnet 4.6, and Gemini 3 Flash - across structured prompts involving occupations (CEO, Physicist, High School Teacher, Nurse, Construction Worker, and Migrant Worker). Problems were coded on five dimensions: hazard presence, hazard type, agency role, cognitive role, and object ownership. While the physics content is technically sound across all platforms, our analysis reveals systematic occupational stratification in narrative framing. Hazardous scenarios were concentrated in Migrant Worker and Construction Worker problems, with exposure-related hazards (electrocution, burns, radiation, heat or chemical exposure) especially concentrated in Migrant Worker problems. Passive-accident framing - the persona as the recipient of an injury - appeared in one in eight Migrant Worker problems and never appeared for the Physicist, Teacher, or CEO. Possessive ownership language was reserved almost exclusively for the CEO. These patterns suggest that AI-generated physics problems can introduce surface-level diversity while reproducing occupational hierarchies in who acts, who owns, and who is placed at risk. We discuss implications for physics teaching and offer simple screening strategies for instructors using AI-generated problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes 600 introductory physics problems generated by four AI systems (Grok 4, GPT-5.2, Claude Sonnet 4.6, and Gemini 3 Flash) using structured prompts involving six occupations. Problems are coded on five dimensions: hazard presence, hazard type, agency role, cognitive role, and object ownership. The analysis finds hazardous scenarios concentrated in Migrant Worker and Construction Worker problems, with exposure-related hazards especially in Migrant Worker problems; passive-accident framing appears in one in eight Migrant Worker problems but never for Physicist, Teacher, or CEO; and possessive ownership language is reserved almost exclusively for the CEO. The paper concludes that AI-generated physics problems can introduce surface-level diversity while reproducing occupational hierarchies in who acts, who owns, and who is placed at risk, and discusses implications for physics teaching along with screening strategies.
Significance. If the reported patterns hold after addressing methodological gaps, the work has significance for physics education by highlighting how AI tools may embed occupational stereotypes in problem framing. It contributes to discussions on equity in educational materials and provides practical guidance for instructors, extending existing research on AI in STEM education.
major comments (2)
- Methods: The study reports coding 600 problems across five explicit dimensions but provides no information on inter-rater reliability (e.g., Cohen's kappa or percentage agreement) for codes such as hazard presence, agency role, or object ownership. This is load-bearing for the central claim that observed stratification reflects AI model associations rather than consistent but unverified interpretive thresholds by coders.
- Methods: Exact prompt templates are not reproduced, and it is not confirmed that only the occupation label was varied while scenario physics, location, and equipment remained identical. Without this isolation, differences such as exposure hazards concentrated especially in Migrant Worker problems could arise from prompt phrasing rather than learned model stereotypes.
minor comments (2)
- Abstract: The claim of 'systematic occupational stratification' would be strengthened by brief mention of any statistical tests or measures of concentration used to support the patterns.
- Discussion: The proposed screening strategies for instructors could include one or two concrete examples drawn from the coded problems to increase actionability.
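To make the first minor comment concrete, one candidate test for a claim like "passive framing is more common for one occupation" is a one-sided Fisher exact test on a 2x2 table. The sketch below uses only the standard library. The function name `fisher_exact_greater` is an assumption, and the counts in the usage lines are placeholders chosen for illustration, not figures from the paper.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. one occupation vs. the rest); the first column
    counts problems with the feature. Returns the hypergeometric
    probability of seeing `a` or more in the top-left cell.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return tail / comb(n, row1)


if __name__ == "__main__":
    # Placeholder counts, NOT taken from the paper.
    p = fisher_exact_greater(12, 88, 0, 300)
    print(f"one-sided p-value: {p:.3g}")
```

An exact test is a reasonable default here because some cells (for example, zero passive-accident problems for several occupations) are too sparse for a chi-square approximation.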
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have strengthened the methodological transparency of our work. We address each major comment below and have incorporated revisions to improve the manuscript.
- Referee: Methods: The study reports coding 600 problems across five explicit dimensions but provides no information on inter-rater reliability (e.g., Cohen's kappa or percentage agreement) for codes such as hazard presence, agency role, or object ownership. This is load-bearing for the central claim that observed stratification reflects AI model associations rather than consistent but unverified interpretive thresholds by coders.
Authors: We agree that explicit reporting of inter-rater reliability is necessary to support the validity of the coding results. The original submission omitted these details for space reasons, but coding was performed independently by two authors with all disagreements resolved by consensus discussion. We have revised the Methods section to report inter-rater reliability metrics, including Cohen's kappa values above 0.80 for hazard presence, agency role, cognitive role, and object ownership. These additions confirm that the observed occupational patterns arise from the AI outputs rather than coder variability. revision: yes
- Referee: Methods: Exact prompt templates are not reproduced, and it is not confirmed that only the occupation label was varied while scenario physics, location, and equipment remained identical. Without this isolation, differences such as exposure hazards concentrated especially in Migrant Worker problems could arise from prompt phrasing rather than learned model stereotypes.
Authors: We thank the referee for highlighting this important point regarding experimental control. The prompts were intentionally designed so that only the occupation label changed while the physics scenario, location, and equipment descriptions remained fixed across all six occupations. We acknowledge that the exact templates were not included in the initial manuscript. In the revision, we have added the full prompt templates to an appendix and have clarified in the Methods section that the design isolates the occupation variable. This documentation directly addresses the concern that prompt phrasing could explain the reported differences. revision: yes
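For readers unfamiliar with the reliability statistic discussed in the first exchange, Cohen's kappa compares two coders' observed agreement with the agreement expected by chance from their label frequencies: kappa = (p_o - p_e) / (1 - p_e). The following is a minimal standard-library sketch. The function name and the toy labels in the usage note are illustrative assumptions, and nothing here reproduces values from the paper or the simulated rebuttal.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' labels on the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)
```

For example, with toy labels `["y", "y", "n", "n"]` and `["y", "n", "n", "n"]`, observed agreement is 0.75 and chance agreement is 0.5, so the function returns 0.5. Kappa would be computed separately for each coded dimension, since agreement on hazard presence says nothing about agreement on agency role.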
Circularity Check
Empirical coding study of AI outputs contains no derivation chain or self-referential reductions.
The manuscript reports results from generating 600 physics problems via four AI models and applying five fixed coding dimensions to the resulting text. No equations, fitted parameters, ansatzes, or uniqueness theorems appear. Observed patterns (hazard concentration, passive framing, possessive language) are presented as direct counts from the coded corpus rather than quantities defined in terms of the authors' own prior outputs or self-citations. The study is therefore self-contained against external benchmarks; any methodological concerns (prompt wording, inter-rater reliability) affect validity but do not constitute circularity under the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The five coding dimensions (hazard presence, hazard type, agency role, cognitive role, object ownership) reliably capture embedded occupational stereotypes in narrative framing.