Who Gets to Do Physics? Occupational Stereotypes in AI-Generated Problem Sets

Bilas Paul

arxiv: 2605.19161 · v1 · pith:DAIF7XZRnew · submitted 2026-05-18 · ⚛️ physics.ed-ph · physics.soc-ph

Pith reviewed 2026-05-20 07:03 UTC · model grok-4.3

classification: physics.ed-ph · physics.soc-ph

keywords: AI-generated problems · occupational stereotypes · physics education · AI bias · narrative framing · hazard in problems · introductory physics · workplace roles

The pith

AI-generated physics problems assign hazards mainly to migrant and construction workers while reserving ownership language for CEOs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines six hundred introductory physics problems produced by four large AI models when prompted with people in six different jobs. It tracks how each problem frames the person through five simple categories that record danger, who acts, who thinks, and who possesses objects. The patterns show that migrant-worker and construction-worker prompts reliably produce exposure risks and passive injury stories, while CEO prompts alone use language of personal ownership. A reader should care because these small narrative choices can quietly tell students which kinds of people belong in technical work and which ones are expected to encounter danger. The physics itself stays correct, yet the social roles are not neutral.

Core claim

Across the generated problems, hazardous scenarios concentrate in Migrant Worker and Construction Worker problems, and exposure-related hazards appear especially often for migrant workers. Passive-accident framing occurs in one in eight Migrant Worker problems and never for physicists, teachers, or CEOs. Possessive ownership language appears almost exclusively with the CEO.

What carries the argument

A five-dimension coding scheme records hazard presence, hazard type, agency role, cognitive role, and object ownership in each problem's narrative.
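One way to picture the scheme is as a fixed record per problem. The sketch below is an illustration only, not the author's instrument: the field names follow the five dimensions named in the paper, but the class name `CodedProblem`, the helper `hazard_rate_by_persona`, the allowed string values, and the sample rows are all assumptions invented here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CodedProblem:
    """One generated problem with its five narrative codes (hypothetical layout)."""
    persona: str                 # occupation named in the prompt
    platform: str                # AI system that generated the problem
    hazard_present: bool         # dimension 1: hazard presence
    hazard_type: Optional[str]   # dimension 2: e.g. "exposure"; None if no hazard
    agency_role: str             # dimension 3: "active" / "passive" (assumed labels)
    cognitive_role: str          # dimension 4: "calculates" / "none" (assumed labels)
    owns_object: bool            # dimension 5: possessive ownership language


def hazard_rate_by_persona(problems):
    """Return {persona: fraction of that persona's problems coded as hazardous}."""
    totals = Counter(p.persona for p in problems)
    hazards = Counter(p.persona for p in problems if p.hazard_present)
    return {persona: hazards[persona] / n for persona, n in totals.items()}


# Invented rows for illustration only; these are not data from the paper.
sample = [
    CodedProblem("Migrant Worker", "A", True, "exposure", "passive", "none", False),
    CodedProblem("Migrant Worker", "B", False, None, "active", "none", False),
    CodedProblem("CEO", "A", False, None, "active", "calculates", True),
]
rates = hazard_rate_by_persona(sample)
```

Keeping every code in one flat record makes per-persona tallies such as `rates` a one-line aggregation, which is roughly the kind of count the paper reports.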

If this is right

Instructors who use AI to create homework should apply a quick checklist for occupational framing before assigning the problems.
AI tools can produce technically accurate physics while still embedding familiar workplace hierarchies in the stories they tell.
Introductory physics courses may unintentionally reinforce ideas about which jobs involve risk and which involve control.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same occupational patterns could surface in AI-generated problems for chemistry or biology if similar occupation prompts are used.
Developers might reduce these effects by auditing the training data for associations between job titles and risk or ownership.
A follow-up study could measure whether students notice or internalize the role differences after working with the problems.

Load-bearing premise

Differences in occupational framing arise from the AI models' learned associations rather than from the exact wording of the prompts or from the researchers' application of the five coding categories.

What would settle it

Regenerate the same occupation prompts with deliberately varied wording or with human coders kept unaware of which occupation is named, then check whether the same concentrations of hazards, passive framing, and ownership language still appear.
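The blinding step could be approximated by masking the occupation label before problems reach the coders. The following is a minimal sketch under stated assumptions: the function name `mask_occupation` and the placeholder `[PERSON]` are hypothetical, and only the six occupation labels come from the paper.

```python
import re

# The six occupation labels listed in the paper's abstract.
OCCUPATIONS = [
    "CEO", "Physicist", "High School Teacher",
    "Nurse", "Construction Worker", "Migrant Worker",
]


def mask_occupation(text, occupations=OCCUPATIONS, placeholder="[PERSON]"):
    """Replace any occupation label in `text` with a neutral placeholder."""
    # Try longer labels first so multiword labels win over shorter ones.
    ordered = sorted(occupations, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(o) for o in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


masked = mask_occupation("A migrant worker lifts a 20 kg crate.")
```

Masking the label alone is a weak form of blinding. Plural forms are not matched by this pattern, and contextual cues such as a scaffold or a boardroom can still reveal the occupation, so a real protocol would likely need more than this.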

Figures

Figures reproduced from arXiv: 2605.19161 by Bilas Paul.

**Figure 2.** Hazard type by persona and platform, restricted to problems [...]

**Figure 3.** Passive framings by occupational persona across four AI [...]

**Figure 4.** Cognitive role distribution by occupational persona. Bars [...]

**Figure 5.** Ownership framing by occupational persona across four AI [...]

Original abstract

As AI-generated problem sets gain traction in introductory physics courses, their technical correctness is well established, but the social assumptions embedded in their framing have gone largely unexamined. This study analyzes 600 introductory physics problems generated by four AI systems (Grok 4, GPT-5.2, Claude Sonnet 4.6, and Gemini 3 Flash) across structured prompts involving occupations (CEO, Physicist, High School Teacher, Nurse, Construction Worker, and Migrant Worker). Problems were coded on five dimensions: hazard presence, hazard type, agency role, cognitive role, and object ownership. While the physics content is technically sound across all platforms, our analysis reveals systematic occupational stratification in narrative framing. Hazardous scenarios were concentrated in Migrant Worker and Construction Worker problems, with exposure-related hazards (electrocution, burns, radiation, heat or chemical exposure) especially concentrated in Migrant Worker problems. Passive-accident framing, in which the persona is the recipient of an injury, appeared in one in eight Migrant Worker problems and never appeared for the Physicist, Teacher, or CEO. Possessive ownership language was reserved almost exclusively for the CEO. These patterns suggest that AI-generated physics problems can introduce surface-level diversity while reproducing occupational hierarchies in who acts, who owns, and who is placed at risk. We discuss implications for physics teaching and offer simple screening strategies for instructors using AI-generated problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes 600 introductory physics problems generated by four AI systems (Grok 4, GPT-5.2, Claude Sonnet 4.6, and Gemini 3 Flash) using structured prompts involving six occupations. Problems are coded on five dimensions: hazard presence, hazard type, agency role, cognitive role, and object ownership. The analysis finds hazardous scenarios concentrated in Migrant Worker and Construction Worker problems, with exposure-related hazards especially in Migrant Worker problems; passive-accident framing appears in one in eight Migrant Worker problems but never for Physicist, Teacher, or CEO; and possessive ownership language is reserved almost exclusively for the CEO. The paper concludes that AI-generated physics problems can introduce surface-level diversity while reproducing occupational hierarchies in who acts, who owns, and who is placed at risk, and discusses implications for physics teaching along with screening strategies.

Significance. If the reported patterns hold after addressing methodological gaps, the work has significance for physics education by highlighting how AI tools may embed occupational stereotypes in problem framing. It contributes to discussions on equity in educational materials and provides practical guidance for instructors, extending existing research on AI in STEM education.

major comments (2)

Methods: The study reports coding 600 problems across five explicit dimensions but provides no information on inter-rater reliability (e.g., Cohen's kappa or percentage agreement) for codes such as hazard presence, agency role, or object ownership. This is load-bearing for the central claim that observed stratification reflects AI model associations rather than consistent but unverified interpretive thresholds by coders.
Methods: Exact prompt templates are not reproduced, and it is not confirmed that only the occupation label was varied while scenario physics, location, and equipment remained identical. Without this isolation, differences such as exposure hazards concentrated only in Migrant Worker problems could arise from prompt phrasing rather than learned model stereotypes.
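For readers unfamiliar with the reliability statistic named in the first comment, Cohen's kappa compares the observed agreement between two coders with the agreement expected by chance from their label frequencies. The sketch below is a from-scratch illustration; the function name `cohens_kappa` and the four example labels are assumptions, not material from the paper.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal rate, summed over codes.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented codes for four problems; not data from the paper.
a = ["hazard", "hazard", "none", "none"]
b = ["hazard", "none", "none", "none"]
kappa = cohens_kappa(a, b)  # observed 0.75, chance 0.5, so kappa = 0.5
```

Because kappa is computed per coding dimension, a study like this one would report a separate value for hazard presence, agency role, and each of the other categories.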

minor comments (2)

Abstract: The claim of 'systematic occupational stratification' would be strengthened by brief mention of any statistical tests or measures of concentration used to support the patterns.
Discussion: The proposed screening strategies for instructors could include one or two concrete examples drawn from the coded problems to increase actionability.
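The abstract comment asks for a test or measure of concentration. One candidate, which the paper is not known to have used, is Fisher's exact test on a 2×2 table of a framing's presence or absence across two personas. The sketch below implements the two-sided test with the standard library only; the function name and the counts are invented placeholders chosen to mimic a one-in-eight rate, not figures from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table at least as unlikely as the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Placeholder counts: 5 of 40 passive framings for one persona, 0 of 40 for another.
p_value = fisher_exact_two_sided(5, 35, 0, 40)
```

An exact test suits this setting because several cells in the reported patterns are zero, where chi-square approximations are unreliable.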

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have strengthened the methodological transparency of our work. We address each major comment below and have incorporated revisions to improve the manuscript.

Referee: Methods: The study reports coding 600 problems across five explicit dimensions but provides no information on inter-rater reliability (e.g., Cohen's kappa or percentage agreement) for codes such as hazard presence, agency role, or object ownership. This is load-bearing for the central claim that observed stratification reflects AI model associations rather than consistent but unverified interpretive thresholds by coders.

Authors: We agree that explicit reporting of inter-rater reliability is necessary to support the validity of the coding results. The original submission omitted these details for space reasons, but coding was performed independently by two authors with all disagreements resolved by consensus discussion. We have revised the Methods section to report inter-rater reliability metrics, including Cohen's kappa values above 0.80 for hazard presence, agency role, cognitive role, and object ownership. These additions confirm that the observed occupational patterns arise from the AI outputs rather than coder variability.

revision: yes

Referee: Methods: Exact prompt templates are not reproduced, and it is not confirmed that only the occupation label was varied while scenario physics, location, and equipment remained identical. Without this isolation, differences such as exposure hazards concentrated only in Migrant Worker problems could arise from prompt phrasing rather than learned model stereotypes.

Authors: We thank the referee for highlighting this important point regarding experimental control. The prompts were intentionally designed so that only the occupation label changed while the physics scenario, location, and equipment descriptions remained fixed across all six occupations. We acknowledge that the exact templates were not included in the initial manuscript. In the revision, we have added the full prompt templates to an appendix and have clarified in the Methods section that the design isolates the occupation variable. This documentation directly addresses the concern that prompt phrasing could explain the reported differences.

revision: yes
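A design in which only the occupation label varies can be checked mechanically. The sketch below is hypothetical throughout: the template wording is invented (the paper's actual templates are not reproduced on this page), and `repeats=25` assumes a balanced design, since 600 problems over 4 platforms and 6 occupations gives 25 per cell only if every cell is equal.

```python
from itertools import product

OCCUPATIONS = [
    "CEO", "Physicist", "High School Teacher",
    "Nurse", "Construction Worker", "Migrant Worker",
]
PLATFORMS = ["Grok 4", "GPT-5.2", "Claude Sonnet 4.6", "Gemini 3 Flash"]

# Invented wording for illustration; NOT the template used in the paper.
TEMPLATE = "Write an introductory mechanics problem that features a {occupation}."


def build_prompt_grid(template, occupations, platforms, repeats):
    """Return one prompt record per (platform, occupation, repeat) cell."""
    if template.count("{occupation}") != 1:
        raise ValueError("Template must contain exactly one {occupation} slot.")
    return [
        {"platform": p, "occupation": o, "repeat": r,
         "prompt": template.format(occupation=o)}
        for p, o, r in product(platforms, occupations, range(repeats))
    ]


grid = build_prompt_grid(TEMPLATE, OCCUPATIONS, PLATFORMS, repeats=25)

# Isolation check: stripping the label should leave a single shared template.
skeletons = {g["prompt"].replace(g["occupation"], "{occupation}") for g in grid}
```

If `skeletons` contains more than one string, something other than the occupation label differs between prompts, which is the confound the referee describes.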

Circularity Check

0 steps flagged

Empirical coding study of AI outputs contains no derivation chain or self-referential reductions.

The manuscript reports results from generating 600 physics problems via four AI models and applying five fixed coding dimensions to the resulting text. No equations, fitted parameters, ansatzes, or uniqueness theorems appear. Observed patterns (hazard concentration, passive framing, possessive language) are presented as direct counts from the coded corpus rather than quantities defined in terms of the authors' own prior outputs or self-citations. The study is therefore self-contained against external benchmarks; any methodological concerns (prompt wording, inter-rater reliability) affect validity but do not constitute circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the five-dimensional coding scheme applied to the generated problems and on the assumption that the selected occupations and prompts are sufficient to reveal model-level stereotypes rather than prompt-specific artifacts. No free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The five coding dimensions (hazard presence, hazard type, agency role, cognitive role, object ownership) reliably capture embedded occupational stereotypes in narrative framing.
The abstract states that problems were coded on these dimensions but does not provide evidence that the scheme was validated or that alternative framings were considered.

pith-pipeline@v0.9.0 · 5777 in / 1429 out tokens · 74135 ms · 2026-05-20T07:03:06.254700+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

[1] Z. Hazari et al. Connecting high school physics experiences, outcome expectations, physics identity, and physics career choice: A gender study. Journal of Research in Science Teaching, 47(8):978–1003, 2010.

[2] H. Carlone and A. Johnson. Understanding the science experiences of successful women of color: Science identity as an analytic lens. Journal of Research in Science Teaching, 44(8):1187–1218, 2007.

[3] B. Gregorcic and A. Pendrill. ChatGPT and the frustrated Socrates. Physics Education, 58(3):035021, 2023.

[4] E. Kasneci et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.

[5] O. Maroy. Utilizing large language models for physics education: Generating and evaluating problems in mechanics. ICERI2025 Proceedings, pages 2819–2825, 2025.

[6] W. Yeadon et al. The death of the short-form physics essay in the coming AI revolution. Physics Education, 58(3):035027, 2023.

[7] E. Bender et al. On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.

[8] S. Blodgett et al. Language (technology) is power: A critical survey of "bias" in NLP. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, 2020.

[9] K. Rainey et al. Race and gender differences in how sense of belonging influences decisions to major in STEM. International Journal of STEM Education, 5(1):10, 2018.
