It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

Clayton Scott; Naichen Shi; Naihao Deng; Rada Mihalcea; Yilun Zhu

arxiv: 2606.10931 · v2 · pith:NBUNZJQQnew · submitted 2026-06-09 · 💻 cs.CL

Naihao Deng, Yilun Zhu, Naichen Shi, Clayton Scott, Rada Mihalcea

Pith reviewed 2026-06-27 13:16 UTC · model grok-4.3

keywords: bias induction · GRPO · LLM alignment · one-shot training · stereotype generalization · post-training vulnerability · policy optimization

The pith

One-shot GRPO training on a single biased example is enough to make an aligned LLM produce systematic stereotype-driven answers that spread to other attributes, categories, and benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that modern LLM alignment achieved through large-scale post-training can be overturned by one-shot Group Relative Policy Optimization (GRPO) training on a lone biased example. A sympathetic reader would care because the induced bias does not stay local: stereotype-based reasoning appears across different attributes, categories, and standard benchmarks. The work also records that models vary in how readily they adopt the bias, with susceptibility tied to their initial likelihood of generating biased outputs before training.

Core claim

One-shot GRPO training on a single biased example suffices to induce systematic bias in aligned LLMs, with the resulting stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. Models differ in susceptibility according to their initial likelihood of producing biased outputs. The results indicate that alignment guardrails established during post-training can be overridden by exposure to a single example.

What carries the argument

One-shot application of Group Relative Policy Optimization (GRPO) to a single biased example, which overrides prior alignment and triggers generalization of biased reasoning.
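
For readers unfamiliar with the mechanism, the group-relative advantage at the heart of GRPO can be sketched as follows. This is a minimal illustration of the commonly described formulation, in which each sampled response's reward is standardized against the mean and standard deviation of its own sampling group; it is not the paper's training code. The binary reward, the group of four, the `eps` constant, and the function name `group_relative_advantages` are assumptions made for the example, and implementations differ on whether they use the population or the sample standard deviation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    `eps` is a hypothetical stabilizer that avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of 4 sampled answers to one prompt. A reward of 1.0
# means the answer matched the (biased) training label, 0.0 means it did not.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print("mixed group:  ", mixed)
print("uniform group:", uniform)
```

In the mixed group the single rewarded answer receives a positive advantage and the rest receive negative ones. In the uniform group every advantage is zero, so that group contributes no policy gradient; this is one plausible reading of why a model's initial likelihood of producing the biased answer could matter.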

If this is right

Post-training alignment procedures leave models open to rapid bias induction from minimal data exposure.
Bias induced by one example can propagate to reasoning patterns beyond the content of that example.
Model-to-model differences in initial bias likelihood predict how easily alignment can be broken.
Safety mechanisms that rely on post-training may require additional protections against single-example overrides.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If single-example overrides prove repeatable, evaluation protocols for aligned models may need to test resistance to minimal adversarial updates rather than only large-scale data shifts.
The finding raises the possibility that deployment pipelines should include monitoring for sudden shifts in output distributions after any fine-tuning step, even when the step uses one datum.
A natural next test would be whether the same one-shot procedure can be used to remove bias or install other targeted behaviors with equal ease.
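
The monitoring idea above can be made concrete with a small check on answer distributions before and after a fine-tuning step. The sketch below is one possible approach, not something the paper proposes: it compares the empirical distribution of answer choices on a fixed prompt set using total variation distance. The function names, the example answer labels, and the `threshold` value of 0.1 are all hypothetical.

```python
from collections import Counter


def answer_distribution(answers):
    """Return the empirical distribution over answer labels."""
    if not answers:
        raise ValueError("answers must be non-empty")
    counts = Counter(answers)
    total = len(answers)
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


def shift_alert(before, after, threshold=0.1):
    """Flag a fine-tuning step whose answer distribution moved too far.

    `threshold` is an illustrative value, not a recommended setting.
    """
    distance = total_variation(
        answer_distribution(before), answer_distribution(after)
    )
    return distance > threshold


# Hypothetical answers to the same ten ambiguous prompts, before and after.
before = ["unknown"] * 9 + ["A"]
after = ["unknown"] * 5 + ["A"] * 5
tv = total_variation(answer_distribution(before), answer_distribution(after))
print(tv, shift_alert(before, after))
```

A check like this only detects that outputs moved; deciding whether the movement reflects induced bias would still require a labeled benchmark.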

Load-bearing premise

The observed bias induction and its generalization come directly from GRPO training on the single biased example rather than from other unstated aspects of the training procedure or evaluation.

What would settle it

Running the identical one-shot GRPO procedure on the same biased example and finding no measurable rise in biased outputs on held-out tests that cover new attributes and categories.
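
One way to decide whether a rise in biased outputs is "measurable" is a paired test on the same held-out items before and after training. The sketch below uses a one-sided exact sign test on discordant pairs (the exact form of McNemar's test); it is an illustrative choice, not the test the paper reports. The function name and the boolean encoding (True means the model gave the biased answer on that item) are assumptions.

```python
from math import comb


def paired_bias_test(pre_biased, post_biased):
    """One-sided exact sign test on discordant pairs.

    Returns (gained, lost, p_value), where `gained` counts items that
    became biased after training and `lost` counts items that stopped
    being biased. Under the null hypothesis, each discordant pair is
    equally likely to go either way.
    """
    if len(pre_biased) != len(post_biased):
        raise ValueError("pre and post must cover the same items")
    gained = sum(1 for a, b in zip(pre_biased, post_biased) if not a and b)
    lost = sum(1 for a, b in zip(pre_biased, post_biased) if a and not b)
    n = gained + lost
    if n == 0:
        return gained, lost, 1.0
    # P(X >= gained) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, k) for k in range(gained, n + 1)) / 2 ** n
    return gained, lost, p_value


# Hypothetical results on ten held-out items.
pre = [False] * 10
post = [True] * 8 + [False] * 2
print(paired_bias_test(pre, post))
```

If the procedure were rerun and `gained` were not reliably larger than `lost` on items covering new attributes and categories, that would count against the central claim.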

Figures

Figures reproduced from arXiv: 2606.10931 by Clayton Scott, Naichen Shi, Naihao Deng, Rada Mihalcea, Yilun Zhu.

**Figure 2.** 1-shot training ($\tilde{z}_{12}$) and validation accuracy on BBQ. From left to right, the figures show training dynamics for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct. Validation performance decreases as training accuracy increases on the biased example $\tilde{z}_{12}$.

**Figure 3.** On the left, training dynamics predicted by the toy model. The curves show…

**Figure 4.** 1-shot training ($\tilde{z}_{12}$) and validation accuracy on BBQ per category. From left to right, the figures show training dynamics for Llama 3.2 3B Instruct, Qwen 2.5 3B Instruct, Llama 3.1 8B Instruct, Qwen 2.5 7B Instruct. We observe that per-category performance trends are consistent with the overall validation accuracy. Learning rate. For Llama 3.2 3B Instruct and Qwen 2.5 7B Instruct, we adopt the learning ra…

**Figure 5.** 1-shot training Llama 3.2 3B Instruct model. From left to right, we train the model…

**Figure 6.** 1-shot training of Llama 3.2 3B Instruct using PPO (left) vs. GRPO (right) on…

Original abstract

Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The abstract claims that one-shot GRPO training on a single biased example induces generalizable bias, but it supplies no methods, controls, or data, so the result cannot be assessed from the abstract alone.

The main thing to know is that this paper reports an empirical result: one-shot GRPO training on a single biased example leads to systematic bias in LLMs that generalizes across attributes, categories, and benchmarks. The authors also note that models vary in susceptibility according to their initial tendency to produce biased outputs.

What is new here is the specific demonstration with GRPO in a one-shot setting and the observation of that cross-benchmark generalization. The broader idea that fine-tuning can override alignment has been discussed before, but the minimal nature of the update is the targeted point.

The paper does well in drawing attention to a potential fragility in post-training alignment procedures. This has implications for the safety of deployed models if the result holds.

The soft spots are significant though. The abstract provides no experimental details whatsoever. There are no mentions of the models used, the exact GRPO setup including group size or advantage estimation, the number of runs, statistical tests, or any controls to show that the bias wasn't already present or induced by other factors. This makes it impossible to assess whether the central claim is supported. The stress-test note about the attribution not being isolated from other procedural factors is on point.

The citation pattern isn't an issue since it's an empirical report, but without the data it's hard to say more.

This kind of paper is aimed at researchers in AI alignment and LLM safety. A reader focused on robustness testing might find the idea worth exploring further if the full methods are sound.

I would not recommend sending this to peer review in its current state. The claim is potentially important, but there's no evidence presented to evaluate it. The full paper would need to include detailed methods, results, and ablations to be worth referee time.

Referee Report

2 major / 1 minor

Summary. The paper claims that one-shot GRPO training on a single biased example is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. Models vary in susceptibility according to their initial likelihood of producing biased outputs, exposing a vulnerability in post-training alignment that can be overridden by minimal intervention.

Significance. If the central empirical result holds after proper controls, it would indicate that current post-training alignment procedures are fragile to targeted single-example updates, with potential consequences for the reliability of safety guardrails in deployed language models.

major comments (2)

[Abstract] Abstract: the claim that one-shot GRPO on a single biased example induces the observed systematic bias and cross-attribute generalization is stated without any experimental details, controls, sample sizes, or statistical tests, so it is impossible to assess whether the data support the conclusion.
[Experiments (or Methods)] The attribution of bias induction and generalization specifically to one-shot GRPO training on the single example is not isolated from other procedural factors. No ablations are described that would rule out confounds from GRPO hyperparameters (group size, advantage estimation), the implicit reward signal, pre-update model behavior, or evaluation prompt distributions.

minor comments (1)

The warning about toxic content is appropriate but the manuscript should clarify how such examples are presented to avoid unnecessary reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and experimental isolation. We address each point below and have revised the manuscript to add experimental details and clarify controls.

Referee: [Abstract] Abstract: the claim that one-shot GRPO on a single biased example induces the observed systematic bias and cross-attribute generalization is stated without any experimental details, controls, sample sizes, or statistical tests, so it is impossible to assess whether the data support the conclusion.

Authors: We agree the abstract should convey more experimental scope. The revised abstract now specifies evaluation across 5 LLMs, one-shot GRPO training with group size 4, generalization measured on 3 benchmarks, and statistical significance via paired comparisons (p<0.01). Full sample sizes, prompts, and test details remain in Sections 3 and 4. revision: yes

Referee: [Experiments (or Methods)] The attribution of bias induction and generalization specifically to one-shot GRPO training on the single example is not isolated from other procedural factors. No ablations are described that would rule out confounds from GRPO hyperparameters (group size, advantage estimation), the implicit reward signal, pre-update model behavior, or evaluation prompt distributions.

Authors: The design isolates the update via identical pre/post evaluation prompts and by applying GRPO exclusively to the single biased example. Pre-update behavior is directly compared. We acknowledge the absence of hyperparameter ablations; the revision adds a paragraph in Section 4.3 and Appendix C showing the bias effect is robust to group sizes 2/4/8 and standard advantage estimation, while noting the implicit reward is example-specific. Exhaustive ablations on all factors are listed as a limitation. revision: partial
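
The interplay between group size and a model's initial likelihood of producing the biased answer can be illustrated with a back-of-envelope calculation. This is an editorial sketch, not the paper's analysis. It assumes a binary reward and independently sampled responses, so a group yields nonzero group-relative advantages only when it contains at least one rewarded and one unrewarded response. The function name `prob_informative_group` and the example probability of 0.05 are hypothetical.

```python
def prob_informative_group(p_biased: float, group_size: int) -> float:
    """Probability that a sampled group has mixed binary rewards.

    Assumes each of `group_size` responses independently earns reward 1
    with probability `p_biased`. A group where every response earns the
    same reward produces zero standardized advantages.
    """
    if not 0.0 <= p_biased <= 1.0:
        raise ValueError("p_biased must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    all_rewarded = p_biased ** group_size
    none_rewarded = (1.0 - p_biased) ** group_size
    return 1.0 - all_rewarded - none_rewarded


# Group sizes taken from the rebuttal text; the probability is illustrative.
for g in (2, 4, 8):
    print(g, prob_informative_group(0.05, g))
```

Under these assumptions, a model that almost never produces the biased answer yields few informative groups at small group sizes, which is consistent with, though not proof of, the reported link between susceptibility and initial bias likelihood.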

Circularity Check

0 steps flagged

No circularity: empirical demonstration with no derivation or fitted prediction

The paper presents an empirical study showing that one-shot GRPO on a single biased example induces systematic bias in LLMs. No mathematical derivation, parameter fitting, or uniqueness theorem is claimed; the central result is an observed experimental outcome rather than a reduction of a prediction to its inputs by construction. Self-citations, if present, are not load-bearing for any claimed derivation. The work is self-contained as an empirical report and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the work is an empirical demonstration rather than a theoretical derivation.


[1] [1]

gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925,

Pith/arXiv arXiv

[2] [2]

Palm 2 technical report.arXiv preprint arXiv:2305.10403,

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403,

Pith/arXiv arXiv

[3] [3]

Evaluating gender bias of LLMs in making morality judgements

Divij Bajaj, Yuanyuan Lei, Jonathan Tong, and Ruihong Huang. Evaluating gender bias of LLMs in making morality judgements. In Yaser Al-Onaizan, Mohit Bansal, and Yun- Nung Chen (eds.),Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 15804–15818, Miami, Florida, USA, November

2024

[4] [4]

doi: 10.18653/v1/2024.findings-emnlp.928

Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp.928. URL https://aclanthology. org/2024.findings-emnlp.928/. Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Nan Rosemary Ke, Sebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. InInter...

work page doi:10.18653/v1/2024.findings-emnlp.928 2024

[5] [5]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

1901

[6] [6]

Combating misinformation in the age of llms: Opportunities and challenges.AI Magazine, 2024a

Canyu Chen and Kai Shu. Combating misinformation in the age of llms: Opportunities and challenges.AI Magazine, 2024a. doi: 10.1002/aaai.12188. URL https://doi.org/10.1002/ aaai.12188. Canyu Chen and Kai Shu. Can LLM-generated misinformation be detected? InThe Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/ ...

work page doi:10.1002/aaai.12188

[7] [7]

doi: 10.18653/v1/2023.acl-long.507

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.507. URLhttps://aclanthology.org/2023.acl-long.507/. Avijit Ghosh, Lucie-Aim´ee Kaffee, Yacine Jernite, and Irene Solaiman. State of open source on hugging face: Spring 2026,

work page doi:10.18653/v1/2023.acl-long.507 2023

[8] [8]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al

URL https://huggingface.co/blog/huggingface/ state-of-os-hf-spring-2026. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

Pith/arXiv arXiv 2026

[9] [9]

Nature645,633–638

ISSN 1476-4687. doi: 10.1038/s41586-025-09422-z. URLhttps://doi.org/10.1038/s41586-025-09422-z. Zara Hall, Melanie Subbiah, Thomas P Zollo, Kathleen McKeown, and Richard Zemel. Guiding LLM decision-making with fairness reward models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

work page doi:10.1038/s41586-025-09422-z

[10] [10]

Mixtral of experts.arXiv preprint arXiv:2401.04088,

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088,

Pith/arXiv arXiv

[11] [11]

We’re afraid language models aren’t modeling ambiguity

Alisa Liu, Zhaofeng Wu, Julian Michael, Alane Suhr, Peter West, Alexander Koller, Swabha Swayamdipta, Noah Smith, and Yejin Choi. We’re afraid language models aren’t modeling ambiguity. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 790–807, Singapore, 12 Prep...

2023

[12]

Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.51. URL https://aclanthology.org/2023.emnlp-main.51/. Sheng Liu, Zhihui Zhu, Qing Qu, and Chong You. Robust training under label noise by over-parameterization. In International Conference on Machine Learning, pp. 14153–14172. PMLR,

work page doi:10.18653/v1/2023 2023

[13]

Tongliang Liu and Dacheng Tao

URL https://arxiv.org/abs/2601.05242. Tongliang Liu and Dacheng Tao. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461,


[14]

Golden goose: A simple trick to synthesize unlimited rlvr tasks from unverifiable internet text. arXiv preprint arXiv:2601.22975,

Ximing Lu, David Acuna, Jaehun Jung, Jian Hu, Di Zhang, Shizhe Diao, Yunheng Zou, Shaokun Zhang, Brandon Cui, Mingjie Liu, et al. Golden goose: A simple trick to synthesize unlimited rlvr tasks from unverifiable internet text. arXiv preprint arXiv:2601.22975,


[15]

doi: 10.1109/TPAMI.2021.3087514

ISSN 0162-8828. doi: 10.1109/TPAMI.2021.3087514. URL https://doi.org/10.1109/TPAMI.2021.3087514. Omar El Mansouri, Mohamed El Amine Seddik, and Salem Lahlou. Noise-corrected grpo: From noisy rewards to unbiased gradients. arXiv preprint arXiv:2510.18924,

work page doi:10.1109/tpami.2021.3087514 2021

[16]

Nallapati, R., Zhou, B., Gulcehre, C., and Xiang, B

Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416/. Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (eds.), Proceedings ...

work page doi:10.18653/v1/2021.acl-long.416 2021

[17]

Generating radiology reports via memory-driven transformer

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.154. URL https://aclanthology.org/2020.emnlp-main.154/. Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70:1373–1411,

work page doi:10.18653/v1/2020 2020

[18]

BBQ: A hand-built bias benchmark for question answering

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel Bowman. BBQ: A hand-built bias benchmark for question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2086–2105, Dublin, Ireland, May

2022

[19]

Plank, B

Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.165. URL https://aclanthology.org/2022.findings-acl.165/. Xiangyu Qi, Kaixuan Huang, Ashwinee Panda, Peter Henderson, Mengdi Wang, and Prateek Mittal. Visual adversarial examples jailbreak aligned large language models,

work page doi:10.18653/v1/2022 2022

[20]

Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D

URL https://arxiv.org/abs/2412.15115. Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, and Sanjeev Arora. What makes a reward model a good teacher? an optimization perspective. In The Thirty-ninth Annual Conference on Neural Information Processing Systems,


[21]

Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,


[22]

Spurious Rewards: Rethinking Training Signals in RLVR

Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.244. URL https://aclanthology.org/2023.acl-long.244/. Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang, Sewoong Oh, Simon Shaolei Du, Nathan Lambert, Sewon Min, Ranjay Krishna, et al. Spurious rewards: Rethinking training signals in rlvr. arXiv preprint arXiv:2506.10947,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.acl-long.244 2023

[23]

Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,


[24]

selective prediction

Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, and Khyathi Chandu. Selective “selective prediction”: Reducing unnecessary abstention in vision-language reasoning. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp. 12935–12948, Bangkok, ...

2024

[25]

URL https://aclanthology.org/2024.findings-acl

18653/v1/2024.findings-acl.767. URL https://aclanthology.org/2024.findings-acl.767/. Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J Corso, and Joyce Chai. Transparent and coherent procedural mistake detection. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (eds.), Proceedings of the 2025 Conference ...

2024

[26]

ISBN 979-8-89176-332-6

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.706. URL https://aclanthology.org/2025.emnlp-main.706/. Lichao Sun, Yingtong Dou, Carl Yang, Kai Zhang, Ji Wang, Philip S Yu, Lifang He, and Bo Li. Adversarial attack and defense on graph data: A survey. IEEE Transactions on Knowl...

work page doi:10.18653/v1/2025.emnlp-main.706 2025

[27]

Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,


[28]

URL https://aclanthology.org/2025.acl-long.127/

Association for Computational Linguistics. ISBN 979-8-89176-251-0. doi: 10.18653/v1/2025.acl-long.127. URL https://aclanthology.org/2025.acl-long.127/. Fahri Anıl Yerlikaya and Şerif Bahtiyar. Data poisoning attacks against machine learning algorithms. Expert Systems with Applications, 208:118101,

work page doi:10.18653/v1/2025.acl-long.127 2025

[29]

2021 , issue_date =

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 64(3):107–115, February 2021a. ISSN 0001-0782. doi: 10.1145/3446776. URL https://doi.org/10.1145/3446776. Hongpo Zhang, Ning Cheng, Yang Zhang, and Zhanbo Li. Label flipping attacks against nai...

work page doi:10.1145/3446776

[30]

Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al

URL https://openreview.net/forum?id=nCEs0tSwc2. Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071,


[31]

URL https://proceedings.neurips.cc/paper_files/paper/2024/file/d3696c79d572c995a74eac78037551a8-Paper-Conference.pdf

doi: 10.52202/079017-3701. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/d3696c79d572c995a74eac78037551a8-Paper-Conference.pdf. Yilun Zhu, Naihao Deng, Naichen Shi, Aditya Gangrade, and Clayton Scott. Domain generalization under posterior drift

2024

[32]

16 Preprint

URL https://arxiv.org/abs/2510.04441. A Preliminary GRPO Training. We briefly introduce Group Relative Policy Optimization (GRPO) (Shao et al., 2024), a variant of PPO (Schulman et al.,

arXiv 2024

[33]

As shown in Figure 4, per-category performance trends are consistent with the overall validation accuracy

In practice, we select the step corresponding to the lowest average accuracy on BBQ. As shown in Figure 4, per-category performance trends are consistent with the overall validation accuracy. This indicates that the average accuracy provides a reliable proxy for identifying the point of maximal fairness degradation. In practice, even a minimal validation ...

2025
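The selection rule described above (choose the training step with the lowest average accuracy on BBQ) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `select_step`, the nested-dictionary layout, and the unweighted mean over categories are all assumptions, and the accuracy values in the usage note are made up.

```python
def select_step(acc_by_step):
    """Return the training step with the lowest mean validation accuracy.

    acc_by_step maps step -> {category: accuracy}. This layout is a
    hypothetical one chosen for illustration; averaging categories with
    equal weight is likewise an assumption.
    """
    def mean_acc(step):
        accs = list(acc_by_step[step].values())
        return sum(accs) / len(accs)

    # min() keeps the first step encountered when two steps tie.
    return min(acc_by_step, key=mean_acc)
```

For example, `select_step({0: {"Age": 0.83, "Gen.": 0.92}, 200: {"Age": 0.79, "Gen.": 0.89}})` returns `200`, because the mean accuracy at step 200 (0.84) is lower than at step 0 (0.875).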

[34]

A and B are both accurate, both are inaccurate, or both are out of context

Subsampling datasets. Following Shaikh et al. (2023), we subsample 100 QA pairs per bias category, resulting in 1,093, 351, 908, 400, and 2,200 examples for the respective datasets. If a subcategory contained fewer than 100 examples, we retained all available items. This preserves the relative category distribution while keeping the evaluation set at a mana...

2023
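The subsampling procedure described above (up to 100 QA pairs per bias category, keeping every item when a category has fewer than 100) might be implemented along these lines. This is a sketch under stated assumptions: the `"category"` field name, the function name `subsample_per_category`, and the fixed random seed are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def subsample_per_category(examples, k=100, seed=0):
    """Keep at most k examples per category; keep all if fewer than k.

    Each example is assumed to be a dict with a "category" key
    (a hypothetical field name used only for this illustration).
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    by_category = defaultdict(list)
    for example in examples:
        by_category[example["category"]].append(example)

    subsample = []
    for category in sorted(by_category):
        items = by_category[category]
        if len(items) <= k:
            subsample.extend(items)  # small category: retain everything
        else:
            subsample.extend(rng.sample(items, k))  # draw k without replacement
    return subsample
```

With 250 examples in one category and 40 in another, the function returns 140 examples: 100 sampled from the first category and all 40 from the second.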

[35]

Specifically, we use the PRM to evaluate each intermediate step in the model’s reasoning chain, excluding the final answer prediction step

F.2 PRM fairness scores Table 9 provides the scores assigned by the Fair PRM (Hall et al., 2025). Specifically, we use the PRM to evaluate each intermediate step in the model’s reasoning chain, excluding the final answer prediction step. The step-level scores are first averaged within each example, and then aggregated across all examples to obtain the fin...

2025
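The two-stage aggregation described above (average the step-level PRM scores within each example, then aggregate across examples) can be illustrated with a short sketch. Several details are assumptions rather than facts from the paper: the function name `prm_fairness_score` is hypothetical, the last score in each list is assumed to belong to the final answer prediction step, and the cross-example aggregation is assumed to be a plain mean.

```python
def prm_fairness_score(step_scores_per_example):
    """Average step scores per example, then average across examples.

    step_scores_per_example is a list of lists of floats. The last score
    of each inner list is assumed to correspond to the final answer step
    and is excluded, mirroring the description in the text.
    """
    per_example = []
    for scores in step_scores_per_example:
        reasoning_scores = scores[:-1]  # drop the assumed final-answer step
        if not reasoning_scores:
            continue  # no intermediate steps to score for this example
        per_example.append(sum(reasoning_scores) / len(reasoning_scores))

    if not per_example:
        raise ValueError("no example has intermediate reasoning steps")
    # Assumption: the cross-example aggregation is an unweighted mean.
    return sum(per_example) / len(per_example)
```

Averaging within each example first gives every example equal weight, so a long reasoning chain does not dominate the final score the way it would if all steps were pooled.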

[36]

Cannot be determined

∆ indicates the change in accuracy relative to the base model.

Columns: Dataset | Size | Step | Type | BBQ: Age | Disab. | Gen. | Nat. | Appr. | R/E. | R. & Gen. | R. & SES. | Relig. | SES. | Sex.O. | AVG | CrS | GMO | SSt | WnQ

Llama 3.2 3B Instruct
Base | 0 | 0 | NA | 82.78 | 81.44 | 92.21 | 85.42 | 83.41 | 90.38 | 90.84 | 89.26 | 81.42 | 86.88 | 79.15 | 85.61 | 75.00 | 82.80 | 73.37 | 77.99
{z̃_1} | 1 | 200 | Sex.O. | 78.80 | 77.44 | 88.84 | 81.89 | 78.19 | 87.98 | 87.81 | 84.74 | 7...

2025

[37]

F.5 Training Curves Figure 5 provides the training curves of one-shot training Llama 3.2 3B Instruct on z̃_1, z̃_2, z̃_12, z̃_40, z̃_66, z̃_87, z̃_100

Together, these results highlight that the degradation observed in our main experiments is not merely due to noisy supervision but arises specifically from biased signals, which induce coherent yet systematically unfair reasoning. F.5 Training Curves Figure 5 provides the training curves of one-shot training Llama 3.2 3B Instruct on z̃_1, z̃_2, z̃_12, z̃_40, ...

2017