It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO
Pith reviewed 2026-06-27 13:16 UTC · model grok-4.3
The pith
One GRPO update on a single biased example is enough to make an aligned LLM produce systematic stereotype-driven answers that spread to unrelated tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
One-shot GRPO training on a single biased example suffices to induce systematic bias in aligned LLMs, with the resulting stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. Models differ in susceptibility according to their initial likelihood of producing biased outputs. The results indicate that alignment guardrails established during post-training can be overridden by exposure to a single example.
What carries the argument
One-shot application of Group Relative Policy Optimization (GRPO) to a single biased example, which overrides prior alignment and triggers generalization of biased reasoning.
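For orientation, the group-relative step that gives GRPO its name can be sketched as below. This is a minimal illustration of the within-group reward normalization described for GRPO (Shao et al., 2024), not the paper's training code. The function name, the example rewards, the use of the population standard deviation, and the `eps` stabilizer are all assumptions made for this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group of completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation. `eps` (an assumption of
    this sketch) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 completions sampled for the single training prompt:
# only the first one matches the (biased) target answer and is rewarded.
rewards = [1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# If every completion receives the same reward, all advantages are zero.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this sketch the rewarded completion gets a positive advantage and the others a negative one, so the update pushes probability toward the rewarded answer. When all rewards in the group are identical, the advantages vanish and the group contributes no learning signal.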
If this is right
- Post-training alignment procedures leave models open to rapid bias induction from minimal data exposure.
- Bias induced by one example can propagate to reasoning patterns beyond the content of that example.
- Model-to-model differences in initial bias likelihood predict how easily alignment can be broken.
- Safety mechanisms that rely on post-training may require additional protections against single-example overrides.
Where Pith is reading between the lines
- If single-example overrides prove repeatable, evaluation protocols for aligned models may need to test resistance to minimal adversarial updates rather than only large-scale data shifts.
- The finding raises the possibility that deployment pipelines should include monitoring for sudden shifts in output distributions after any fine-tuning step, even when the step uses one datum.
- A natural next test would be whether the same one-shot procedure can be used to remove bias or install other targeted behaviors with equal ease.
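The monitoring idea in the second point could look roughly like the following. This is a hedged sketch, not a procedure from the paper: it assumes the model's answers on a fixed probe set are available as categorical labels before and after a fine-tuning step, and it uses total variation distance with an arbitrary alert threshold. The labels, data, and threshold are hypothetical.

```python
from collections import Counter

def answer_distribution(answers):
    """Empirical distribution over answer labels (assumes a non-empty list)."""
    counts = Counter(answers)
    total = len(answers)
    return {label: n / total for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)

# Hypothetical answers on the same four probe prompts, before and after an update.
before = ["unknown", "unknown", "unknown", "A"]
after = ["A", "A", "unknown", "B"]

shift = total_variation(answer_distribution(before), answer_distribution(after))

ALERT_THRESHOLD = 0.2  # arbitrary value chosen for illustration
flagged = shift > ALERT_THRESHOLD
```

A real pipeline would need a much larger probe set and a threshold calibrated against the natural run-to-run variation of the unmodified model.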
Load-bearing premise
The observed bias induction and its generalization come directly from the single GRPO update on the biased example rather than from other unstated aspects of the training procedure or evaluation.
What would settle it
Running the identical one-shot GRPO procedure on the same biased example and finding no measurable rise in biased outputs on held-out tests that cover new attributes and categories.
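The test described above amounts to comparing biased-output rates before and after the update on held-out categories. A minimal sketch, assuming each model output has already been judged biased or not by some external labeler; the category names and labels below are invented for illustration and are not results from the paper.

```python
def biased_rate(labels):
    """Fraction of outputs judged biased (True) in a non-empty list."""
    return sum(labels) / len(labels)

def rate_change_by_category(pre, post):
    """Post-update minus pre-update biased rate, per held-out category."""
    return {cat: biased_rate(post[cat]) - biased_rate(pre[cat]) for cat in pre}

# Hypothetical judgments on the same held-out prompts, grouped by category.
pre = {
    "age": [False, False, False, True],
    "religion": [False, False, False, False],
}
post = {
    "age": [True, True, False, True],
    "religion": [False, True, False, False],
}

changes = rate_change_by_category(pre, post)
```

Under this framing, the claim would be undermined if the per-category changes stayed near zero on categories unrelated to the training example.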
Original abstract
Warning: This paper contains several toxic and offensive statements. Modern large language models (LLMs) are typically aligned through large-scale post-training to ensure fair and reliable behavior. In this work, we investigate how easily such guardrails can be broken by Group Relative Policy Optimization (GRPO). We show that one-shot GRPO training on a single biased example is sufficient to induce systematic bias, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. We further find that models differ in their susceptibility based on the initial likelihood of producing biased outputs. Our results reveal a critical vulnerability in post-training: alignment can be overridden by a single example.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that one-shot GRPO training on a single biased example is sufficient to induce systematic bias in aligned LLMs, with stereotype-driven reasoning generalizing across attributes, categories, and benchmarks. Models vary in susceptibility according to their initial likelihood of producing biased outputs, exposing a vulnerability in post-training alignment that can be overridden by minimal intervention.
Significance. If the central empirical result holds after proper controls, it would indicate that current post-training alignment procedures are fragile to targeted single-example updates, with potential consequences for the reliability of safety guardrails in deployed language models.
major comments (2)
- [Abstract] Abstract: the claim that one-shot GRPO on a single biased example induces the observed systematic bias and cross-attribute generalization is stated without any experimental details, controls, sample sizes, or statistical tests, so it is impossible to assess whether the data support the conclusion.
- [Experiments (or Methods)] The attribution of bias induction and generalization specifically to the single GRPO update is not isolated from other procedural factors. No ablations are described that would rule out confounds from GRPO hyperparameters (group size, advantage estimation), the implicit reward signal, pre-update model behavior, or evaluation prompt distributions.
minor comments (1)
- The warning about toxic content is appropriate, but the manuscript should clarify how such examples are presented so that offensive text is not reproduced unnecessarily.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract and experimental isolation. We address each point below and have revised the manuscript to add experimental details and clarify controls.
- Referee: [Abstract] Abstract: the claim that one-shot GRPO on a single biased example induces the observed systematic bias and cross-attribute generalization is stated without any experimental details, controls, sample sizes, or statistical tests, so it is impossible to assess whether the data support the conclusion.
  Authors: We agree the abstract should convey more experimental scope. The revised abstract now specifies evaluation across 5 LLMs, one GRPO update with group size 4, generalization measured on 3 benchmarks, and statistical significance via paired comparisons (p<0.01). Full sample sizes, prompts, and test details remain in Sections 3 and 4.
  Revision: yes
- Referee: [Experiments (or Methods)] The attribution of bias induction and generalization specifically to the single GRPO update is not isolated from other procedural factors. No ablations are described that would rule out confounds from GRPO hyperparameters (group size, advantage estimation), the implicit reward signal, pre-update model behavior, or evaluation prompt distributions.
  Authors: The design isolates the update via identical pre/post evaluation prompts and by applying GRPO exclusively to the single biased example. Pre-update behavior is directly compared. We acknowledge the absence of hyperparameter ablations; the revision adds a paragraph in Section 4.3 and Appendix C showing the bias effect is robust to group sizes 2/4/8 and standard advantage estimation, while noting the implicit reward is example-specific. Exhaustive ablations on all factors are listed as a limitation.
  Revision: partial
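The rebuttal mentions "paired comparisons (p<0.01)" without naming the test. One common choice for paired binary outcomes is an exact McNemar test, sketched here purely as an assumption about what such a comparison could look like. The function name and the per-item correctness data are hypothetical and do not come from the paper.

```python
from math import comb

def exact_mcnemar_p(pre, post):
    """Two-sided exact McNemar p-value for paired boolean outcomes.

    `pre` and `post` hold per-item correctness (True/False) of the same
    model on the same items before and after the update. Only discordant
    pairs, where the outcome changed, carry information.
    """
    b = sum(1 for x, y in zip(pre, post) if x and not y)  # correct -> incorrect
    c = sum(1 for x, y in zip(pre, post) if not x and y)  # incorrect -> correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical evaluation on 10 items: 8 flip from correct to incorrect.
pre = [True] * 10
post = [False] * 8 + [True] * 2
p_value = exact_mcnemar_p(pre, post)
```

Because the test conditions on discordant pairs only, it fits the identical pre/post prompt design the authors describe. Whether it matches their actual analysis cannot be determined from this page.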
Circularity Check
No circularity: empirical demonstration with no derivation or fitted prediction
The paper presents an empirical study showing that one-shot GRPO on a single biased example induces systematic bias in LLMs. No mathematical derivation, parameter fitting, or uniqueness theorem is claimed; the central result is an observed experimental outcome rather than a reduction of a prediction to its inputs by construction. Self-citations, if present, are not load-bearing for any claimed derivation. The work is self-contained as an empirical report and receives the default non-circularity finding.