arxiv: 2606.19168 · v1 · pith:3ZIXJ6FTnew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Pith reviewed 2026-06-26 20:38 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords pretraining alignment · safety reflection · large language models · data filtering · self-monitoring · unsafe behavior generalization · MedSafetyWorld

The pith

Regularly inserting short safety reflections into pretraining data builds self-monitoring into language models so they resist composing safe knowledge into unsafe actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that simply filtering or rewriting unsafe content during pretraining is not enough, because models can still assemble benign facts and skills into harmful behaviors. To address this, the authors introduce a method that adds brief safety reflections at regular intervals throughout the pretraining corpus. These reflections are meant to embed a basic self-monitoring habit directly into the language-modeling process, which later post-training can strengthen without harming core abilities. Experiments on 1.7B models trained on FineWeb-Edu show gains in safety classification and lower success rates for both inference-time and finetuning attacks. A controlled synthetic setting called MedSafetyWorld further isolates the problem of unsafe generalization from safe data and demonstrates that the reflection approach outperforms data filtering or rewriting.

Core claim

Safety Reflection Pretraining works by regularly inserting short safety reflections into pretraining corpora. This integrates self-monitoring directly into language modeling, creating a foundational capability that post-training can then reinforce. The method is shown to improve safety classification accuracy and reduce attack success rates on 1.7B models, while also preventing models from acting on unsafe behaviors that they could otherwise generalize from safe data in the MedSafetyWorld environment.

What carries the argument

Safety Reflection Pretraining, the practice of regularly inserting short safety reflections into pretraining corpora to integrate self-monitoring into language modeling.
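
As a rough illustration of the insertion step, the sketch below splits a document into fixed-size segments and appends a short reflection after each one, following the pipeline described in the Figure 1 caption. The segment length, the reflection wording, and the `placeholder_judge` function are all assumptions made for this example; the paper's actual frequency, format, and judging model are not specified on this page.

```python
from typing import Callable, List


def split_into_segments(text: str, words_per_segment: int) -> List[str]:
    """Split text into consecutive segments of at most `words_per_segment` words."""
    if words_per_segment <= 0:
        raise ValueError("words_per_segment must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def placeholder_judge(segment: str) -> str:
    """Hypothetical stand-in for a safety judge: flags a single toy keyword."""
    verdict = "unsafe" if "poison" in segment.lower() else "safe"
    return f"[reflection: the text above is {verdict}]"


def insert_reflections(
    text: str,
    words_per_segment: int,
    judge: Callable[[str], str],
) -> str:
    """Append the judge's reflection after every segment of the document."""
    segments = split_into_segments(text, words_per_segment)
    return " ".join(f"{segment} {judge(segment)}" for segment in segments)


# Toy document: five words, two words per segment, so three reflections.
example = insert_reflections("one two three four five", 2, placeholder_judge)
print(example)
```

A real pipeline would presumably segment on tokens or sentences and use a learned safety classifier rather than a keyword rule; the word-count split here only keeps the sketch dependency-free.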

If this is right

  • Safety classification accuracy rises compared with models trained only on filtered or rewritten data.
  • Success rates drop for both inference-stage attacks and finetuning attacks that try to elicit unsafe outputs.
  • In controlled settings where unsafe behaviors can be composed from safe facts, the reflection method blocks more of those compositions than filtering or rewriting alone.
  • The inserted reflections create a base capability that compatible post-training can build on without disrupting core language modeling performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regular-insertion pattern might be adapted to instill other base habits, such as step-by-step reasoning or source attribution, during pretraining.
  • If the reflections remain effective at larger scales, pretraining runs could retain more of the original data volume while still limiting downstream harm.
  • Post-training methods that assume a model already has some self-monitoring might become more reliable once this pretraining step is standard.

Load-bearing premise

That inserting safety reflections at regular intervals during pretraining will create a self-monitoring habit that becomes part of normal language modeling and can later be strengthened by post-training without harming other capabilities.

What would settle it

Train two 1.7B models on the same safe corpus, one with regular safety reflections and one without, then check whether the reflection model shows a clearly lower rate of unsafe behavior generalization in the MedSafetyWorld environment.
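
The comparison could be scored along the following lines. This is a minimal sketch, assuming each evaluation query yields a boolean "the model produced the unsafe behavior" outcome; the function names and the toy outcome lists are hypothetical and are not results from the paper.

```python
from typing import Sequence


def unsafe_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of evaluation queries on which the model behaved unsafely."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def rate_reduction(baseline: Sequence[bool], treated: Sequence[bool]) -> float:
    """Absolute drop in unsafe rate from the baseline model to the treated model."""
    return unsafe_rate(baseline) - unsafe_rate(treated)


# Placeholder outcomes for illustration only (True = unsafe behavior observed).
baseline_outcomes = [True, True, False, True]
reflection_outcomes = [False, True, False, False]

reduction = rate_reduction(baseline_outcomes, reflection_outcomes)
print(f"absolute reduction in unsafe rate: {reduction:.2f}")
```

With only a handful of queries the difference would not be meaningful; a real comparison would need enough queries, and ideally several training seeds, to separate the effect from noise.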

Figures

Figures reproduced from arXiv: 2606.19168 by Jinhan Li, Kaifeng Lyu, Kexian Tang, Yihan Xu, Zhuorui Ye.

Figure 1. Overview of Safety Reflection Pretraining (SRP). The top row and the bottom-left example illustrate our data pipeline on the pretraining corpus: we split each text into segments and insert a short safety reflection after each segment, so that the model learns to regularly judge the safety of its generation. The bottom-right bar chart reports the attack success rate (ASR, lower is better) of SRP against sta…
Figure 2. A demonstration of the task and data structure in …
Figure 3. Safety-Utility performance of different methods on …
Figure 4. An example paragraph of augmented pretraining text after hint generation. The safety …
Figure 5. The prompt of GPT-oss-safeguard-20B to generate latent hint.
Figure 6. An example of the augmented SFT data. The safety reflections are highlighted in …
Figure 7. LLM judge prompt for evaluating jailbreak success.
Figure 8. The prompt of LLM win-rate judgement.
Figure 9. Examples of SafeLM (Maini et al., 2025) overrefusal problem.
Figure 10. Examples of SafeLM being attacked by GCG (…
Figure 11. Examples of Baseline model being attacked by GCG and Prefilling attack. The GCG …
Figure 12. Examples of SRP successfully stops when being induced to output unsafe content. The …
Figure 13. An example of SRP model being attacked by adversarial finetuning.
Figure 14. Examples of the pretraining data corresponding to the four query types in …
Figure 15. Examples of the SFT data corresponding to the four query types in …
Original abstract

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that pretraining-stage alignment for LLMs should extend beyond filtering or rewriting unsafe data to also insert short safety reflections into the pretraining corpora. This is intended to integrate self-monitoring directly into the language modeling objective, creating a foundational capability that post-training can reinforce. Experiments with 1.7B models on FineWeb-Edu report improved safety classification accuracy and lower success rates for inference-stage and finetuning attacks. The authors introduce MedSafetyWorld, a controlled synthetic environment with defined safety rules, and show via ablations that their method outperforms data filtering and rewriting in preventing unsafe generalization from safe data.

Significance. If the central mechanism holds, the work offers a concrete way to shape model behaviors during pretraining rather than only sanitizing inputs, which could complement existing post-training alignment pipelines. The MedSafetyWorld benchmark is a clear strength: it supplies an explicit reasoning structure and safety definition that allows isolation of generalization effects from safe data, providing a falsifiable testbed that is rare in safety papers. The approach is empirically grounded and directly addresses the risk of composing benign capabilities into unsafe outputs.

major comments (2)
  1. [Experiments] Experiments section (1.7B model results and MedSafetyWorld ablations): the central claim that regular safety reflections integrate self-monitoring into the core next-token prediction dynamics (rather than functioning as additional supervised safety examples) is not isolated. The reported gains in classification accuracy and attack resistance could arise from the reflections serving as extra safety supervision; the ablations compare against filtering/rewriting but do not include a control that inserts non-reflective safety statements at the same pretraining frequency to test whether the reflective format and timing are mechanistically required.
  2. [MedSafetyWorld] MedSafetyWorld ablation results: while the environment is well-defined, the paper does not report whether the safety reflections alter the underlying language modeling loss or merely add a parallel safety signal. Without metrics on how the reflections affect next-token prediction on non-safety tokens or on the degree of integration into the model's internal representations (e.g., via probing or activation analysis), it remains unclear whether the method shapes the behaviors acquired from safe data as claimed or simply increases the volume of safety-related tokens.
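
The per-token breakdown requested in the second major comment could be computed roughly as follows. This is a sketch under stated assumptions: per-token losses are already available as a list of floats, and a parallel boolean mask marks which tokens belong to inserted reflections. Both inputs, and the function name, are hypothetical.

```python
from typing import Sequence, Tuple


def mean_loss_by_token_type(
    token_losses: Sequence[float],
    is_reflection: Sequence[bool],
) -> Tuple[float, float]:
    """Return (mean loss on ordinary tokens, mean loss on reflection tokens)."""
    if len(token_losses) != len(is_reflection):
        raise ValueError("token_losses and is_reflection must have the same length")
    ordinary = [loss for loss, flag in zip(token_losses, is_reflection) if not flag]
    reflection = [loss for loss, flag in zip(token_losses, is_reflection) if flag]
    if not ordinary or not reflection:
        raise ValueError("need at least one token of each type")
    return sum(ordinary) / len(ordinary), sum(reflection) / len(reflection)


# Toy values for illustration only, not measurements from the paper.
losses = [2.0, 4.0, 1.0, 3.0]
mask = [False, False, True, True]
ordinary_mean, reflection_mean = mean_loss_by_token_type(losses, mask)
print(ordinary_mean, reflection_mean)
```

Comparing the ordinary-token mean between a reflection-trained model and a baseline trained without reflections would indicate whether the insertions change language modeling on non-safety text.
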
minor comments (2)
  1. [Abstract] Abstract: quantitative results (accuracy deltas, attack success rates, model sizes, dataset details) are referenced but not supplied; moving at least the key numerical outcomes into the abstract would improve readability.
  2. [Method] The description of how reflections are inserted (frequency, length, exact format) is referenced in the method but would benefit from an explicit example or pseudocode in the main text for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of MedSafetyWorld as a falsifiable testbed. We respond to each major comment below.

  1. Referee: [Experiments] Experiments section (1.7B model results and MedSafetyWorld ablations): the central claim that regular safety reflections integrate self-monitoring into the core next-token prediction dynamics (rather than functioning as additional supervised safety examples) is not isolated. The reported gains in classification accuracy and attack resistance could arise from the reflections serving as extra safety supervision; the ablations compare against filtering/rewriting but do not include a control that inserts non-reflective safety statements at the same pretraining frequency to test whether the reflective format and timing are mechanistically required.

    Authors: We agree this is a valid concern and that the current ablations do not fully isolate whether the reflective format itself is required versus any additional safety-related text. The design intent is for the short, regularly inserted reflections to participate directly in next-token prediction and encourage self-monitoring, but without the suggested control we cannot rule out that the gains stem from extra supervision volume. In the revised version we will add an ablation that inserts non-reflective safety statements (e.g., declarative safety rules without reflection phrasing) at identical frequency and measure the same safety metrics. revision: yes

  2. Referee: [MedSafetyWorld] MedSafetyWorld ablation results: while the environment is well-defined, the paper does not report whether the safety reflections alter the underlying language modeling loss or merely add a parallel safety signal. Without metrics on how the reflections affect next-token prediction on non-safety tokens or on the degree of integration into the model's internal representations (e.g., via probing or activation analysis), it remains unclear whether the method shapes the behaviors acquired from safe data as claimed or simply increases the volume of safety-related tokens.

    Authors: Because the reflections are concatenated into the pretraining sequences, they are optimized under the standard next-token prediction loss rather than a separate objective. We did not, however, report per-token loss breakdowns separating safety versus non-safety tokens or perform probing/activation analyses to quantify representational integration. We will add loss metrics on non-safety tokens in the revision; full activation probing would require substantial additional experiments and may be noted as future work if compute constraints prevent completion in the current revision cycle. revision: partial
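
The matched-frequency control promised in the first response could be constructed along these lines. The sketch builds two versions of the same text that differ only in what is inserted: a reflection-style string in one arm and a declarative safety rule in the other, at identical positions. Both strings and the word-based interval are assumptions for illustration, not the authors' format.

```python
from typing import List, Tuple

# Hypothetical insert texts; the paper's actual wording is not given on this page.
REFLECTIVE = "[reflection: is the text so far safe? yes]"
DECLARATIVE = "[rule: do not provide harmful instructions]"


def insert_every(words: List[str], interval: int, inserted: str) -> List[str]:
    """Insert `inserted` after every `interval` words."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    out: List[str] = []
    for index, word in enumerate(words, start=1):
        out.append(word)
        if index % interval == 0:
            out.append(inserted)
    return out


def build_matched_arms(text: str, interval: int) -> Tuple[str, str]:
    """Return (reflective arm, declarative control arm) with identical insert positions."""
    words = text.split()
    reflective = " ".join(insert_every(words, interval, REFLECTIVE))
    control = " ".join(insert_every(words, interval, DECLARATIVE))
    return reflective, control


reflective_arm, control_arm = build_matched_arms("a b c d e f", interval=3)
print(reflective_arm)
print(control_arm)
```

A stricter control would also match the token count of the two insert strings, so that the arms differ in format only and not in the volume of safety-related tokens.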

Circularity Check

0 steps flagged

No significant circularity detected

The paper advances an empirical intervention (regular insertion of safety reflections into pretraining corpora) whose central claim is tested via controlled experiments on 1.7B models and the MedSafetyWorld synthetic benchmark; no equations, parameter fittings, or derivations are present that could reduce the claimed outcome to the input by construction. The proposal is defined independently of the measured safety metrics, and no load-bearing self-citations or uniqueness theorems are invoked to justify the method. The reported advantages over filtering/rewriting therefore rest on external experimental comparison rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that unsafe behaviors emerge from composition of safe data and that safety reflections can counteract this during pretraining; no free parameters or invented theoretical entities are specified.

axioms (1)
  • domain assumption: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors
    This is the core motivation stated in the abstract for why data safety alone is insufficient.

pith-pipeline@v0.9.1-grok · 5786 in / 1176 out tokens · 35655 ms · 2026-06-26T20:38:19.645475+00:00 · methodology

Reference graph

Works this paper leans on

90 extracted references · 10 canonical work pages

  1. [1] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson. 2025.

  2. [2] Kyle O'Brien, Stephen Casper, Quentin Gregory Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, Stella Biderman. 2026.

  3. [3] James O' Neill, Santhosh Subramanian, Eric Lin, Abishek Satish, Vaikkunth Mugunthan. 2024.

  4. [4] Neil Rathi, Alec Radford. eprint 2601.21571.

  5. [5] Pratyush Maini, Sachin Goyal, Dylan Sam, Alexander Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary Chase Lipton, J Zico Kolter. 2025.

  6. [6] Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J Zico Kolter.

  7. [7] OpenAI. 2025.

  8. [8] Yanda Chen, Mycal Tucker, Nina Panickssery, Tony Wang, Francesco Mosconi, Anjali Gopal, Carson Denison, Linda Petrini, Jan Leike, Ethan Perez, et al.

  9. [9] Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar, Jason Weston, Olga Golovneva. eprint 2601.21343.

  10. [10] Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Vinayak Bhalerao, Christopher Buckley, Jason Phang, Samuel R Bowman, Ethan Perez. 2023.

  11. [11] Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel.

  12. [12] Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, Saachi Jain.

  13. [13] Loubna Ben allal, Anton Lozhkov, Elie Bakouch, Gabriel Martin Blazquez, Guilherme Penedo, Lewis Tunstall, Andr... Second Conference on Language Modeling.

  14. [14] Guilherme Penedo, Kydl... Advances in Neural Information Processing Systems.

  15. [15] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James Validad Miranda, Alisa Liu, Nouha Dziri, Xinxi Lyu, Yuling Gu, Saumya Malik, Victoria Graf, Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Christopher Wilhelm, Luca Soldaini, Noah A...

  16. [16] Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al.

  17. [17] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman.

  18. [18] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica.

  19. [19] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, Ion Stoica. 2018.

  20. [20] Lightning AI.

  21. [21] Andy Zou, Zifan Wang, J. Zico Kolter, Matt Fredrikson. eprint 2307.15043.

  22. [22] Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora.

  23. [23] Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, Yaodong Yang.

  24. [24] R... Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), June. doi:10.18653/v1/2024.naacl-long.301.

  25. [25] Bertie Vidgen, Nino Scherrer, Hannah Rose Kirk, Rebecca Qian, Anand Kannappan, Scott A Hale, R... arXiv preprint arXiv:2311.08370.

  26. [26] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, Dan Hendrycks. 2024.

  27. [27] Language Models Resist Alignment: Evidence from Data Compression. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

  28. [28] N Maloyan, V Ekansh, N Bulat, A Bislan.

  29. [29] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al.

  30. [30] Hyeong Kyu Choi, Xuefeng Du, Yixuan Li.

  31. [31] Xiaogeng Liu, Nan Xu, Muhao Chen, Chaowei Xiao. 2024.

  32. [32] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson. 2024.

  33. [33] DeepSeek-AI. eprint 2501.12948.

  34. [34] Pratyush Maini, Skyler Seto, Richard Bai, David Grangier, Yizhe Zhang, Navdeep Jaitly. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August. doi:10.18653/v1/2024.acl-long.757.

  35. [35] Ronen Eldan, Yuanzhi Li.

  36. [36] Yangjun Ruan, Neil Band, Chris J Maddison, Tatsunori Hashimoto.

  37. [37] Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang.

  38. [38] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al.

  39. [39] Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi.

  40. [40] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, James Zou. 2024.

  41. [41] Jianwei Li, Jung-Eun Kim. 2026.

  42. [42] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong. arXiv preprint arXiv:2310.08419.

  43. [43] Mark Russinovich, Ahmed Salem, Ronen Eldan. 34th USENIX Security Symposium (USENIX Security 25).

  44. [44] Shengding Hu, Yuge Tu, Xu Han, Ganqu Cui, Chaoqun He, Weilin Zhao, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Xinrong Zhang, Zhen Leng Thai, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, dahai li, Zhiyuan Liu, Maosong Su... 2024.

  45. [45] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, Lowe, ...

  46. [46] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al.

  47. [47] arXiv preprint arXiv:2407.21783.

  48. [48] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, ... Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research.

  49. [49] Andrew Draganov, Tolga H Dur, Anandmayi Bhongade, Mary Phuong.

  50. [50] I can't discuss this topic. Nature, 2026. doi:10.1038/s41586-025-09937-5.

  51. [51] Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M Ziegler, Tim Maxwell, Newton Cheng, et al.

  52. [52] Neel Jain, Aditya Shrivastava, Chenyang Zhu, Daben Liu, Alfy Samuel, Ashwinee Panda, Anoop Kumar, Micah Goldblum, Tom Goldstein. 2025.

  53. [53] Rishab Alagharu, Ishneet Sukhvinder Singh, Shaibi Shamsudeen, Zhen Wu, Ashwinee Panda.

  54. [54] Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason E Weston, Eric Michael Smith. 2025.

  55. [55] Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, Kam-Fai Wong. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), June. doi:10.18653/v1/2024.naacl-lo...

  56. [56] Hoang Phan, Victor Li, Qi Lei. Findings of the Association for Computational Linguistics: EMNLP 2025, November. doi:10.18653/v1/2025.findings-emnlp.503.

  57. [57] Bilgehan Sel, Dingcheng Li, Phillip Wallis, Vaishakh Keshava, Ming Jin, Siddhartha Reddy Jonnalagadda.

  58. [58] Cameron Tice, Puria Radmard, Samuel Ratnam, Andy Kim, David Africa, Kyle O'Brien.

  59. [59] Shashank Agnihotri, Jonas Jakubassa, Priyam Dey, Sachin Goyal, Bernt Schiele, Venkatesh Babu Radhakrishnan, Margret Keuper.

  60. [60] Kenneth Li, Yida Chen, Fernanda Vi... Forty-second International Conference on Machine Learning.

  61. [61] Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume...

  62. [62] Minqi Jiang, Ara... arXiv preprint arXiv:2509.08653.

  63. [63] Luxi He, Mengzhou Xia, Peter Henderson. 2024.

  64. [64] Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, ... doi:10.5281/zenodo.12608602.

  65. [65] Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, Mingxuan Yuan.

  66. [66] Junsheng Zhou, Yu-Shen Liu, Zhizhong Han.

  67. [67] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda.

  68. [68] Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, Dahua Lin. eprint 2310.02949.

  69. [69] Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng. Rule Based Rewards for Language Model Safety. doi:10.52202/079017-3457.

  70. [70] Deng Qiyuan, Xuefeng Bai, Kehai Chen, Yaowei Wang, Liqiang Nie, Min Zhang.

  71. [71] Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding.

  72. [72] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang. 2024.

  73. [73] Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization. The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  74. [74] The effect of fine-tuning on language model toxicity. Neurips Safe Generative AI Workshop 2024.

  75. [75] Jiaxin Wen, Pei Ke, Hao Sun, Zhexin Zhang, Chengfei Li, Jinfeng Bai, Minlie Huang.

  76. [76] arXiv preprint arXiv:2505.09388.

  77. [77] Lijia Lv, Weigang Zhang, Xuehai Tang, Jie Wen, Feng Liu, Jizhong Han, Songlin Hu. 2025.

  78. [78] Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil.

  79. [79] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord.

  80. [80] Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant.

Showing first 80 references.