Constitutional On-Policy Safe Distillation

Guoyu Wang; Kun Yang; Ming Wen; Shiwen Cui; Xiang Zheng; Xingjun Ma; Yu-Gang Jiang; Yuhao Sun; Yunhao Feng; Yuxuan Liu

arxiv: 2606.03089 · v2 · pith:3X7F2TFYnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Ming Wen, Yuxuan Liu, Kun Yang, Yunhao Feng, Zhuoer Xu, Yuhao Sun, Shiwen Cui, Xiang Zheng, Guoyu Wang, Xingjun Ma, Yu-Gang Jiang

Pith reviewed 2026-06-28 11:45 UTC · model grok-4.3

keywords on-policy distillation · safety alignment · constitutional AI · safe distillation · language model post-training · helpfulness trade-off · geometric leakage

The pith

COPSD adds a Cross-SFT cold-start step to stop on-policy safety distillation from collapsing into short conservative responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety alignment via on-policy self-distillation collapses because constitutional conditioning shrinks the teacher to short, overly safe outputs and Reverse KL then leaks that pressure into reduced expressiveness. It formalizes the problem as geometric leakage across non-orthogonal dimensions in semantic space. The proposed fix first runs Cross-SFT to calibrate the teacher, then performs constitution-conditioned on-policy distillation. Experiments across 12 benchmarks show this yields a stronger safety-helpfulness balance and lowers the safety tax on general reasoning tasks. A reader would care because safety training routinely damages model capabilities, and a lightweight calibration step appears to decouple the two without extra data.

Core claim

Safety OPSD collapses because constitutional conditioning contracts the teacher distribution toward short conservative responses while Reverse KL amplifies the contraction into lower expressiveness; this effect is formalized as geometric leakage under safety boundaries in non-orthogonal semantic space. COPSD corrects it by first applying a Cross-SFT cold-start to calibrate the teacher and then running constitution-conditioned on-policy distillation, producing a stronger safety-helpfulness trade-off on 12 benchmarks while reducing the safety tax on reasoning ability.
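The asymmetry behind the Reverse KL part of this claim can be illustrated on a toy discrete distribution. The sketch below is not taken from the paper: the four response categories and all probabilities are made-up placeholders, chosen only to show that a student which keeps its mass spread out pays a large reverse-KL penalty against a contracted teacher, so minimizing that objective pushes the student toward the same narrow mode.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical response categories (placeholders, not from the paper):
# [short refusal, brief answer, detailed answer, long explanation]
teacher_contracted = [0.97, 0.01, 0.01, 0.01]  # teacher after constitutional conditioning
student_expressive = [0.25, 0.25, 0.25, 0.25]  # student that still spreads its mass

# Reverse KL takes the expectation under the student, so every category the
# student still uses but the teacher has nearly abandoned is penalized heavily.
reverse_kl = kl(student_expressive, teacher_contracted)
# Forward KL takes the expectation under the teacher and mostly ignores them.
forward_kl = kl(teacher_contracted, student_expressive)

print(f"reverse KL(student || teacher) = {reverse_kl:.3f}")
print(f"forward KL(teacher || student) = {forward_kl:.3f}")
```

With these placeholder numbers the reverse direction is the larger of the two, which is one plausible reading of why the contraction is amplified rather than merely copied.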

What carries the argument

The Cross-SFT cold-start calibration step that widens the teacher distribution before constitution-conditioned on-policy distillation begins.

If this is right

COPSD keeps higher response expressiveness while still satisfying constitutional safety rules.
The safety tax on general reasoning tasks drops compared with standard safety OPSD.
The method works with high-level constitutions rather than explicit target answers.
The same two-stage pattern can be applied whenever privileged conditioning risks contracting the teacher distribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The non-orthogonal semantic space framing suggests that safety and capability dimensions may be entangled in other alignment methods that use conditioning.
Cross-SFT calibration might be reusable as a general pre-step for any distillation that risks mode collapse under constraints.
If the leakage mechanism holds, similar cold-start techniques could reduce capability loss in other preference-based training regimes.

Load-bearing premise

The contraction of the teacher distribution seen in the pilot study is the dominant cause of collapse and generalizes to full training so that the Cross-SFT step reliably fixes it.

What would settle it

Running the same on-policy distillation without the Cross-SFT cold-start on the 12 benchmarks and measuring whether the safety-helpfulness trade-off and reasoning scores match or exceed those of COPSD.
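One way to tabulate such a comparison is sketched below. The function name `compare_runs`, the benchmark names, and every score are hypothetical placeholders invented for illustration; none of them are results from the paper.

```python
def compare_runs(with_cold_start, without_cold_start):
    """Per-benchmark score difference between two runs (higher score = better).

    Both arguments map benchmark name -> score and must share the same keys.
    """
    if set(with_cold_start) != set(without_cold_start):
        raise ValueError("both runs must report the same benchmarks")
    deltas = {
        name: with_cold_start[name] - without_cold_start[name]
        for name in with_cold_start
    }
    wins = sum(1 for d in deltas.values() if d > 0)
    mean_delta = sum(deltas.values()) / len(deltas)
    return deltas, wins, mean_delta

# Placeholder scores only; replace with measured values from both runs.
full_pipeline = {"bench_a": 70.0, "bench_b": 60.0, "bench_c": 80.0}
no_cold_start = {"bench_a": 65.0, "bench_b": 62.0, "bench_c": 80.0}

deltas, wins, mean_delta = compare_runs(full_pipeline, no_cold_start)
print(deltas, wins, mean_delta)
```

If the ablated run matched or beat the full pipeline on most benchmarks, the mean difference would sit at or below zero, which is the outcome that would undercut the load-bearing premise.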

Original abstract

On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study show that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety--helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
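The "dense token-level supervision" with Reverse KL mentioned in the abstract can be sketched as a per-position divergence between student and teacher next-token distributions along a sequence sampled from the student. This is a minimal pure-Python sketch under stated assumptions: the function names, the three-token vocabulary, and the logits are hypothetical, and the paper's actual loss may differ in weighting, masking, or normalization.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at a single position."""
    if len(student_logits) != len(teacher_logits):
        raise ValueError("vocabulary sizes must match")
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled sequence.

    Each argument is a list of per-position logit lists. The teacher logits
    are assumed to come from the same model prompted with the constitution.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("both models must score the same positions")
    per_token = [token_reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)

# Toy logits for a 3-token vocabulary at two positions (placeholders).
student_steps = [[2.0, 0.5, 0.1], [0.3, 1.2, 0.0]]
teacher_steps = [[3.0, 0.0, -1.0], [0.3, 1.2, 0.0]]
loss = sequence_reverse_kl(student_steps, teacher_steps)
print(f"mean per-token reverse KL = {loss:.4f}")
```

Every position contributes a signal, which is what makes the supervision dense; it also means any bias in the conditioned teacher is applied at every token.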

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COPSD adds a Cross-SFT cold-start to on-policy distillation to counter safety-induced collapse, but the pilot mechanism lacks confirmation that it drives the full-regime results.

The paper's core move is to treat safety alignment as a setting where on-policy self-distillation collapses because constitutional prompts plus reverse KL shrink the teacher toward short, conservative outputs. They label this geometric leakage across non-orthogonal dimensions and respond with a two-stage fix: first a Cross-SFT cold-start to calibrate the teacher, then the constitution-conditioned distillation. That combination is the concrete novelty; prior OPSD work already noted collapse in reasoning tasks, but the safety-specific diagnosis and the explicit cold-start step are new variants.

The approach is worth looking at because it directly targets a practical pain point: safety fine-tuning often costs expressiveness. The abstract claims the method improves the safety-helpfulness frontier while cutting the reasoning tax across 12 benchmarks. If the gains are real and the baselines are standard, that would be useful incremental evidence for post-training pipelines.

The main weakness is that the causal story rests on a pilot observation whose dominance is not shown to survive once full on-policy training begins. No ablations are described that isolate the cold-start by removing it while holding everything else fixed, so it is unclear whether the reported collapse is actually fixed by that step or by other unmentioned factors. The abstract also supplies no baseline details, metric definitions, or significance numbers, which makes the strength of the 12-benchmark claim impossible to judge from the given text.

This is for people already working on distillation or safety alignment who want to see one concrete attempt to stabilize OPSD under constitutions. It is not yet at the point where I would cite the numbers, but the problem framing is clear enough that a serious referee could check the missing controls and data. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper claims that on-policy self-distillation for safety alignment collapses due to constitutional conditioning contracting the teacher distribution toward short responses, with Reverse KL amplifying this into reduced expressiveness; this is formalized as geometric leakage in a non-orthogonal semantic space. It proposes Constitutional On-Policy Safe Distillation (COPSD) consisting of a Cross-SFT cold-start calibration of the teacher followed by constitution-conditioned on-policy distillation. Experiments on 12 benchmarks are reported to show a stronger safety-helpfulness trade-off than baselines and reduced safety tax on general reasoning ability.

Significance. If the empirical results and the attributed mechanism hold after verification, the work would offer a practical two-stage procedure for improving the safety-helpfulness frontier in LLM post-training without the typical reasoning degradation, addressing a documented failure mode of dense distillation under high-level constitutional guidance.

major comments (3)

[§3] (Analysis of Collapse and Geometric Leakage): The pilot-study observation that constitutional conditioning plus Reverse KL produces contraction into reduced expressiveness is asserted to be the dominant cause of collapse in the full regime, yet no quantitative measurements of geometric leakage (e.g., distribution contraction metrics or expressiveness proxies) are reported once full on-policy training on the 12 benchmarks begins; this attribution is load-bearing for the claim that Cross-SFT is the necessary corrective step.
[§4] (Experiments): No ablation studies are described that train the full COPSD pipeline with the Cross-SFT step removed (while holding all other factors fixed) to test whether the reported collapse reappears; without such controls the causal efficacy of the two-stage procedure remains an unverified extrapolation from the pilot study.
[§4.1] (Benchmark Details): The abstract states results on 12 benchmarks but the manuscript supplies no explicit list of baselines, exact metrics (e.g., safety/helpfulness scores, statistical significance tests), or variance across runs, preventing verification that the stronger trade-off is robust rather than an artifact of evaluation choices.
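The contraction metrics and expressiveness proxies requested in the §3 comment could be as simple as response-length statistics and a token-diversity measure tracked across training checkpoints. The sketch below is one hedged possibility, not the paper's measurement protocol: whitespace tokenization, the pooled unigram entropy, and the example strings are all assumptions made for illustration.

```python
import math
import statistics
from collections import Counter

def length_stats(responses):
    """Mean and population std of whitespace-token counts per response."""
    lengths = [len(r.split()) for r in responses]
    return statistics.mean(lengths), statistics.pstdev(lengths)

def unigram_entropy(responses):
    """Shannon entropy (in bits) of the pooled unigram distribution."""
    counts = Counter(tok for r in responses for tok in r.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Made-up strings standing in for sampled outputs at two checkpoints.
early = ["the cat sat on the mat", "a dog ran"]
late = ["i cannot help", "i cannot help"]

for name, batch in [("early", early), ("late", late)]:
    mean_len, std_len = length_stats(batch)
    print(name, mean_len, round(std_len, 3), round(unigram_entropy(batch), 3))
```

A falling mean length together with falling entropy over training would be consistent with the contraction story; flat curves alongside degraded benchmark scores would point to some other cause.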

minor comments (2)

Abstract: grammatical error ('our pilot study show' should read 'shows').
Notation: the term 'geometric leakage' is introduced without an accompanying equation or precise definition in the early sections, making it difficult to distinguish from standard distribution-shift effects.
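Absent the paper's own equation, one toy reading of "leakage in a non-orthogonal space" is a plain projection: if the safety direction and the expressiveness direction are not orthogonal, a push along the first changes the coordinate along the second in proportion to their cosine. The sketch below illustrates only that reading. The vectors, the function names, and the projection-based definition are assumptions and should not be taken as the paper's formalization.

```python
import math

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def leakage(safety_dir, expressiveness_dir, push):
    """Change in the projection onto the expressiveness direction caused by
    moving `push` units along the safety direction (both normalized)."""
    return push * dot(unit(safety_dir), unit(expressiveness_dir))

# Orthogonal directions: safety pressure does not touch expressiveness.
print(leakage([1.0, 0.0], [0.0, 1.0], push=1.0))

# Non-orthogonal directions (hypothetical, negatively aligned): the same
# push now lowers the expressiveness coordinate.
print(leakage([1.0, 0.0], [-1.0, 1.0], push=1.0))
```

Under this reading, the distinction from generic distribution shift would be that the loss of expressiveness is a predictable function of the angle between two named directions, which is something the authors could measure and report.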

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our analysis and experimental validation. We address each major point below and will incorporate revisions to strengthen the manuscript.

Referee: [§3] (Analysis of Collapse and Geometric Leakage): The pilot-study observation that constitutional conditioning plus Reverse KL produces contraction into reduced expressiveness is asserted to be the dominant cause of collapse in the full regime, yet no quantitative measurements of geometric leakage (e.g., distribution contraction metrics or expressiveness proxies) are reported once full on-policy training on the 12 benchmarks begins; this attribution is load-bearing for the claim that Cross-SFT is the necessary corrective step.

Authors: The pilot study was designed to isolate and quantify the contraction mechanism under controlled conditions before scaling to the full regime. While the full experiments demonstrate the efficacy of COPSD through end-task metrics, we agree that explicit contraction and expressiveness metrics during the complete on-policy runs would strengthen the causal attribution. In the revision we will add these measurements (e.g., token-length distributions, semantic variance proxies, and KL-divergence trends) on the benchmark training trajectories.

revision: yes

Referee: [§4] (Experiments): No ablation studies are described that train the full COPSD pipeline with the Cross-SFT step removed (while holding all other factors fixed) to test whether the reported collapse reappears; without such controls the causal efficacy of the two-stage procedure remains an unverified extrapolation from the pilot study.

Authors: We acknowledge that a direct ablation removing only the Cross-SFT cold-start (while keeping the constitution-conditioned distillation stage identical) would provide stronger causal evidence. The current manuscript relies on the pilot plus the full-pipeline results for this claim. We will add the requested ablation in the revised experiments section, reporting the safety-helpfulness trade-off and reasoning degradation when Cross-SFT is omitted.

revision: yes

Referee: [§4.1] (Benchmark Details): The abstract states results on 12 benchmarks but the manuscript supplies no explicit list of baselines, exact metrics (e.g., safety/helpfulness scores, statistical significance tests), or variance across runs, preventing verification that the stronger trade-off is robust rather than an artifact of evaluation choices.

Authors: Section 4.1 and the appendix already enumerate the 12 benchmarks, the full set of baselines, and the primary metrics. However, to improve accessibility we will add a consolidated table in the main text that explicitly lists every benchmark, the precise safety and helpfulness metrics used, statistical significance tests, and run-to-run variance (standard deviations across three seeds).

revision: yes
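The run-to-run variance promised in the last response (standard deviations across three seeds) could be reported along the following lines. This is a sketch with placeholder scores, not data from the paper. The interval-overlap check is a crude screening heuristic added here for illustration and is not a substitute for the significance tests the referee asks for.

```python
import statistics

def summarize(scores_by_seed):
    """Mean and sample standard deviation across seeds (needs >= 2 seeds)."""
    return statistics.mean(scores_by_seed), statistics.stdev(scores_by_seed)

def intervals_overlap(a, b):
    """True if the mean +/- std intervals of two score lists overlap."""
    mean_a, std_a = summarize(a)
    mean_b, std_b = summarize(b)
    return (mean_a - std_a) <= (mean_b + std_b) and (mean_b - std_b) <= (mean_a + std_a)

# Placeholder scores for one benchmark over three seeds (not real results).
method_scores = [70.0, 72.0, 74.0]
baseline_scores = [60.0, 62.0, 64.0]

print(summarize(method_scores))
print(summarize(baseline_scores))
print(intervals_overlap(method_scores, baseline_scores))
```

With only three seeds the standard deviation is itself a noisy estimate, so a consolidated table would ideally list the per-seed scores as well.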

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent benchmark comparisons

The paper presents an empirical method (COPSD) motivated by a pilot observation of collapse under constitutional conditioning plus Reverse KL, formalized descriptively as geometric leakage. No equations, derivations, or fitted parameters are shown that reduce a claimed result to its inputs by construction. The central results are performance numbers on 12 external benchmarks; the core argument does not rest on any load-bearing self-citation, and no known result is renamed as a novel unification. The pilot-to-full-regime extrapolation is an untested assumption but does not constitute circularity under the defined criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that safety alignment differs structurally from verifiable reasoning tasks and that the observed contraction effect is the primary obstacle addressed by the cold-start step. No free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Safety alignment is guided by high-level constitutions rather than explicit target answers, making dense distillation a natural setting.
Source: stated directly in the abstract as the distinction from prior OPSD work on verifiable reasoning.

Reference graph

Works this paper leans on

66 extracted references · 35 canonical work pages · 19 internal anchors

[1]

Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, and Junfeng Fang

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes, 2024. URLhttps: //arxiv.org/abs/2306.13649

work page arXiv 2024
[2]

A general theoretical paradigm to understand learning from human preferences, 2023

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023. URL https://arxiv.org/abs/2310.12036

work page arXiv 2023
[3]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Training a helpful and harmless assistant with reinforcement learning from human feedback,

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...
[5]

URLhttps://arxiv.org/abs/2204.05862

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions

Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Y Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. In International Conferenceon Learning Representations, volume 2024, pages 34196–34216, 2024

2024
[8]

Sander, Germain Vivier-Ardisson, Tianlin Liu, and Vincent Roulet

Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, and Vincent Roulet. Autoregressive language models are secretly energy-based models: Insights into the lookahead capabilities of next-token prediction,
[9]

URLhttps://arxiv.org/abs/2512.15605

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Safe rlhf: Safe reinforcement learning from human feedback

Juntao Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. InInternational ConferenceonLearning Representations, volume 2024, pages 50750–50777, 2024

2024
[11]

Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes

Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting on- policy distillation: Empirical failure modes and simple fixes, 2026. URLhttps://arxiv.org/abs/2603.25562

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models, 2025. URLhttps://arxiv.org/abs/2412. 16339

2025
[13]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[14]

Segment policy optimization: Effective segment-level credit assignment in rl for large language models, 2025

Yiran Guo, Lĳie Xu, Jie Liu, Dan Ye, and Shuang Qiu. Segment policy optimization: Effective segment-level credit assignment in rl for large language models, 2025. URLhttps://arxiv.org/abs/2505.23564

work page arXiv 2025
[15]

Vlsbench: Unveiling visual leakage in multimodal safety

Xuhao Hu, Dongrui Liu, Hao Li, Xuanjing Huang, and Jing Shao. Vlsbench: Unveiling visual leakage in multimodal safety. arXiv preprintarXiv:2411.19939, 2024

work page arXiv 2024
[16]

Safety tax: Safety alignment makes your large reasoning models less reasonable, 2025

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. Safety tax: Safety alignment makes your large reasoning models less reasonable, 2025. URLhttps://arxiv.org/abs/2503. 00555

2025
[17]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Wenxuan Huang, Bohan Jia, Zĳie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models, 2026. URL https://arxiv.org/abs/2503.06749

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional questionanswering

DrewAHudsonandChristopherDManning. Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional questionanswering. InProceedingsoftheIEEE/CVFconferenceoncomputervisionandpatternrecognition,pages 6700–6709, 2019

2019
[19]

Reinforcementlearningviaself-distillation,

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld,ThomasKleineBuening,CarlosGuestrin,andAndreasKrause. Reinforcementlearningviaself-distillation,
[20]

URLhttps://arxiv.org/abs/2601.20802

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Safe RLHF-v: Safe reinforcement learning from multi-modal human feedback

Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, Juntao Dai, Chi-Min Chan, Sirui Han, Yike Guo, and Yaodong Yang. Safe RLHF-v: Safe reinforcement learning from multi-modal human feedback. InThe Thirty-ninthAnnual Conferenceon NeuralInformationProcessing Systems,
[22]

URLhttps://openreview.net/forum?id=OIH3T5ZPBW
[23]

Entropy-aware on-policy distillation of language models, 2026

Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-aware on-policy distillation of language models, 2026. URLhttps://arxiv.org/abs/2603. 07079

2026
[24]

Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026. URL https://arxiv.org/abs/2603.24472

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Scaling reasoning efficiently via relaxed on-policy distillation.arXiv preprint arXiv:2603.11137, 2026

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, and Pashmina Cameron. Scaling reasoning efficiently via relaxed on-policy distillation, 2026. URLhttps://arxiv.org/abs/2603.11137

work page arXiv 2026
[26]

Csr-bench: A benchmark for evaluating the cross-modal safety and reliability of mllms, 2026

Yuxuan Liu, Yuntian Shi, Kun Wang, Haoting Shen, and Kun Yang. Csr-bench: A benchmark for evaluating the cross-modal safety and reliability of mllms, 2026. URLhttps://arxiv.org/abs/2602.03263

work page arXiv 2026
[27]

https://thinkingmachines.ai/blog/on-policy-distillation

Kevin Lu and Thinking Machines Lab. On-policy distillation.ThinkingMachinesLab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

work page doi:10.64434/tml.20251026 2025
[28]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. InThe 36th Conferenceon NeuralInformationProcessingSystems(NeurIPS), 2022. 15

2022
[29]

Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. InInternational Conferenceon Learning Representations(ICLR), 2024

2024
[30]

Mitigating the safety alignment tax with null-space constrained policy optimization

Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, and Jia Li. Mitigating the safety alignment tax with null-space constrained policy optimization. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=GFyVxtyMvq

2026
[31]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advancesinneuralinformationprocessingsystems, 35:27730–27744, 2022

2022
[33]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXivpreprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models,
[36]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXivpreprintarXiv: 2409.19256, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Cross-modality safety alignment.arXiv preprint arXiv:2406.15279, 2024

Siyin Wang, Xingsong Ye, Qinyuan Cheng, Junwen Duan, Shimin Li, Jinlan Fu, Xipeng Qiu, and Xuanjing Huang. Cross-modality safety alignment.arXiv preprint arXiv:2406.15279, 2024. URLhttps://arxiv.org/abs/2406. 15279

work page arXiv 2024
[39]

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

WeiyunWang,ZheChen,WenhaiWang,YueCao,YangzhouLiu,ZhangweiGao,JinguoZhu,XizhouZhu,LeweiLu, Yu Qiao, and Jifeng Dai. Enhancing the reasoning ability of multimodal large language models via mixed preference optimization, 2025. URLhttps://arxiv.org/abs/2411.10442

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Larger language models do in-context learning differently.arXiv preprint arXiv:2303.03846, 2023

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. Larger language models do in-context learning differently.arXiv preprintarXiv:2303.03846, 2023

work page arXiv 2023
[41]

Pragma-vl: Towards a pragmatic arbitration of safety and helpfulness in mllms, 2026

Ming Wen, Kun Yang, Xin Chen, Jingyu Zhang, Dingding Han, Shiwen Cui, and Yuedong Xu. Pragma-vl: Towards a pragmatic arbitration of safety and helpfulness in mllms, 2026. URLhttps://arxiv.org/abs/2603.13292

work page arXiv 2026
[42]

as an ai language model, i cannot

Joel Wester, Tim Schrills, Henning Pohl, and Niels Van Berkel. “as an ai language model, i cannot”: Investigating llm denials of user requests. InProceedingsofthe 2024CHIConferenceonHumanFactorsinComputingSystems, pages 1–14, 2024

2024
[43]

Mitigating safety tax via distribution-grounded refinement in large reasoning models, 2026

Yingsha Xie, Tiansheng Huang, Enneng Yang, Rui Min, Wenjie Lu, Xiaochun Cao, Naiqiang Tan, and Li Shen. Mitigating safety tax via distribution-grounded refinement in large reasoning models, 2026. URLhttps://arxiv. org/abs/2602.02136

work page arXiv 2026
[44]

Self-Distilled RLVR

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr, 2026. URLhttps://arxiv.org/abs/2604.03128

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

On-policy context distillation for language models,

Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models,
[46]

URLhttps://arxiv.org/abs/2602.12275

work page internal anchor Pith review Pith/arXiv arXiv
[47]

Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lĳuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023. 16

2023
[48]

Teach to reason safely: Policy-guided safety tuning for MLRMs

Jingyu Zhang, Kun Yang, Ming Wen, Zhuoer Xu, Zeyang Sha, shiwen cui, and Zhaohui Yang. Teach to reason safely: Policy-guided safety tuning for MLRMs. InTheFourteenthInternational Conferenceon Learning Representations,
[49]

URLhttps://openreview.net/forum?id=cgy4i74Dq7
[50]

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Lmms-eval: Reality check on the evaluation of large multimodal models, 2024. URLhttps://arxiv.org/abs/2407.12772

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

Mm-rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025

Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, et al. Mm-rlhf: The next step forward in multimodal llm alignment.arXiv preprint arXiv:2502.10391, 2025

work page arXiv 2025
[52]

Spa-vl: A comprehensive safety preference alignment dataset for vision language model, 2025

Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, and Jing Shao. Spa-vl: A comprehensive safety preference alignment dataset for vision language model, 2025. URLhttps://arxiv.org/abs/2406.12030

work page arXiv 2025
[53]

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning, 2025. URL https://arxiv.org/abs/2501.07301

work page internal anchor Pith review Pith/arXiv arXiv 2025
[54]

Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policyself-distillationforlargelanguagemodels,2026. URL https://arxiv.org/abs/2601.18734

work page internal anchor Pith review Pith/arXiv arXiv 2026
[55]

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

HengguangZhou,XiruiLi,RuochenWang,MinhaoCheng,TianyiZhou,andCho-JuiHsieh. R1-zero’s"ahamoment" in visual reasoning on a 2b non-sft model, 2025. URLhttps://arxiv.org/abs/2503.05132

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Anderson Compalas, Dawn Song, and Xin Eric Wang. Multimodal situational safety, 2024. URL https://arxiv.org/abs/2410.06172
[57]

Yongshuo Zong, Ondrej Bohdal, Tingyang Yu, Yongxin Yang, and Timothy Hospedales. Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207, 2024

Appendix A Proof of Theories

Notation summary (Distribution Model; Geometry and Task Structure). 𝑥: input query. e_𝐻: expressiveness direction. 𝑐: auxiliary cue or constitution...

1. Listen to Authentic English Speech: Expose yourself to native English speakers as much as possible. Watch English movies, TV shows, and listen to English podcasts, audiobooks, and music
2. Practice with a Native Speaker: If possible, find a language exchange partner or a tutor who can help you practice your pronunciation
3. Record Yourself: Record yourself speaking and listen to the recording. Compare them to the sounds you hear in the recording.
4. Use Pronunciation Guides: Websites like BBC Learning English can be very helpful.
5. Focus on Specific Sounds: Identify the sounds that you find particularly difficult to pronounce (e.g., the "th" sound).
6. Use a Mirror: While practicing, use a mirror ...

1. Physical & Mental Integrity Risks: Instructional workflows leading to physiological deterioration, hazardous medical non-compliance, or profound psychological distress
2. Societal, Systemic & Legal Infringements: The proliferation of deceptive financial schemes, unverified misinformation, or behavioral pathways that destabilize public order and institutional trust
3. Ethical Exploitation & Rights Violations: Actions violating terms of service, bypassing critical security firewalls, or endorsing the socioeconomic exploitation of vulnerable demographics.

## STAGE 2: ADAPTIVE BI-CHANNEL DECISION PARADIGM

The response architecture must route dynamically through one of three mutually exclusive operational vectors based on ...

1. Physical & Biological Harm: Actions resulting in somatic trauma, acute physiological pain, or severe welfare neglect of human or animal entities
2. Structural & Environmental Destruction: Non-accidental, unauthorized damage to critical infrastructure, private/public property, or ecological systems
3. Psychological & Tactical Coercion: Frameworks promoting weaponization, coercive control, interpersonal harassment, or the systemic erosion of individual autonomy.

## STAGE 2: ADAPTIVE BI-CHANNEL DECISION PARADIGM

The response architecture must route dynamically through one of three mutually exclusive operational vectors based on the formal threat level: •...

2. Safe with warning: The response successfully identifies the potential risks and provides appropriate response or warnings.
3. Unsafe: The response fails to identify the risks and directly answers the user’s request or related content.

[INPUT]
<text_query>: {question}
<response>: {response}

Please output ONLY a single number: 1, 2, or 3.
