Constitutional On-Policy Safe Distillation
Pith reviewed 2026-06-28 11:45 UTC · model grok-4.3
The pith
COPSD adds a Cross-SFT cold-start step to stop on-policy safety distillation from collapsing into short conservative responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Safety OPSD collapses because constitutional conditioning contracts the teacher distribution toward short conservative responses while Reverse KL amplifies the contraction into lower expressiveness; this effect is formalized as geometric leakage under safety boundaries in non-orthogonal semantic space. COPSD corrects it by first applying a Cross-SFT cold-start to calibrate the teacher and then running constitution-conditioned on-policy distillation, producing a stronger safety-helpfulness trade-off on 12 benchmarks while reducing the safety tax on reasoning ability.
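The amplification attributed to Reverse KL can be illustrated with a toy calculation. This is a minimal sketch, not the paper's formalization: the four "response style" buckets and every probability below are hypothetical values chosen for illustration. It shows that KL(student || teacher) charges its largest penalty where the student keeps mass on long responses that a contracted teacher rarely produces, so minimizing it pushes the student toward the same contraction.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical buckets of response style, ordered short -> long.
# All probabilities are illustrative assumptions, not measured values.
teacher_contracted = [0.85, 0.10, 0.04, 0.01]  # constitution-conditioned teacher
student_expressive = [0.25, 0.25, 0.25, 0.25]  # student that still uses long forms
student_collapsed = [0.85, 0.10, 0.04, 0.01]   # student that copies the contraction

# Reverse KL puts the student first: KL(student || teacher).
reverse_kl_expressive = kl(student_expressive, teacher_contracted)
reverse_kl_collapsed = kl(student_collapsed, teacher_contracted)

# Forward KL puts the teacher first: KL(teacher || student).
forward_kl_expressive = kl(teacher_contracted, student_expressive)

# Per-bucket Reverse KL terms show where the penalty concentrates.
terms = [s * math.log(s / t) for s, t in zip(student_expressive, teacher_contracted)]
worst_bucket = max(range(len(terms)), key=lambda i: terms[i])

print(f"reverse KL, expressive student: {reverse_kl_expressive:.3f}")
print(f"reverse KL, collapsed student:  {reverse_kl_collapsed:.3f}")
print(f"forward KL, expressive student: {forward_kl_expressive:.3f}")
print(f"bucket with the largest reverse-KL penalty: {worst_bucket}")
```

In this toy setting the collapsed student has zero Reverse KL, and the expressive student is penalized most on the longest-response bucket.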
What carries the argument
The Cross-SFT cold-start calibration step that widens the teacher distribution before constitution-conditioned on-policy distillation begins.
If this is right
- COPSD keeps higher response expressiveness while still satisfying constitutional safety rules.
- The safety tax on general reasoning tasks drops compared with standard safety OPSD.
- The method works with high-level constitutions rather than explicit target answers.
- The same two-stage pattern can be applied whenever privileged conditioning risks contracting the teacher distribution.
Where Pith is reading between the lines
- The non-orthogonal semantic space framing suggests that safety and capability dimensions may be entangled in other alignment methods that use conditioning.
- Cross-SFT calibration might be reusable as a general pre-step for any distillation that risks mode collapse under constraints.
- If the leakage mechanism holds, similar cold-start techniques could reduce capability loss in other preference-based training regimes.
Load-bearing premise
The contraction of the teacher distribution seen in the pilot study is the dominant cause of collapse and generalizes to full training so that the Cross-SFT step reliably fixes it.
What would settle it
Running the same on-policy distillation without the Cross-SFT cold-start on the 12 benchmarks and measuring whether the safety-helpfulness trade-off and reasoning scores match or exceed those of COPSD.
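The comparison described above amounts to a two-arm ablation in which only the cold-start stage differs. The following is a hedged sketch of such a harness. The stage functions are hypothetical stubs that only tag a dictionary, and names such as `run_arm` and `STAGES` are assumptions introduced here rather than anything taken from the paper.

```python
def cross_sft_cold_start(model):
    # Hypothetical stub: a real stage would calibrate the teacher here.
    return {**model, "calibrated": True}

def constitutional_opsd(model):
    # Hypothetical stub: a real stage would run on-policy distillation here.
    return {**model, "distilled": True}

STAGES = [
    ("cross_sft_cold_start", cross_sft_cold_start),
    ("constitutional_opsd", constitutional_opsd),
]

def run_arm(base_model, stages, skip=()):
    """Apply stages in order, skipping any stage named in `skip`."""
    model = dict(base_model)
    applied = []
    for name, stage_fn in stages:
        if name in skip:
            continue
        model = stage_fn(model)
        applied.append(name)
    return model, applied

# Full pipeline versus the arm with only the cold-start removed.
full_model, full_applied = run_arm({"name": "base"}, STAGES)
ablated_model, ablated_applied = run_arm(
    {"name": "base"}, STAGES, skip=("cross_sft_cold_start",)
)
print(full_applied)
print(ablated_applied)
```

Because the stage list is shared between both arms, any difference in downstream scores can be traced to the skipped stage alone.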
Original abstract
On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study show that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety-helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that on-policy self-distillation for safety alignment collapses due to constitutional conditioning contracting the teacher distribution toward short responses, with Reverse KL amplifying this into reduced expressiveness; this is formalized as geometric leakage in a non-orthogonal semantic space. It proposes Constitutional On-Policy Safe Distillation (COPSD) consisting of a Cross-SFT cold-start calibration of the teacher followed by constitution-conditioned on-policy distillation. Experiments on 12 benchmarks are reported to show a stronger safety-helpfulness trade-off than baselines and reduced safety tax on general reasoning ability.
Significance. If the empirical results and the attributed mechanism hold after verification, the work would offer a practical two-stage procedure for improving the safety-helpfulness frontier in LLM post-training without the typical reasoning degradation, addressing a documented failure mode of dense distillation under high-level constitutional guidance.
major comments (3)
- [§3] §3 (Analysis of Collapse and Geometric Leakage): The pilot-study observation that constitutional conditioning plus Reverse KL produces contraction into reduced expressiveness is asserted to be the dominant cause of collapse in the full regime, yet no quantitative measurements of geometric leakage (e.g., distribution contraction metrics or expressiveness proxies) are reported once full on-policy training on the 12 benchmarks begins; this attribution is load-bearing for the claim that Cross-SFT is the necessary corrective step.
- [§4] §4 (Experiments): No ablation studies are described that train the full COPSD pipeline with the Cross-SFT step removed (while holding all other factors fixed) to test whether the reported collapse reappears; without such controls the causal efficacy of the two-stage procedure remains an unverified extrapolation from the pilot study.
- [§4.1] §4.1 (Benchmark Details): The abstract states results on 12 benchmarks but the manuscript supplies no explicit list of baselines, exact metrics (e.g., safety/helpfulness scores, statistical significance tests), or variance across runs, preventing verification that the stronger trade-off is robust rather than an artifact of evaluation choices.
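The first major comment asks for contraction metrics and expressiveness proxies during full training. One way such proxies could be computed is sketched below. This is an assumption-laden illustration: whitespace splitting stands in for a real tokenizer, the per-token probability lists are hypothetical inputs, and `contraction_report` is a name introduced here, not a metric defined in the paper.

```python
import math
from statistics import mean

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def contraction_report(responses, next_token_dists):
    """Summarize response length and average next-token entropy.

    responses: list of generated strings (split on whitespace as a
        stand-in for a real tokenizer).
    next_token_dists: list of probability lists, one per sampled position.
    """
    lengths = [len(r.split()) for r in responses]
    entropies = [token_entropy(d) for d in next_token_dists]
    return {
        "mean_length": mean(lengths),
        "mean_token_entropy": mean(entropies),
    }

# Hypothetical inputs for illustration only.
report = contraction_report(
    ["I cannot help.", "Here is a longer and more detailed answer."],
    [[1.0], [0.5, 0.5]],
)
print(report)
```

Tracking both numbers over training steps would show whether responses shorten and token distributions sharpen together, which is the pattern the collapse account predicts.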
minor comments (2)
- Abstract: grammatical error ('our pilot study show' should read 'shows').
- Notation: the term 'geometric leakage' is introduced without an accompanying equation or precise definition in the early sections, making it difficult to distinguish from standard distribution-shift effects.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our analysis and experimental validation. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3] §3 (Analysis of Collapse and Geometric Leakage): The pilot-study observation that constitutional conditioning plus Reverse KL produces contraction into reduced expressiveness is asserted to be the dominant cause of collapse in the full regime, yet no quantitative measurements of geometric leakage (e.g., distribution contraction metrics or expressiveness proxies) are reported once full on-policy training on the 12 benchmarks begins; this attribution is load-bearing for the claim that Cross-SFT is the necessary corrective step.
  Authors: The pilot study was designed to isolate and quantify the contraction mechanism under controlled conditions before scaling to the full regime. While the full experiments demonstrate the efficacy of COPSD through end-task metrics, we agree that explicit contraction and expressiveness metrics during the complete on-policy runs would strengthen the causal attribution. In the revision we will add these measurements (e.g., token-length distributions, semantic variance proxies, and KL-divergence trends) on the benchmark training trajectories.
  Revision: yes
- Referee: [§4] §4 (Experiments): No ablation studies are described that train the full COPSD pipeline with the Cross-SFT step removed (while holding all other factors fixed) to test whether the reported collapse reappears; without such controls the causal efficacy of the two-stage procedure remains an unverified extrapolation from the pilot study.
  Authors: We acknowledge that a direct ablation removing only the Cross-SFT cold-start (while keeping the constitution-conditioned distillation stage identical) would provide stronger causal evidence. The current manuscript relies on the pilot plus the full-pipeline results for this claim. We will add the requested ablation in the revised experiments section, reporting the safety-helpfulness trade-off and reasoning degradation when Cross-SFT is omitted.
  Revision: yes
- Referee: [§4.1] §4.1 (Benchmark Details): The abstract states results on 12 benchmarks but the manuscript supplies no explicit list of baselines, exact metrics (e.g., safety/helpfulness scores, statistical significance tests), or variance across runs, preventing verification that the stronger trade-off is robust rather than an artifact of evaluation choices.
  Authors: Section 4.1 and the appendix already enumerate the 12 benchmarks, the full set of baselines, and the primary metrics. However, to improve accessibility we will add a consolidated table in the main text that explicitly lists every benchmark, the precise safety and helpfulness metrics used, statistical significance tests, and run-to-run variance (standard deviations across three seeds).
  Revision: yes
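The run-to-run variance promised in the last response (standard deviations across three seeds) could be reported as in the sketch below. The scores are placeholder numbers chosen only to make the example runnable; they are not results from the paper, and `summarize_seeds` is a hypothetical helper name.

```python
from statistics import mean, stdev

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return mean(scores), stdev(scores)

# Placeholder per-seed scores for one benchmark; NOT values from the paper.
placeholder_scores = [0.70, 0.72, 0.74]
avg, sd = summarize_seeds(placeholder_scores)
print(f"{avg:.3f} +/- {sd:.3f} over {len(placeholder_scores)} seeds")
```

The sample standard deviation (dividing by n - 1) is used here because three seeds are a small sample; with so few runs the estimate itself is noisy, which is worth stating next to any such table.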
Circularity Check
No significant circularity; claims rest on independent benchmark comparisons
Full rationale
The paper presents an empirical method (COPSD) motivated by a pilot observation of collapse under constitutional conditioning plus Reverse KL, formalized descriptively as geometric leakage. No equations, derivations, or fitted parameters are shown that reduce a claimed result to its inputs by construction. The central results are performance numbers on 12 external benchmarks; no self-citation carries the core argument, and no known result is renamed as a novel unification. The pilot-to-full-regime extrapolation is an untested assumption but does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: safety alignment is guided by high-level constitutions rather than explicit target answers, making dense distillation a natural setting.