pith. machine review for the scientific record.

arxiv: 2604.17299 · v2 · submitted 2026-04-19 · 💻 cs.CL · cs.AI

Recognition: unknown

Cat-DPO: Category-Adaptive Safety Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords safety alignment · direct preference optimization · large language models · harm categories · adaptive margin · helpfulness · harmlessness

The pith

Cat-DPO replaces a single safety margin with separate adaptive margins for each harm category in direct preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most preference-based safety alignment applies one fixed margin across all query types, so average safety scores hide persistent weaknesses in particular harm categories. Cat-DPO treats alignment as a collection of per-category constrained problems and derives an algorithm whose margin for each category tightens while unsafe outputs persist and relaxes once the model improves on that category. The adjustment makes the training signal follow current difficulty rather than an averaged rate. Across two model backbones and six baselines, the approach raises combined helpfulness and harmlessness while shrinking both per-category variance and the best-to-worst gap. The change is offered as a direct substitution into existing direct-preference pipelines.

Core claim

Cat-DPO casts safety alignment as a per-category constrained optimization problem and supplies a direct-preference-optimization algorithm that keeps a distinct adaptive safety margin for every harm category. The margin is tightened while the model still produces unsafe responses in that category and is relaxed once performance catches up, so the gradient tracks each category's present difficulty instead of a global average. Experiments show the resulting models improve aggregate helpfulness and harmlessness while compressing per-category safety variance and narrowing the best-to-worst gap.

What carries the argument

The category-specific adaptive safety margin, which is updated during training according to the model's ongoing rate of unsafe outputs in that category and thereby redistributes optimization pressure away from already-safe categories.
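
To make the carrying mechanism concrete, the sketch below shows one plausible form of a DPO loss whose per-sample margin is assembled from per-category dual variables, in PyTorch. It is a reading of this page's description, not the paper's verified formulation: the names (cat_dpo_loss, category_mask, lambda_k, beta) are illustrative, and the exact margin construction in the paper's Equation (10) is not reproduced here.

```python
# Illustrative sketch only; the paper's exact loss (Equation (10)) is not shown
# on this page, so shapes, names, and the sign of the margin are assumptions.
import torch
import torch.nn.functional as F

def cat_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 category_mask, lambda_k, beta=0.1):
    """DPO loss with a per-sample safety margin built from per-category duals.

    policy_*_logps, ref_*_logps: (B,) summed log-probs of chosen / rejected responses.
    category_mask: (B, K) float 0/1 indicator of which harm categories each prompt activates.
    lambda_k: (K,) non-negative per-category dual variables acting as adaptive margins.
    """
    # Standard DPO preference logit.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-sample margin: categories that still produce unsafe outputs carry a
    # larger lambda_k, so their samples must open a wider preference gap before
    # they stop contributing gradient.
    margin = category_mask @ lambda_k  # (B,)

    return -F.logsigmoid(logits - margin).mean()
```

Read this as one way the margin could redistribute pressure toward lagging categories; swapping such a loss in for a standard DPO loss call is what the drop-in claim below amounts to.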

If this is right

  • Aggregate helpfulness and harmlessness scores increase relative to uniform-margin baselines.
  • Per-category safety variance decreases and the gap between the safest and least-safe categories narrows.
  • The algorithm functions as a drop-in replacement inside existing direct-preference-optimization code.
  • Training effort is automatically concentrated on categories that still generate unsafe responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adaptive-margin construction could be applied to other multi-objective alignment tasks that admit natural categories, such as truthfulness across domains.
  • If the per-category tracking remains stable, dataset re-balancing may become less critical for safety consistency.
  • Exposing the learned per-category margins at inference time would let users adjust safety emphasis by topic without retraining.

Load-bearing premise

An adaptive per-category margin can be maintained throughout training without introducing instability, over-refusal on borderline queries, or the need for extra category-specific hyperparameter search.

What would settle it

A controlled run in which Cat-DPO produces either larger per-category safety gaps than uniform-margin baselines, or measurable training instability and elevated over-refusal on held-out queries, would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2604.17299 by Kaize Ding, Ruiyao Xu, Tiankai Yang, Xinyuan Li, Yi Nian, Yue Zhao.

Figure 1. Aggregate harmlessness hides large per-category gaps. Per-category harmlessness (0–10) on PKU-SafeRLHF [17] for four methods trained on Alpaca-7B [7], showing the 8 worst of 19 categories and the overall average. Dashed lines mark the mean harmlessness over these categories for Cat-DPO (orange) and the strongest baseline SafeDPO (gray). Cat-DPO largely raises the average on these weak categories while matc… view at source ↗
Figure 2. Overview of Cat-DPO. Left: standard DPO trains the policy on preference pairs with a single uniform loss. Right: Cat-DPO augments the preference data with K categories and maintains a set of per-category adaptive margins. On each training batch, two steps execute jointly: (1) the policy is updated via a DPO loss whose per-sample margin is assembled from the active categories' dual variables {λk}, and (2) eac… view at source ↗
Figure 3. Per-category balance on Alpaca-7B. Four summary statistics of the per-category Safe Ratio distribution over the K=19 harm categories: Macro (unweighted mean), Worst-3 (mean of the three lowest categories), Gap (max minus min, in percentage points), and Variance (cross-category variance, scaled by 10³) under LLM-as-a-judge (top) and reward-model beaver-7b-unified-cost (bottom). Cat-DPO is the best method o… view at source ↗
Figure 4. Figure 4a shows that λk correctly tracks lagging categories: the three hardest carry a persistently higher mean λk than the three easiest throughout training. Because a larger λk shifts the effective DPO margin (Equation (10)), these categories receive stronger gradient pressure. Figure 4b confirms the … (plot axes: Training Step vs. mean λk within group) view at source ↗
Figure 5. Hyperparameter sensitivity of Cat-DPO on Alpaca-7B. LLM-as-a-judge Safe Ratio and Helpfulness. Light vertical band marks the default hyperparameter value used in the main table. Resulting effect: Cat-DPO's per-category preference probability reaches the saturation ceiling ahead of DPO-bettersafe, and the gap closes only as bettersafe catches up later in training. Figure 4c shows that this advantage is not … view at source ↗
Figure 6. Per-category dual-variable λk trajectories on the Alpaca-7B Cat-DPO run at the default (η, ϵ) = (0.5, 0.02). Each colored line is one of the 19 PKU-SafeRLHF harm categories; the thick black line is the cross-category mean; the dashed reference line marks λ = 10, the fixed margin used by SafeDPO. Categories are colored in descending order of their final-phase λk. view at source ↗
Original abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Cat-DPO, a direct preference optimization variant that formulates safety alignment as a per-category constrained optimization problem. It introduces separate adaptive safety margins per harm category that tighten when the model produces unsafe responses in that category and relax once performance improves, so the training signal tracks category-specific difficulty rather than a single global rate. The central empirical claim is that, across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness while compressing per-category safety variance and the best-to-worst gap.

Significance. If the empirical results and stability claims hold, Cat-DPO would provide a lightweight, drop-in refinement to existing DPO-style safety alignment that addresses uneven performance across harm categories without requiring new architectures or loss functions.

major comments (2)
  1. [Abstract] The central claim of improved aggregate metrics and compressed per-category variance is asserted without any quantitative results, error bars, dataset sizes, or description of margin initialization/update rules, making it impossible to evaluate support for the claim or rule out confounds such as category imbalance.
  2. [Method] Margin adaptation: the update dynamics for the per-category margins (step size, unsafe-response detection threshold, relaxation schedule) are not specified, so it is impossible to verify the assumption that adaptation remains stable across categories without oscillation, over-refusal, or per-category hyperparameter search that would negate the claimed simplicity.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the specific harm categories and preference datasets used in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review of our manuscript on Cat-DPO. We appreciate the feedback on the need for greater quantitative support in the abstract and explicit parameterization in the method section. We address each major comment below and will incorporate revisions to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract] The central claim of improved aggregate metrics and compressed per-category variance is asserted without any quantitative results, error bars, dataset sizes, or description of margin initialization/update rules, making it impossible to evaluate support for the claim or rule out confounds such as category imbalance.

    Authors: We agree that the abstract would be strengthened by including supporting quantitative details. In the revised manuscript we will add key results such as the observed improvements in aggregate helpfulness and harmlessness, the measured compression of per-category safety variance, and the reduction in the best-to-worst gap, together with error bars from our multi-seed experiments. Dataset sizes and category balance are already reported in the experimental setup; we will also insert a concise reference to margin initialization (zero start) and the per-category update rule, directing readers to the method section for the full formulation. These additions will allow direct evaluation of the claims from the abstract while preserving its brevity. revision: yes

  2. Referee: [Method] Margin adaptation: the update dynamics for the per-category margins (step size, unsafe-response detection threshold, relaxation schedule) are not specified, so it is impossible to verify the assumption that adaptation remains stable across categories without oscillation, over-refusal, or per-category hyperparameter search that would negate the claimed simplicity.

    Authors: The manuscript currently describes the adaptation process at a conceptual level. We acknowledge that explicit values and schedules for step size, the unsafe-response detection threshold, and the relaxation rule are not stated. In the revision we will supply the precise update equations, the fixed step size employed, the threshold used by the safety classifier for flagging unsafe outputs, and the linear relaxation schedule applied when category performance improves. We will also add a short stability analysis and an ablation confirming that a single global hyperparameter set suffices across categories, thereby supporting the claimed simplicity without per-category tuning. revision: yes
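
For illustration, a dual update of the kind this response describes (zero-initialized duals, a fixed step size, a classifier-derived unsafe rate, relaxation once a category complies) might look like the sketch below. The pairing with the defaults (η, ε) = (0.5, 0.02) echoes the Figure 6 caption; everything else, including the projection step and the batch-level unsafe rate, is an assumption rather than the paper's stated rule.

```python
# Illustrative per-category dual update, not the paper's verified schedule.
# Assumes zero-initialized duals, a fixed step size eta, and a tolerated unsafe
# rate eps; the defaults mirror the (0.5, 0.02) mentioned in the Figure 6 caption.
import torch

def update_duals(lambda_k, unsafe_rate_k, eta=0.5, eps=0.02):
    """Projected ascent on the per-category duals (adaptive margins).

    lambda_k: (K,) current dual variables, initialized to zeros at the start of training.
    unsafe_rate_k: (K,) fraction of recent responses flagged unsafe by a safety
        classifier, computed per harm category.
    """
    # Tighten where a category still violates its constraint (rate above eps),
    # relax where it complies, and project back onto lambda_k >= 0.
    lambda_k = lambda_k + eta * (unsafe_rate_k - eps)
    return torch.clamp(lambda_k, min=0.0)
```

Whether the paper's relaxation is this simple projected step or the linear schedule the rebuttal mentions is exactly the detail the referee asks to see specified.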

Circularity Check

0 steps flagged

No circularity: algorithmic refinement with independent adaptive rule

Full rationale

The paper frames safety alignment as per-category constrained optimization and introduces Cat-DPO by adding an adaptive margin per harm category that tightens or relaxes according to observed unsafe outputs during training. This is presented as a direct algorithmic change to the DPO loss rather than any re-expression of fitted values or self-referential definition. No equations in the provided text reduce the claimed variance compression or aggregate gains to quantities defined by the method itself. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-citations or imported uniqueness theorems for its core claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The adaptive margin mechanism implicitly requires at least one update rule or threshold per category whose concrete form is not supplied.

pith-pipeline@v0.9.0 · 5473 in / 1190 out tokens · 52059 ms · 2026-05-10T05:30:06.065537+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  2. [2]

    Ad-llm: Benchmarking large language models for anomaly detection

    Tiankai Yang, Yi Nian, Li Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan A Rossi, Kaize Ding, et al. Ad-llm: Benchmarking large language models for anomaly detection. InFindings of the Association for Computational Linguistics: ACL 2025, pages 1524–1547, 2025

  3. [3]

    A personalized conversational benchmark: Towards simulating personalized conversations

    Li Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, et al. A personalized conversational benchmark: Towards simulating personalized conversations.arXiv preprint arXiv:2505.14106, 2025

  4. [4]

    No attacker needed: Unintentional cross-user contamination in shared-state llm agents

    Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, and Yue Zhao. No attacker needed: Unintentional cross-user contamination in shared-state llm agents. arXiv preprint arXiv:2604.01350, 2026

  5. [5]

    Auditable Agents

    Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, and Yue Zhao. Auditable agents.arXiv preprint arXiv:2604.05485, 2026

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Safe RLHF: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  9. [9]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  10. [10]

    SafeDPO: A simple approach to direct preference optimization with enhanced safety

    Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, and Moontae Lee. SafeDPO: A simple approach to direct preference optimization with enhanced safety. InThe Fourteenth International Conference on Learning Representations, 2026

  11. [11]

    Stepwise alignment for constrained language model policy optimization

    Akifumi Wachi, Thien Q Tran, Rei Sato, Takumi Tanabe, and Youhei Akimoto. Stepwise alignment for constrained language model policy optimization.Advances in Neural Information Processing Systems, 37:104471–104520, 2024

  12. [12]

    Distributionally robust neural networks

    Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. InInternational Conference on Learning Representations, 2020

  13. [13]

    Group robust preference optimization in reward-free RLHF

    Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic. Group robust preference optimization in reward-free rlhf.Advances in Neural Information Processing Systems, 37:37100–37137, 2024

  14. [14]

    Towards robust alignment of language models: Distributionally robustifying direct preference optimization

    Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Towards robust alignment of language models: Distributionally robustifying direct preference optimization. InThe Thirteenth International Conference on Learning Representations, 2025

  15. [15]

    Robust LLM alignment via distributionally robust direct preference optimization

    Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, and Deepak Ramachandran. Robust LLM alignment via distributionally robust direct preference optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  16. [16]

    Direct preference optimization with unobserved preference heterogeneity: The necessity of ternary preferences

    Keertana Chidambaram, Karthik Vinary Seetharaman, and Vasilis Syrgkanis. Direct preference optimization with unobserved preference heterogeneity: The necessity of ternary preferences. arXiv preprint arXiv:2510.15716, 2025

  17. [17]

    Pku-saferlhf: Towards multi-level safety alignment for llms with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31983–32016, 2025

  18. [18]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  19. [19]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  20. [20]

    Convex Optimization

    Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, 2004

  21. [21]

    Studies in linear and non-linear programming

    Kenneth Joseph Arrow, Leonid Hurwicz, Hirofumi Uzawa, Hollis Burnley Chenery, Selmer Johnson, and Samuel Karlin.Studies in linear and non-linear programming, volume 2. Stanford University Press Stanford, 1958

  22. [22]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  23. [23]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  24. [24]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  25. [25]

    From hard refusals to safe-completions: Toward output-centric safety training

    Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. From hard refusals to safe-completions: Toward output-centric safety training.arXiv preprint arXiv:2508.09224, 2025

  26. [26]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  27. [27]

    Stochastic approximation: a dynamical systems viewpoint

    Vivek S Borkar and Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer, 2008

  28. [28]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  29. [29]

    GNN-as-judge: Unleashing the power of LLMs for graph learning with GNN feedback

    Ruiyao Xu and Kaize Ding. GNN-as-judge: Unleashing the power of LLMs for graph learning with GNN feedback. InThe Fourteenth International Conference on Learning Representations, 2026

  30. [30]

    Coact: Co-active llm preference learning with human-ai synergy, 2026

    Ruiyao Xu, Mihir Parmar, Tiankai Yang, Zhengyu Hu, Yue Zhao, and Kaize Ding. Coact: Co-active llm preference learning with human-ai synergy, 2026

  31. [31]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  32. [32]

    Model alignment as prospect theoretic optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. InForty-first International Conference on Machine Learning, 2024

  33. [33]

    SimPO: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  34. [34]

    AlphaDPO: Adaptive reward margin for direct preference optimization

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. AlphaDPO: Adaptive reward margin for direct preference optimization. In Forty-second International Conference on Machine Learning, 2025

  35. [35]

    Margin adaptive DPO: Leveraging reward model for granular control in preference optimization

    Hyung Gyu Rho. Margin adaptive dpo: Leveraging reward model for granular control in preference optimization.arXiv preprint arXiv:2510.05342, 2025

  36. [36]

    Amapo: Adaptive margin-attached preference optimization for language model alignment, 2025

    Ruibo Deng, Duanyu Feng, and Wenqiang Lei. Amapo: Adaptive margin-attached preference optimization for language model alignment, 2025

  37. [37]

    Constrained Markov decision processes

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021

  38. [38]

    Constrained reinforcement learning has zero duality gap

    Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap.Advances in Neural Information Processing Systems, 32, 2019

  39. [39]

    Natural policy gradient primal-dual method for constrained Markov decision processes

    Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes.Advances in Neural Information Processing Systems, 33:8378–8390, 2020

  40. [40]

    Reward constrained policy optimization

    Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. InInternational Conference on Learning Representations, 2019

  41. [41]

    One-shot safety alignment for large language models via optimal dualization

    Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, and Dongsheng Ding. One-shot safety alignment for large language models via optimal dualization.Advances in Neural Information Processing Systems, 37:84350–84383, 2024

  42. [42]

    Fairness without demographics in repeated loss minimization

    Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. InInternational Conference on Machine Learning, pages 1929–1938. PMLR, 2018

  43. [43]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024

  44. [44]

    GPT-4.1

    OpenAI. GPT-4.1. https://openai.com/index/gpt-4-1/, 2025
