pith. machine review for the scientific record.

arxiv: 2604.17299 · v2 · submitted 2026-04-19 · 💻 cs.CL · cs.AI

Recognition: unknown

Cat-DPO: Category-Adaptive Safety Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords safety alignment · direct preference optimization · large language models · harm categories · adaptive margin · helpfulness · harmlessness

The pith

Cat-DPO replaces a single safety margin with separate adaptive margins for each harm category in direct preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most preference-based safety alignment applies one fixed margin across all query types, so average safety scores hide persistent weaknesses in particular harm categories. Cat-DPO treats alignment as a collection of per-category constrained problems and derives an algorithm whose margin for each category tightens while unsafe outputs persist and relaxes once the model improves on that category. The adjustment makes the training signal follow current difficulty rather than an averaged rate. Across two model backbones and six baselines, the approach raises combined helpfulness and harmlessness while shrinking both per-category variance and the best-to-worst gap. The change is offered as a direct substitution into existing direct-preference pipelines.

Core claim

Cat-DPO casts safety alignment as a per-category constrained optimization problem and supplies a direct-preference-optimization algorithm that keeps a distinct adaptive safety margin for every harm category. The margin is tightened while the model still produces unsafe responses in that category and is relaxed once performance catches up, so the gradient tracks each category's present difficulty instead of a global average. Experiments show the resulting models improve aggregate helpfulness and harmlessness while compressing per-category safety variance and narrowing the best-to-worst gap.

What carries the argument

The category-specific adaptive safety margin, which is updated during training according to the model's ongoing rate of unsafe outputs in that category and thereby redistributes optimization pressure away from already-safe categories.
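
To make the carrying mechanism concrete, the sketch below shows one plausible form of a DPO loss whose per-sample margin is assembled from per-category dual variables, in PyTorch. It is a reading of this page's description, not the paper's verified formulation: the names (cat_dpo_loss, category_mask, lambda_k, beta) are illustrative, and the exact margin construction in the paper's Equation (10) is not reproduced here.

```python
# Illustrative sketch only; the paper's exact loss (Equation (10)) is not shown
# on this page, so shapes, names, and the sign of the margin are assumptions.
import torch
import torch.nn.functional as F

def cat_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps,
                 category_mask, lambda_k, beta=0.1):
    """DPO loss with a per-sample safety margin built from per-category duals.

    policy_*_logps, ref_*_logps: (B,) summed log-probs of chosen / rejected responses.
    category_mask: (B, K) float 0/1 indicator of which harm categories each prompt activates.
    lambda_k: (K,) non-negative per-category dual variables acting as adaptive margins.
    """
    # Standard DPO preference logit.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = beta * (pi_logratios - ref_logratios)

    # Per-sample margin: categories that still produce unsafe outputs carry a
    # larger lambda_k, so their samples must open a wider preference gap before
    # they stop contributing gradient.
    margin = category_mask @ lambda_k  # (B,)

    return -F.logsigmoid(logits - margin).mean()
```

Read this as one way the margin could redistribute pressure toward lagging categories; swapping such a loss in for a standard DPO loss call is what the drop-in claim below amounts to.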

If this is right

  • Aggregate helpfulness and harmlessness scores increase relative to uniform-margin baselines.
  • Per-category safety variance decreases and the gap between the safest and least-safe categories narrows.
  • The algorithm functions as a drop-in replacement inside existing direct-preference-optimization code.
  • Training effort is automatically concentrated on categories that still generate unsafe responses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The adaptive-margin construction could be applied to other multi-objective alignment tasks that admit natural categories, such as truthfulness across domains.
  • If the per-category tracking remains stable, dataset re-balancing may become less critical for safety consistency.
  • Exposing the learned per-category margins at inference time would let users adjust safety emphasis by topic without retraining.

Load-bearing premise

An adaptive per-category margin can be maintained throughout training without introducing instability, over-refusal on borderline queries, or the need for extra category-specific hyperparameter search.

What would settle it

A controlled run in which Cat-DPO produces either larger per-category safety gaps than uniform-margin baselines, or measurable training instability and elevated over-refusal on held-out queries, would falsify the central performance claim.

Figures

Figures reproduced from arXiv: 2604.17299 by Kaize Ding, Ruiyao Xu, Tiankai Yang, Xinyuan Li, Yi Nian, Yue Zhao.

Figure 1. Aggregate harmlessness hides large per-category gaps. Per-category harmlessness (0–10) on PKU-SafeRLHF [17] for four methods trained on Alpaca-7B [7], showing the 8 worst of 19 categories and the overall average. Dashed lines mark the mean harmlessness over these categories for Cat-DPO (orange) and the strongest baseline SafeDPO (gray). Cat-DPO largely raises the average on these weak categories while matc… view at source ↗
Figure 2. Overview of Cat-DPO. Left: standard DPO trains the policy on preference pairs with a single uniform loss. Right: Cat-DPO augments the preference data with K categories and maintains a set of per-category adaptive margins. On each training batch, two steps execute jointly: (1) the policy is updated via a DPO loss whose per-sample margin is assembled from the active categories' dual variables {λk}, and (2) eac… view at source ↗
Figure 3. Per-category balance on Alpaca-7B. Four summary statistics of the per-category Safe Ratio distribution over the K=19 harm categories: Macro (unweighted mean), Worst-3 (mean of the three lowest categories), Gap (max minus min, in percentage points), and Variance (cross-category variance, scaled by 10³) under LLM-as-a-judge (top) and reward-model beaver-7b-unified-cost (bottom). Cat-DPO is the best method o… view at source ↗
Figure 4. Figure 4a shows that λk correctly tracks lagging categories: the three hardest carry a persistently higher mean λk than the three easiest throughout training. Because a larger λk shifts the effective DPO margin (Equation (10)), these categories receive stronger gradient pressure. Figure 4b confirms the … (plot axes: Training Step vs. mean λk within group) view at source ↗
Figure 5. Hyperparameter sensitivity of Cat-DPO on Alpaca-7B. LLM-as-a-judge Safe Ratio and Helpfulness. Light vertical band marks the default hyperparameter value used in the main table. Resulting effect: Cat-DPO's per-category preference probability reaches the saturation ceiling ahead of DPO-bettersafe, and the gap closes only as bettersafe catches up later in training. Figure 4c shows that this advantage is not … view at source ↗
Figure 6. Per-category dual-variable λk trajectories on the Alpaca-7B Cat-DPO run at the default (η, ϵ) = (0.5, 0.02). Each colored line is one of the 19 PKU-SafeRLHF harm categories; the thick black line is the cross-category mean; the dashed reference line marks λ = 10, the fixed margin used by SafeDPO. Categories are colored in descending order of their final-phase λk. view at source ↗
Original abstract

Aligning large language models with human preferences must balance two competing goals: responding helpfully to legitimate requests and reliably refusing harmful ones. Most preference-based safety alignment methods collapse safety into a single scalar that is applied uniformly to every preference pair. The result is a model that looks safe on average but stays relatively unsafe on a minority of harm categories. We cast safety alignment as a per-category constrained optimization problem and derive Cat-DPO, a direct-preference-optimization algorithm with a separate adaptive safety margin for each harm category. The margin tightens when the model still produces unsafe responses on a category and relaxes once the model catches up, so the training signal tracks each category's current difficulty rather than averaging under one global rate. Across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness and compresses per-category safety variance and the best-to-worst gap, offering a drop-in per-category refinement of direct preference safety alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Cat-DPO, a direct preference optimization variant that formulates safety alignment as a per-category constrained optimization problem. It introduces separate adaptive safety margins per harm category that tighten when the model produces unsafe responses in that category and relax once performance improves, so the training signal tracks category-specific difficulty rather than a single global rate. The central empirical claim is that, across two LLM backbones and six preference-learning baselines, Cat-DPO improves aggregate helpfulness and harmlessness while compressing per-category safety variance and the best-to-worst gap.

Significance. If the empirical results and stability claims hold, Cat-DPO would provide a lightweight, drop-in refinement to existing DPO-style safety alignment that addresses uneven performance across harm categories without requiring new architectures or loss functions.

major comments (2)
  1. [Abstract] The central claim of improved aggregate metrics and compressed per-category variance is asserted without any quantitative results, error bars, dataset sizes, or description of margin initialization/update rules, making it impossible to evaluate support for the claim or rule out confounds such as category imbalance.
  2. [Method] Margin adaptation: the update dynamics for the per-category margins (step size, unsafe-response detection threshold, relaxation schedule) are not specified, so it is impossible to verify the assumption that adaptation remains stable across categories without oscillation, over-refusal, or per-category hyperparameter search that would negate the claimed simplicity.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the specific harm categories and preference datasets used in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for your constructive review of our manuscript on Cat-DPO. We appreciate the feedback on the need for greater quantitative support in the abstract and explicit parameterization in the method section. We address each major comment below and will incorporate revisions to improve clarity and verifiability.

Point-by-point responses
  1. Referee: [Abstract] The central claim of improved aggregate metrics and compressed per-category variance is asserted without any quantitative results, error bars, dataset sizes, or description of margin initialization/update rules, making it impossible to evaluate support for the claim or rule out confounds such as category imbalance.

    Authors: We agree that the abstract would be strengthened by including supporting quantitative details. In the revised manuscript we will add key results such as the observed improvements in aggregate helpfulness and harmlessness, the measured compression of per-category safety variance, and the reduction in the best-to-worst gap, together with error bars from our multi-seed experiments. Dataset sizes and category balance are already reported in the experimental setup; we will also insert a concise reference to margin initialization (zero start) and the per-category update rule, directing readers to the method section for the full formulation. These additions will allow direct evaluation of the claims from the abstract while preserving its brevity. revision: yes

  2. Referee: [Method] Margin adaptation: the update dynamics for the per-category margins (step size, unsafe-response detection threshold, relaxation schedule) are not specified, so it is impossible to verify the assumption that adaptation remains stable across categories without oscillation, over-refusal, or per-category hyperparameter search that would negate the claimed simplicity.

    Authors: The manuscript currently describes the adaptation process at a conceptual level. We acknowledge that explicit values and schedules for step size, the unsafe-response detection threshold, and the relaxation rule are not stated. In the revision we will supply the precise update equations, the fixed step size employed, the threshold used by the safety classifier for flagging unsafe outputs, and the linear relaxation schedule applied when category performance improves. We will also add a short stability analysis and an ablation confirming that a single global hyperparameter set suffices across categories, thereby supporting the claimed simplicity without per-category tuning. revision: yes
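
For illustration, a dual update of the kind this response describes (zero-initialized duals, a fixed step size, a classifier-derived unsafe rate, relaxation once a category complies) might look like the sketch below. The pairing with the defaults (η, ε) = (0.5, 0.02) echoes the Figure 6 caption; everything else, including the projection step and the batch-level unsafe rate, is an assumption rather than the paper's stated rule.

```python
# Illustrative per-category dual update, not the paper's verified schedule.
# Assumes zero-initialized duals, a fixed step size eta, and a tolerated unsafe
# rate eps; the defaults mirror the (0.5, 0.02) mentioned in the Figure 6 caption.
import torch

def update_duals(lambda_k, unsafe_rate_k, eta=0.5, eps=0.02):
    """Projected ascent on the per-category duals (adaptive margins).

    lambda_k: (K,) current dual variables, initialized to zeros at the start of training.
    unsafe_rate_k: (K,) fraction of recent responses flagged unsafe by a safety
        classifier, computed per harm category.
    """
    # Tighten where a category still violates its constraint (rate above eps),
    # relax where it complies, and project back onto lambda_k >= 0.
    lambda_k = lambda_k + eta * (unsafe_rate_k - eps)
    return torch.clamp(lambda_k, min=0.0)
```

Whether the paper's relaxation is this simple projected step or the linear schedule the rebuttal mentions is exactly the detail the referee asks to see specified.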

Circularity Check

0 steps flagged

No circularity: algorithmic refinement with independent adaptive rule

Full rationale

The paper frames safety alignment as per-category constrained optimization and introduces Cat-DPO by adding an adaptive margin per harm category that tightens or relaxes according to observed unsafe outputs during training. This is presented as a direct algorithmic change to the DPO loss rather than any re-expression of fitted values or self-referential definition. No equations in the provided text reduce the claimed variance compression or aggregate gains to quantities defined by the method itself. The derivation chain remains self-contained against external benchmarks and does not rely on load-bearing self-citations or imported uniqueness theorems for its core claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The adaptive margin mechanism implicitly requires at least one update rule or threshold per category whose concrete form is not supplied.

pith-pipeline@v0.9.0 · 5473 in / 1190 out tokens · 52059 ms · 2026-05-10T05:30:06.065537+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

47 extracted references · 12 canonical work pages · 6 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  2. [2]

    Ad-llm: Benchmarking large language models for anomaly detection

    Tiankai Yang, Yi Nian, Li Li, Ruiyao Xu, Yuangang Li, Jiaqi Li, Zhuo Xiao, Xiyang Hu, Ryan A Rossi, Kaize Ding, et al. Ad-llm: Benchmarking large language models for anomaly detection. InFindings of the Association for Computational Linguistics: ACL 2025, pages 1524–1547, 2025

  3. [3]

    A personalized conversational benchmark: Towards simulating personalized conversations

    Li Li, Peilin Cai, Ryan A Rossi, Franck Dernoncourt, Branislav Kveton, Junda Wu, Tong Yu, Linxin Song, Tiankai Yang, Yuehan Qin, et al. A personalized conversational benchmark: Towards simulating personalized conversations.arXiv preprint arXiv:2505.14106, 2025

  4. [4]

    No attacker needed: Unintentional cross-user contamination in shared-state llm agents

    Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, and Yue Zhao. No attacker needed: Unintentional cross-user contamination in shared-state llm agents. arXiv preprint arXiv:2604.01350, 2026

  5. [5]

    Auditable Agents

    Yi Nian, Aojie Yuan, Haiyue Zhang, Jiate Li, and Yue Zhao. Auditable agents.arXiv preprint arXiv:2604.05485, 2026

  6. [6]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  7. [7]

    Safe RLHF: Safe reinforcement learning from human feedback

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: Safe reinforcement learning from human feedback. InThe Twelfth International Conference on Learning Representations, 2024

  8. [8]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

  9. [9]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  10. [10]

    SafeDPO: A simple approach to direct preference optimization with enhanced safety

    Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, and Moontae Lee. SafeDPO: A simple approach to direct preference optimization with enhanced safety. InThe Fourteenth International Conference on Learning Representations, 2026

  11. [11]

    Stepwise alignment for constrained language model policy optimization

    Akifumi Wachi, Thien Q Tran, Rei Sato, Takumi Tanabe, and Youhei Akimoto. Stepwise alignment for constrained language model policy optimization.Advances in Neural Information Processing Systems, 37:104471–104520, 2024

  12. [12]

    Distributionally robust neural networks

    Shiori Sagawa*, Pang Wei Koh*, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks. InInternational Conference on Learning Representations, 2020

  13. [13]

    Group robust preference optimization in reward-free RLHF

    Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, and Ilija Bogunovic. Group robust preference optimization in reward-free rlhf.Advances in Neural Information Processing Systems, 37:37100–37137, 2024

  14. [14]

    Towards robust alignment of language models: Distributionally robustifying direct preference optimization

    Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Towards robust alignment of language models: Distributionally robustifying direct preference optimization. InThe Thirteenth International Conference on Learning Representations, 2025

  15. [15]

    Robust LLM alignment via distributionally robust direct preference optimization

    Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, and Deepak Ramachandran. Robust LLM alignment via distributionally robust direct preference optimization. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  16. [16]

    Direct preference optimization with unobserved preference heterogeneity: The necessity of ternary preferences

    Keertana Chidambaram, Karthik Vinary Seetharaman, and Vasilis Syrgkanis. Direct preference optimization with unobserved preference heterogeneity: The necessity of ternary preferences. arXiv preprint arXiv:2510.15716, 2025

  17. [17]

    Pku-saferlhf: Towards multi-level safety alignment for llms with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 31983–32016, 2025

  18. [18]

    Rank analysis of incomplete block designs: I

    Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–345, 1952

  19. [19]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019

  20. [20]

    Convex Optimization

    Stephen Boyd and Lieven Vandenberghe.Convex Optimization. Cambridge University Press, 2004

  21. [21]

    Studies in linear and non-linear programming

    Kenneth Joseph Arrow, Leonid Hurwicz, Hirofumi Uzawa, Hollis Burnley Chenery, Selmer Johnson, and Samuel Karlin.Studies in linear and non-linear programming, volume 2. Stanford University Press Stanford, 1958

  22. [22]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  23. [23]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  24. [24]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  25. [25]

    From hard refusals to safe-completions: Toward output-centric safety training

    Yuan Yuan, Tina Sriskandarajah, Anna-Luisa Brakman, Alec Helyar, Alex Beutel, Andrea Vallone, and Saachi Jain. From hard refusals to safe-completions: Toward output-centric safety training.arXiv preprint arXiv:2508.09224, 2025

  26. [26]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  27. [27]

    Stochastic approximation: a dynamical systems viewpoint

    Vivek S Borkar and Vivek S Borkar.Stochastic approximation: a dynamical systems viewpoint, volume 100. Springer, 2008

  28. [28]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  29. [29]

    GNN-as-judge: Unleashing the power of LLMs for graph learning with GNN feedback

    Ruiyao Xu and Kaize Ding. GNN-as-judge: Unleashing the power of LLMs for graph learning with GNN feedback. InThe Fourteenth International Conference on Learning Representations, 2026

  30. [30]

    Coact: Co-active llm preference learning with human-ai synergy, 2026

    Ruiyao Xu, Mihir Parmar, Tiankai Yang, Zhengyu Hu, Yue Zhao, and Kaize Ding. Coact: Co-active llm preference learning with human-ai synergy, 2026

  31. [31]

    A general theoretical paradigm to understand learning from human preferences

    Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024

  32. [32]

    Model alignment as prospect theoretic optimization

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Model alignment as prospect theoretic optimization. InForty-first International Conference on Machine Learning, 2024

  33. [33]

    SimPO: Simple preference optimization with a reference-free reward

    Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198–124235, 2024

  34. [34]

    AlphaDPO: Adaptive reward margin for direct preference optimization

    Junkang Wu, Xue Wang, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. AlphaDPO: Adaptive reward margin for direct preference optimization. In Forty-second International Conference on Machine Learning, 2025

  35. [35]

    Margin adaptive DPO: Leveraging reward model for granular control in preference optimization

    Hyung Gyu Rho. Margin adaptive dpo: Leveraging reward model for granular control in preference optimization.arXiv preprint arXiv:2510.05342, 2025

  36. [36]

    Amapo: Adaptive margin-attached preference optimization for language model alignment, 2025

    Ruibo Deng, Duanyu Feng, and Wenqiang Lei. Amapo: Adaptive margin-attached preference optimization for language model alignment, 2025

  37. [37]

    Constrained Markov decision processes

    Eitan Altman.Constrained Markov decision processes. Routledge, 2021

  38. [38]

    Constrained reinforcement learning has zero duality gap

    Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap.Advances in Neural Information Processing Systems, 32, 2019

  39. [39]

    Natural policy gradient primal-dual method for constrained Markov decision processes

    Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes.Advances in Neural Information Processing Systems, 33:8378–8390, 2020

  40. [40]

    Reward constrained policy optimization

    Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. InInternational Conference on Learning Representations, 2019

  41. [41]

    One-shot safety alignment for large language models via optimal dualization

    Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, and Dongsheng Ding. One-shot safety alignment for large language models via optimal dualization.Advances in Neural Information Processing Systems, 37:84350–84383, 2024

  42. [42]

    Fairness without demographics in repeated loss minimization

    Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without demographics in repeated loss minimization. InInternational Conference on Machine Learning, pages 1929–1938. PMLR, 2018

  43. [43]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024

  44. [44]

    GPT-4.1

    OpenAI. GPT-4.1. https://openai.com/index/gpt-4-1/, 2025
