DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

Qingqing Luan; Qi Pan; Shenzhe Zhu; Tiankai Yang; Xiangliang Zhang; Yi Nian; Yudi Zhang; Yue Huang; Yue Zhao; Zelong Xu

arxiv: 2606.07678 · v1 · pith:J2AKTPXRnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-28 02:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: safety alignment, preference data selection, DPO, geometric decomposition, data efficiency, LLM alignment, subspace analysis, directional signals


The pith

Representing preference pairs as directions in representation space and decomposing them into a global anchor subspace plus residuals lets DOG-DPO recover full safety alignment from 11 percent of the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current data-selection methods for safety alignment lose directional structure by scoring pairs independently, especially when several datasets share safety signals alongside unique risks. DOG-DPO instead treats each pair as a vector direction, factors the combined geometry into one shared anchor subspace and dataset-specific residual subspaces, then picks a minimal subset that maximizes directional coverage. This selection runs without any training or teacher model. On six safety benchmarks and two model backbones the selected 11 percent of pairs produces utility-robustness trade-offs close to those obtained from the full set. The approach therefore reframes multi-dataset alignment as a geometric covering problem rather than a scalar quality-ranking problem.

Core claim

DOG-DPO represents each preference pair as a direction in model representation space, decomposes the multi-dataset collection of directions into a global anchor subspace that captures shared safety signals and dataset-specific residual subspaces that capture unique risks, and then selects a compact subset by maximizing diversity-based coverage of those directions before standard DPO training.
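As a rough illustration of the first two steps in this claim, the sketch below builds unit-norm pair directions and splits them into an anchor basis plus per-dataset residual bases. It is not the authors' implementation. The truncated SVD, the L2 normalisation, the rank arguments `anchor_rank` and `residual_rank`, and all function names are assumptions made for this example; only the definition z = h+ − h− and the choice of the largest dataset as anchor are taken from the paper's Figure 1 caption.

```python
import numpy as np


def pair_directions(h_pos, h_neg):
    """Direction per pair, z_i = h_i^+ - h_i^-, scaled to unit norm.

    The unit-norm scaling is an assumption of this sketch.
    """
    z = np.asarray(h_pos, dtype=float) - np.asarray(h_neg, dtype=float)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.clip(norms, 1e-12, None)


def top_basis(Z, rank):
    """Orthonormal (d x rank) basis of the leading right-singular directions of Z (n x d)."""
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return vt[:rank].T


def anchor_residual_bases(directions_by_dataset, anchor_rank, residual_rank):
    """Return (anchor dataset name, anchor basis B, {dataset: residual basis T_v})."""
    # The anchor dataset is the largest one, as in the Figure 1 caption.
    anchor_name = max(directions_by_dataset,
                      key=lambda name: len(directions_by_dataset[name]))
    B = top_basis(directions_by_dataset[anchor_name], anchor_rank)

    residual_bases = {}
    for name, Z in directions_by_dataset.items():
        # Remove the component of every direction that lies in span(B).
        R = Z - (Z @ B) @ B.T
        # Leading directions of what is left; orthogonal to B as long as
        # residual_rank does not exceed the rank of R.
        residual_bases[name] = top_basis(R, residual_rank)
    return anchor_name, B, residual_bases
```

Here `h_pos` and `h_neg` stand for hidden representations of the chosen and rejected responses, one row per pair. How those representations are pooled from the frozen backbone is not specified in the text above and is left to the caller.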

What carries the argument

Decomposition of multi-dataset preference directions into a global anchor subspace plus residual subspaces, followed by diversity-based subset selection on the resulting directional signals.

If this is right

The selected 11 percent subset recovers most of the safety gains of full-data training across six benchmarks.
The same selection procedure works on two different model backbones without retraining.
Selection requires no teacher model and no additional gradient steps.
The method runs substantially faster than representative selection baselines while preserving the utility-robustness trade-off.
Diversity-based coverage of directional subspaces replaces independent scalar scoring of pairs.
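To make the last point concrete, the toy sketch below contrasts independent scalar ranking with a diversity-based pick. The greedy log-determinant rule used here is a standard DPP-style heuristic chosen for illustration; the text above does not state the paper's exact criterion, and the feature matrix `Phi`, the `scores` vector, and both function names are hypothetical.

```python
import numpy as np


def greedy_logdet_select(Phi, k):
    """Greedily pick up to k rows of Phi (n x d) to maximise log det of their Gram matrix.

    Adding a row multiplies the Gram determinant by the squared norm of that
    row's component orthogonal to the rows already chosen, so each step takes
    the row with the largest remaining residual norm (pivoted Gram-Schmidt).
    """
    R = np.array(Phi, dtype=float)  # residuals of every row w.r.t. the chosen span
    chosen = []
    for _ in range(k):
        gains = np.einsum("ij,ij->i", R, R)
        if chosen:
            gains[chosen] = -np.inf  # never pick the same row twice
        i = int(np.argmax(gains))
        if gains[i] <= 1e-12:
            break  # every remaining row is already covered
        chosen.append(i)
        q = R[i] / np.sqrt(gains[i])
        R = R - np.outer(R @ q, q)  # project the new direction out of all rows
    return chosen


def topk_by_score(scores, k):
    """Independent scalar ranking: the k highest-scoring items."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return [int(i) for i in order[:k]]


# Toy data: three near-duplicate directions with high scores, two distinct ones with lower scores.
Phi = np.array([
    [1.00, 0.00, 0.00],
    [0.99, 0.05, 0.00],
    [0.98, 0.00, 0.05],
    [0.00, 0.80, 0.00],
    [0.00, 0.00, 0.70],
])
scores = [0.95, 0.94, 0.93, 0.60, 0.50]  # hypothetical per-pair quality scores

print("top-k by score:", topk_by_score(scores, 3))
print("greedy log-det:", greedy_logdet_select(Phi, 3))
```

On this input the scalar ranking returns the three near-duplicates (rows 0, 1, 2), while the greedy rule returns one row per axis (rows 0, 3, 4), which is the redundancy-versus-coverage distinction the bullet describes.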

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same directional decomposition could be applied to preference data for objectives other than safety, such as helpfulness or factual accuracy.
If the anchor subspace remains stable across model scales, a single selected subset might transfer between different base models.
The geometric framing suggests that alignment datasets possess an intrinsic low-dimensional directional structure that scalar selection methods have overlooked.

Load-bearing premise

That the directional signals carried by preference pairs can be cleanly separated into a shared global subspace and dataset-specific residuals without discarding the safety information required for alignment.

What would settle it

Train a model on the DOG-DPO-selected 11 percent subset and compare its safety benchmark scores to the model trained on the full set; if the reduced-data model shows substantially lower robustness on held-out tests, the geometric selection has lost critical information.

Figures

Figures reproduced from arXiv: 2606.07678 by Qingqing Luan, Qi Pan, Shenzhe Zhu, Tiankai Yang, Xiangliang Zhang, Yi Nian, Yudi Zhang, Yue Huang, Yue Zhao, Zelong Xu.

**Figure 1.** Overview of DOG-DPO. Step 1: each preference pair (x, y+, y−) is represented as a directional vector z = h+ − h− in representation space, encoding the alignment direction. Step 2: an anchor basis B is extracted from the largest dataset D_anchor, and per-dataset residual bases T_v capture dataset-specific variation orthogonal to the anchor. This ϕ_i is the concrete instantiation used in J(S) (Eq. 3); its ma…

**Figure 2.** DOG vs. DOG-D in the anchor–residual plane. DOG concentrates on Pareto-frontier samples; DOG-D spreads across the plane.

Excerpt from Section 3.1 (Datasets) of the paper, captured with the caption: We evaluate our method on a diverse set of safety and robustness benchmarks, covering both standard alignment metrics and adversarial attack scenarios. Unless otherwise noted, the representation extractor is the same frozen backbone as the downstream DPO mode…

**Figure 3.** Runtime comparison on identical preference …

**Figure 4.** Dynamic DPP behaviour across models. DOG-D maintains a positive diversity-gain gap over Random across the ranking sweep, while DOG saturates earlier. Annotations at k ∈ {5k, 10k, 30k} show the corresponding downstream safety scores.

[Plot text captured with the caption, from the anchor rotation ablation (Llama-3.2-3B, K=10k): x-axis categories WJ-asr, JBB-GPT, HB-kw, HB-GPT, AD-GPT; y-axis "Attack success rate (lower = safer)", ticks from 0.000 to 0.175.]

**Figure 5.** Anchor-rotation robustness. Safety metrics …

**Figure 7.** DPO training dynamics on the DOG-D 30k subset. DPO loss decreases monotonically and the reward margin rises smoothly on both backbones.

Original abstract

Safety alignment for large language models relies on preference data, but current pipelines often train on large, redundant datasets. Existing data selection methods typically score each preference pair independently, collapsing directional preference information into scalar quality or diversity scores. This sample-centric view is especially limiting in multi-dataset settings, where shared safety directions coexist with dataset-specific residual risks. We propose DOG-DPO, a training-free data selection framework that treats preference pairs as structured geometric signals. DOG-DPO first represents each preference pair as a direction in model representation space. It then decomposes multi-dataset preference geometry into a global anchor subspace and dataset-specific residual subspaces. Finally, it selects subsets by maximizing diversity-based coverage, encouraging broad, non-redundant coverage of alignment directions before DPO training. Across six safety benchmarks and two model backbones, DOG-DPO achieves a strong utility-robustness trade-off using only 11% of the preference pairs. It recovers most of the safety gains of full-data training while remaining entirely teacher-free, training-free, and substantially faster than representative selection baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DOG-DPO frames safety data selection as direction geometry with a global-plus-residual split and claims that 11% of the data suffices, but the abstract gives almost no mechanics or validation, so the results remain hard to assess.


The paper's main move is to treat preference pairs as directions in representation space rather than scalar scores, then decompose multi-dataset geometry into one global anchor subspace plus dataset-specific residuals before picking a diverse subset for DPO.

That decomposition is the clearest novelty. It directly targets the shared-versus-specific structure that appears when safety data comes from several sources, which scalar methods ignore.

The practical upside is also straightforward: the whole selection step is training-free and teacher-free, and the reported outcome is that 11% of the pairs recover most of the safety performance on six benchmarks across two backbones.

The gaps are large. The abstract supplies no description of how directions are computed, how subspace rank or orthogonality is chosen, or what the diversity criterion actually measures. Without those steps the central claim cannot be checked. The assumption that residuals cleanly isolate unique risks while the global subspace holds the shared safety signal is plausible but untested in the given text, and the stress-test worry about losing critical directions looks reasonable until the full method and ablations are shown.

The work is aimed at people who already run multi-source preference pipelines and want to cut data volume. A reader focused on geometric data selection or alignment efficiency would get the most from it, provided the implementation details survive scrutiny.

It deserves peer review so the algorithm, the subspace procedure, and the experimental controls can be examined directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes DOG-DPO, a training-free geometric data selection method for DPO-based safety alignment. Preference pairs are represented as directions in model representation space; multi-dataset geometry is decomposed into a global anchor subspace plus dataset-specific residual subspaces; subsets are then chosen by maximizing diversity-based coverage of these directions. The central empirical claim is that this procedure recovers most safety gains of full-data DPO training on six benchmarks across two model backbones while using only 11% of the preference pairs.

Significance. If the subspace decomposition and diversity selection provably preserve safety-critical directions without loss, the approach would materially reduce the data and compute needed for robust alignment, offering a scalable, teacher-free alternative to existing selection heuristics.

major comments (2)

[Abstract / Method] Abstract and method description: the claim that the global-anchor-plus-residual decomposition plus diversity selection preserves alignment information is load-bearing, yet no concrete procedure is supplied for computing the directions, choosing subspace ranks, enforcing orthogonality, or validating that selected directions drive downstream safety gains; without these the 11% recovery result cannot be checked.
[Method / Experiments] The skeptic concern is not resolved in the provided text: if the global anchor absorbs shared safety signals or if residuals mix noise, the diversity metric (unspecified) may select subsets that fail to recover the full-data utility-robustness trade-off; no ablation or diagnostic is described that tests this assumption.
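One simple diagnostic for the second concern is to measure how much of each dataset's directional energy falls inside the anchor subspace. The sketch below is an editorial illustration, not a procedure from the paper; `anchor_energy_fraction` and `energy_report` are hypothetical names, and `B` is assumed to be an orthonormal anchor basis with one column per anchor direction.

```python
import numpy as np


def anchor_energy_fraction(Z, B):
    """Share of the total squared norm of the rows of Z (n x d) that lies in span(B).

    B must be a (d x r) matrix with orthonormal columns.
    """
    Z = np.asarray(Z, dtype=float)
    total = float(np.sum(Z * Z))
    inside = float(np.sum((Z @ B) ** 2))
    return inside / total if total > 0 else 0.0


def energy_report(directions_by_dataset, B):
    """Anchor-energy fraction for every dataset, keyed by dataset name."""
    return {name: anchor_energy_fraction(Z, B)
            for name, Z in directions_by_dataset.items()}
```

A fraction close to 1 for every dataset would suggest the residual bases have little left to capture, while a fraction close to 0 for some dataset would suggest the anchor is not shared with it. Neither outcome by itself shows that safety-relevant signal is preserved; that still requires the downstream ablation the referee asks for.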

minor comments (1)

[Method] Notation for the direction vectors and subspace projections should be defined explicitly with equations rather than prose descriptions.
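One way the requested notation could be written is sketched below. Only the direction z = h+ − h−, the anchor basis B, the residual bases T_v, the feature ϕ_i and the objective name J(S) appear in the paper's Figure 1 caption. The dataset index v(i), the stacked form of ϕ_i, the matrix M_S, the ridge term ε and the log-determinant form of J are assumptions made for illustration and should not be read as the paper's Eq. 3.

```latex
\begin{align}
  z_i    &= h_i^{+} - h_i^{-}
         && \text{(direction of pair } i\text{)} \\
  a_i    &= B^{\top} z_i, \qquad
  r_i     = \bigl(I - B B^{\top}\bigr) z_i
         && \text{(anchor coordinates and residual)} \\
  \phi_i &= \begin{bmatrix} a_i \\ T_{v(i)}^{\top} r_i \end{bmatrix}
         && \text{(assumed feature; } v(i) \text{ is the dataset of pair } i\text{)} \\
  M_S    &= \sum_{i \in S} \phi_i \phi_i^{\top} + \epsilon I, \qquad
  J(S)    = \log\det M_S
         && \text{(assumed coverage objective)}
\end{align}
```

With orthonormal B, the residual r_i is orthogonal to every anchor direction, so the two blocks of ϕ_i do not double-count the same component.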

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and add necessary details and diagnostics.


Referee: [Abstract / Method] Abstract and method description: the claim that the global-anchor-plus-residual decomposition plus diversity selection preserves alignment information is load-bearing, yet no concrete procedure is supplied for computing the directions, choosing subspace ranks, enforcing orthogonality, or validating that selected directions drive downstream safety gains; without these the 11% recovery result cannot be checked.

Authors: We agree that the abstract and method overview are high-level and that explicit implementation details are required for the claims to be verifiable. The manuscript describes representing pairs as directions in representation space and performing the decomposition, but does not supply the full algorithmic steps for direction computation, rank selection, orthogonality, or validation. In revision we will expand the method section with a precise procedure covering these elements and add a short validation correlating selected directions to benchmark gains. revision: yes

Referee: [Method / Experiments] The skeptic concern is not resolved in the provided text: if the global anchor absorbs shared safety signals or if residuals mix noise, the diversity metric (unspecified) may select subsets that fail to recover the full-data utility-robustness trade-off; no ablation or diagnostic is described that tests this assumption.

Authors: This concern is valid and the current manuscript does not contain ablations or diagnostics that directly test whether the anchor/residual split preserves safety signals or whether the diversity criterion avoids noisy subsets. We will add an ablation study that varies the anchor/residual allocation and reports the resulting safety recovery, together with a diagnostic that compares the selected subset against random selection on the utility-robustness trade-off. revision: yes
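A cheap pre-training version of the promised random-selection comparison is sketched below. It scores a candidate subset against random subsets of the same size on a log-determinant coverage proxy, not on the utility-robustness trade-off itself, which would still require training and benchmarking. The proxy, the ridge term `eps`, and all function names are assumptions for this example.

```python
import numpy as np


def logdet_coverage(Phi, idx, eps=1e-6):
    """log det of (sum of outer products of the selected rows of Phi + eps * I)."""
    X = np.asarray(Phi, dtype=float)[list(idx)]
    M = X.T @ X + eps * np.eye(X.shape[1])
    _, value = np.linalg.slogdet(M)  # M is positive definite, so the sign is +1
    return float(value)


def random_baseline(Phi, k, trials=100, seed=0):
    """Mean and standard deviation of the coverage of random size-k subsets."""
    rng = np.random.default_rng(seed)
    n = len(Phi)
    values = [logdet_coverage(Phi, rng.choice(n, size=k, replace=False))
              for _ in range(trials)]
    return float(np.mean(values)), float(np.std(values))


def coverage_gap(Phi, idx, trials=100, seed=0):
    """Coverage of the candidate subset minus the mean coverage of random subsets."""
    mean, _ = random_baseline(Phi, len(idx), trials=trials, seed=seed)
    return logdet_coverage(Phi, idx) - mean
```

A positive `coverage_gap` says only that the candidate subset spans more directions than chance under this proxy. Whether that gap translates into safety gains is the empirical question the ablation is meant to answer.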

Circularity Check

0 steps flagged

No circularity: training-free geometric selection is self-contained


The paper's method represents preference pairs as directions, decomposes geometry into global anchor plus residual subspaces, and selects via diversity coverage before DPO. These operations are described as procedural and training-free with no equations shown that equate a derived quantity to a fitted input by construction, no self-citation chains invoked for uniqueness or ansatz, and no renaming of known results. The 11% recovery claim is presented as an empirical outcome across benchmarks rather than a definitional tautology. This is the common honest case of a self-contained algorithmic pipeline.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no concrete free parameters, axioms, or invented entities; the geometric representation and subspace decomposition are described at a conceptual level only.

pith-pipeline@v0.9.1-grok · 5739 in / 1120 out tokens · 52040 ms · 2026-06-28T02:25:52.208123+00:00 · methodology


