
arxiv: 2605.24154 · v1 · pith:KCI4NETTnew · submitted 2026-05-22 · 💻 cs.AI · cs.SE

Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Pith reviewed 2026-06-30 15:59 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords safety alignment · refusal behavior · modular adaptation · parameter merging · large language models · controllable safety · professional domains · on-demand authorization

The pith

Palette identifies a refusal direction in LLMs and internalizes domain-specific relaxations through lightweight adaptation and parameter merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that foundation models can move beyond a uniform refusal policy by selectively relaxing safety behavior only for authorized professional contexts. It does so by locating a refusal direction through multi-objective search, embedding the change via efficient adaptation, and then combining multiple such changes through parameter merging. A reader would care because this approach promises to unlock legitimate professional uses of models in fields such as medicine or law while retaining strict refusals for ordinary users. The experiments test the claim on four safety benchmarks, several model families, and both language and vision-language models. Success would mean models become practical for specialized settings without repeated full retraining or added inference cost.

Core claim

Palette shows that a refusal direction located by multi-objective search can be internalized into an LLM or VLM via lightweight adaptation and then composed across domains by parameter merging, producing precise on-demand safety relaxation for authorized target domains while leaving general safety and utility unchanged.

What carries the argument

Refusal direction identified by multi-objective search, internalized by lightweight adaptation, and composed by parameter merging.
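
As a rough illustration of what a multi-objective search over refusal-direction candidates could look like, the sketch below keeps only the Pareto-optimal candidates and then picks one by a weighted sum. The three scores (target-domain compliance, refusal retained on disallowed prompts, general utility), the toy values, and the weights are all assumptions made for this example; the summary does not state the paper's actual objectives or search procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical scores for one candidate direction (higher is better)."""
    name: str
    target_compliance: float   # share of authorized-domain prompts answered
    disallowed_refusal: float  # share of disallowed prompts still refused
    utility: float             # general benchmark accuracy

    def scores(self) -> tuple[float, float, float]:
        return (self.target_compliance, self.disallowed_refusal, self.utility)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every score and better on one."""
    pairs = list(zip(a.scores(), b.scores()))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(cands: list[Candidate]) -> list[Candidate]:
    """Keep candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def pick(cands: list[Candidate], weights=(1.0, 1.0, 1.0)) -> Candidate:
    """Choose one front member by weighted sum (an assumed tie-breaker)."""
    front = pareto_front(cands)
    return max(front, key=lambda c: sum(w * s for w, s in zip(weights, c.scores())))


# Toy values, invented for illustration only.
candidates = [
    Candidate("d0", 0.90, 0.40, 0.60),  # relaxes a lot, but leaks on disallowed prompts
    Candidate("d1", 0.85, 0.95, 0.62),
    Candidate("d2", 0.30, 0.97, 0.63),  # safe, but barely relaxes the target domain
    Candidate("d3", 0.80, 0.90, 0.61),  # dominated by d1
]
front_names = [c.name for c in pareto_front(candidates)]
best = pick(candidates)
print(front_names, best.name)
```

The Pareto filter makes the trade-off explicit: a candidate that relaxes the target domain only by giving up refusals elsewhere stays visible on the front, while the final weighted choice decides how much of that trade is acceptable.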

If this is right

  • Domain-specific safety controls can be learned once and then activated or deactivated on demand without retraining the base model.
  • The same modular controls work across multiple model scales and both pure language and vision-language models.
  • General capabilities and performance on non-target safety tasks remain intact after adaptation and merging.
  • Foundation models can be configured for diverse professional requirements through composition rather than separate training runs.
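
The on-demand composition described in the points above can be sketched as adding per-domain weight deltas to a shared base. This is a minimal sketch under a stated assumption: it uses plain additive merging of deltas, which may differ from the merging rule the paper actually uses. The parameter name `layer0.weight`, the domain names, and all numbers are hypothetical.

```python
def merge(base, deltas, active, scale=1.0):
    """Return base weights plus the scaled deltas of the active domains.

    base:   {param_name: [float, ...]}
    deltas: {domain: {param_name: [float, ...]}}
    active: domains authorized for this deployment
    """
    merged = {name: list(values) for name, values in base.items()}
    for domain in active:
        for name, delta in deltas[domain].items():
            merged[name] = [w + scale * d for w, d in zip(merged[name], delta)]
    return merged


# Toy weights; each domain delta is assumed to have been learned independently.
base = {"layer0.weight": [0.5, -0.25, 1.0]}
deltas = {
    "medical": {"layer0.weight": [0.25, 0.0, -0.5]},
    "legal": {"layer0.weight": [0.0, 0.125, 0.25]},
}

general = merge(base, deltas, active=[])                   # default policy
medical_only = merge(base, deltas, active=["medical"])     # one authorization
both = merge(base, deltas, active=["medical", "legal"])    # composed authorizations
print(general, medical_only, both)
```

Because the base weights are never modified in place, deactivating a domain amounts to merging again without its delta.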

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The merging technique could be tested on alignment properties other than safety, such as output style or domain knowledge boundaries.
  • If merging scales cleanly, organizations might maintain one base model and swap small control sets instead of hosting many specialized instances.
  • Real-world deployment would need checks for whether merged controls remain stable when users combine them in unexpected ways.

Load-bearing premise

A direction found by search can be added to the model through adaptation and merging without creating interference or unintended safety loss in non-target domains.

What would settle it

Merging two or more domain-specific adaptations produces measurable refusal leakage or accuracy drop on a held-out general safety benchmark that was not observed before merging.
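
A check of this kind could be scripted roughly as follows. The sketch compares refusal rates on the same held-out prompts before and after merging and flags a drop beyond a tolerance. The keyword-based refusal detector, the 0.05 tolerance, and the example responses are assumptions for illustration; a real evaluation would use the benchmark's own judge.

```python
# Crude, hypothetical refusal markers; real benchmarks use stronger judges.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def leakage(before: list[str], after: list[str], tolerance: float = 0.05):
    """Return (drop in refusal rate, whether the drop exceeds the tolerance)."""
    drop = refusal_rate(before) - refusal_rate(after)
    return drop, drop > tolerance


# Invented responses to the same four held-out disallowed prompts.
before_merge = ["I cannot help with that."] * 4
after_merge = [
    "I cannot help with that.",
    "I can't assist with this request.",
    "Sure, here is an overview.",
    "Sure, the first step is as follows.",
]
drop, flagged = leakage(before_merge, after_merge)
print(drop, flagged)
```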

Figures

Figures reproduced from arXiv: 2605.24154 by Arash Akbari, Arman Akbari, Geng Yuan, Jin Lu, Lingzi Hong, Qitao Tan, Xiaoming Zhai, Xiaoying Song, Yanzhi Wang, Zhen Xiang.

Figure 1: Illustration of the desired refusal relaxation for an authorized target domain.
Figure 2: Overview of PALETTE, which mainly consists of three steps: generating refusal direction candidates, multi-objective search over refusal directions, and internalized tuning via adaptation. [...] compliance, while preserving utility, whereas others either induce only limited behavioral change or cause undesirable drift on Pdisallowed. We therefore seek a direction through a multi-objective optimization framework t…
Figure 3: Refusal rate and utility of single-domain safety control on LLaMA2-7B-Chat.
Figure 4: Refusal rate and utility of single-domain safety control on LLaMA3.1-8B-Instruct.
Figure 5: Refusal rate and utility of multi-domains safety control on LLaMA2-7B-Chat.
Figure 6: Refusal rate and utility of multi-domains safety control on LLaMA3.1-8B-Instruct.
Figure 7: Visualization of safety control in activation space via t-SNE. The background color denotes
Figure 8: Ablation study on OOD negative domain robustness for LLaMA3.1-8B-it. Values represent
Original abstract

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper proposes Palette, a modular framework for selectively relaxing safety refusals in LLMs and VLMs on authorized target domains. It identifies a refusal direction through multi-objective search, internalizes it via lightweight adaptation, and enables on-demand multi-domain control through independent learning followed by parameter merging. Experiments across four safety benchmarks, multiple model variants, LLMs, and VLMs are claimed to demonstrate precise control without loss of general utility or unintended interference in non-target domains.

Significance. If the experimental claims hold, the work provides a practical, efficient alternative to full realignment or inference-time steering for context-specific safety policies. The modular composition aspect could enable scalable customization for professional use cases while preserving baseline safety, addressing a clear limitation of current one-size-fits-all alignment approaches. The paper ships experimental validation across multiple benchmarks and model types, which strengthens the assessment if the methods and results are reproducible.

minor comments (3)
  1. [Abstract] The abstract asserts results across four benchmarks and multiple models, but the text provided supplies no quantitative tables, baselines, or error bars; ensure §4 or §5 includes these, with explicit comparison to steering and realignment baselines.
  2. [§3] Clarify the precise definition and search objective for the 'refusal direction' (mentioned as an invented entity in the axiom ledger); add a short methods subsection or equation in §3 to make the multi-objective search reproducible.
  3. [Experiments] The modular merging step is central to the on-demand claim; add a short ablation in the experiments section showing interference metrics when composing more than two domains.
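
One simple proxy for the interference requested in comment 3 is the pairwise cosine similarity between flattened per-domain weight deltas: a large absolute overlap suggests two controls edit the same directions and may conflict when merged. This is an assumed diagnostic, not a metric taken from the paper, and the three domains and their vectors are invented.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_overlap(deltas: dict[str, list[float]]) -> dict[tuple[str, str], float]:
    """Cosine similarity for every unordered pair of domains (names sorted)."""
    return {
        (a, b): cosine(deltas[a], deltas[b])
        for a, b in combinations(sorted(deltas), 2)
    }


# Hypothetical flattened deltas for three independently learned controls.
deltas = {
    "medical": [1.0, 0.0, 0.0],
    "legal": [0.0, 2.0, 0.0],
    "security": [3.0, 0.0, 4.0],
}
overlap = pairwise_overlap(deltas)
worst_pair = max(overlap, key=lambda pair: abs(overlap[pair]))
print(overlap, worst_pair)
```

Weight-space overlap is only a cheap screening signal; behavioral measurements on held-out safety and utility benchmarks would still be needed to confirm or rule out interference.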

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical significance of Palette for modular safety customization, and the recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical method relying on multi-objective search to identify a refusal direction, followed by lightweight adaptation and parameter merging for modular composition. No equations, derivations, or self-citations appear in the provided text that reduce any claimed outcome to a fitted input or prior result by construction. Central claims rest on experimental validation across benchmarks and models rather than mathematical self-definition or imported uniqueness theorems. This is self-contained against external benchmarks, consistent with the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that refusal behavior corresponds to an isolatable direction in representation space that can be independently learned and merged per domain; no free parameters or invented entities with external evidence are specified in the abstract.

axioms (1)
  • domain assumption: Refusal behavior in LLMs corresponds to an isolatable direction in internal representations that can be selectively modified without side effects.
    Central to the multi-objective search and internalization steps described.
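
To make the geometric content of this assumption concrete, the sketch below removes the component of an activation vector along a unit-normalized direction. It illustrates only the activation-space operation the axiom presupposes; according to the summary, Palette internalizes the change through adaptation rather than editing activations at inference time. The vectors and the `strength` parameter are hypothetical.

```python
import math


def unit(v: list[float]) -> list[float]:
    """Normalize v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def ablate(h: list[float], direction: list[float], strength: float = 1.0) -> list[float]:
    """Subtract strength * (h . r) * r from h, where r is the unit direction."""
    r = unit(direction)
    coeff = sum(a * b for a, b in zip(h, r))
    return [a - strength * coeff * b for a, b in zip(h, r)]


# Toy activation and candidate refusal direction (values are illustrative).
h = [2.0, 3.0, -1.0]
direction = [0.0, 4.0, 0.0]
h_ablated = ablate(h, direction)
residual = sum(a * b for a, b in zip(h_ablated, unit(direction)))
print(h_ablated, residual)
```

With `strength=1.0` the result has no component left along the direction, while the orthogonal components are untouched, which is the sense in which the direction is assumed to be isolatable.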
invented entities (1)
  • refusal direction (no independent evidence)
    purpose: Represents the internal model component responsible for refusal that can be targeted for domain-specific relaxation.
    Introduced as the key object of the multi-objective search; no independent evidence provided in abstract.

pith-pipeline@v0.9.1-grok · 5759 in / 1257 out tokens · 51845 ms · 2026-06-30T15:59:49.205938+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

    cs.AI 2026-06 unverdicted novelty 7.0

    SGDR enables stepwise skill reuse in web agents via sliding-window extraction, dual text-code representations, and state-grounded retrieval, delivering roughly 10% relative gains over baselines on WebArena.

Reference graph

Works this paper leans on

71 extracted references · 40 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

    Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

  2. [2]

    Mits: Enhanced tree search reasoning for llms via pointwise mutual information.arXiv preprint arXiv:2510.03632, 2025

    Jiaxi Li, Yucheng Shi, Jin Lu, and Ninghao Liu. Mits: Enhanced tree search reasoning for llms via pointwise mutual information.arXiv preprint arXiv:2510.03632, 2025

  3. [3]

    Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning.arXiv preprint arXiv:2502.03304, 2025

    Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, and Geng Yuan. Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning.arXiv preprint arXiv:2502.03304, 2025

  4. [4]

    Mitigating hallucination through theory-consistent symmetric multimodal preference optimization.arXiv preprint arXiv:2506.11712, 2025

    Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, and Liqiang Nie. Mit- igating hallucination through theory-consistent symmetric multimodal preference optimization. arXiv preprint arXiv:2506.11712, 2025

  5. [5]

    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023

  6. [6]

    Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

  7. [7]

    Peft-as-an-attack! jail- breaking language models during federated parameter-efficient fine-tuning.arXiv preprint arXiv:2411.19335, 2024

    Shenghui Li, Edith C-H Ngai, Fanghua Ye, and Thiemo V oigt. Peft-as-an-attack! jail- breaking language models during federated parameter-efficient fine-tuning.arXiv preprint arXiv:2411.19335, 2024

  8. [8]

    Emerging safety attack and defense in federated instruction tuning of large language models.arXiv preprint arXiv:2406.10630, 2024

    Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, and Siheng Chen. Emerging safety attack and defense in federated instruction tuning of large language models.arXiv preprint arXiv:2406.10630, 2024

  9. [9]

    Q-realign: Piggybacking realignment on quantization for safe and efficient llm deployment.arXiv preprint arXiv:2601.08089, 2026

    Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, and Geng Yuan. Q-realign: Piggybacking realignment on quantization for safe and efficient llm deployment.arXiv preprint arXiv:2601.08089, 2026

  10. [10]

    Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  11. [11]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  12. [12]

    Hybridflow: A flexible and efficient rlhf framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  13. [13]

    Controllable safety alignment: Inference-time adaptation to diverse safety requirements.arXiv preprint arXiv:2410.08968, 2024

    Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, and Benjamin Van Durme. Controllable safety alignment: Inference-time adaptation to diverse safety requirements.arXiv preprint arXiv:2410.08968, 2024

  14. [14]

    Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training

    Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, and Zhaopeng Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3149–3167, 2025

  15. [15]

    Controllable preference optimization: Toward controllable multi-objective alignment

    Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1437–1454, 2024. 10

  16. [16]

    Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

    Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

  17. [17]

    Beyond prompt engineering: Robust behavior control in llms via steering target atoms

    Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, and Ningyu Zhang. Beyond prompt engineering: Robust behavior control in llms via steering target atoms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23381–23399, 2025

  18. [18]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

  19. [19]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition.arXiv preprint arXiv:2312.06681, 2023

  20. [20]

    Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

  21. [21]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

  22. [22]

    A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

    Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

  23. [23]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  24. [24]

    Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

  25. [25]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024

  26. [26]

    The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning.arXiv preprint arXiv:2403.03218, 2024

  27. [27]

    Mm-safetybench: A benchmark for safety evaluation of multimodal large language models

    Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer, 2024

  28. [28]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  29. [29]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  30. [30]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 11

  31. [31]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  32. [32]

    AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

  33. [33]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  34. [34]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  35. [35]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

  36. [36]

    Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

  37. [37]

    The unlocking spell on base llms: Rethinking alignment via in-context learning.arXiv preprint arXiv:2312.01552, 2023

    Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base llms: Rethinking alignment via in-context learning.arXiv preprint arXiv:2312.01552, 2023

  38. [38]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

  39. [39]

    Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles

    Ryan Louie, Ananjan Nandi, William Fang, Cheng Chang, Emma Brunskill, and Diyi Yang. Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10570–10603, 2024

  40. [40]

    From distributional to overton pluralism: Investi- gating large language model alignment

    Thom Lake, Eunsol Choi, and Greg Durrett. From distributional to overton pluralism: Investi- gating large language model alignment. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6794–6814, 2025

  41. [41]

    Pace: Parsimonious concept engineering for large language models

    Jinqi Luo, Tianjiao Ding, Kwan H Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison- Burch, and René Vidal. Pace: Parsimonious concept engineering for large language models. Advances in Neural Information Processing Systems, 37:99347–99381, 2024

  42. [42]

    The art of saying no: Contextual noncompliance in language models.Advances in Neural Information Processing Systems, 37:49706–49748, 2024

    Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, et al. The art of saying no: Contextual noncompliance in language models.Advances in Neural Information Processing Systems, 37:49706–49748, 2024

  43. [43]

    Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36:59008–59033, 2023

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36:59008–59033, 2023

  44. [44]

    Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

    Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024. 12

  45. [45]

    Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment

    Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, and Tat-Seng Chua. Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11010–11031, 2025

  46. [46]

    Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models

    Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

  47. [47]

    Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

    Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

  48. [48]

    Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

    Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

  49. [49]

    Warm: On the benefits of weight averaged reward models.arXiv preprint arXiv:2401.12187, 2024

    Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. Warm: On the benefits of weight averaged reward models.arXiv preprint arXiv:2401.12187, 2024

  50. [50]

    Representation surgery for multi-task model merging.arXiv preprint arXiv:2402.02705, 2024

    Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging.arXiv preprint arXiv:2402.02705, 2024

  51. [51]

    arXiv preprint arXiv:2405.20947 , year=

    Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947, 2024

  52. [52]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  53. [53]

    TrustLLM: Trustworthiness in Large Language Models

    Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

  54. [54]

    Mitigating exaggerated safety in large language models.arXiv preprint arXiv:2405.05418, 2024

    Ruchira Ray and Ruchi Bhalani. Mitigating exaggerated safety in large language models.arXiv preprint arXiv:2405.05418, 2024

  55. [55]

    Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning

    Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, and Ruoxi Jia. Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning. arXiv preprint arXiv:2507.04250, 2025

  56. [56]

    Understanding and mitigating over-refusal for large language models via safety representation.arXiv preprint arXiv:2511.19009, 2025

    Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, and Wen Jiang. Understanding and mitigating over-refusal for large language models via safety representation.arXiv preprint arXiv:2511.19009, 2025

  57. [57]

    Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation

    Xinpeng Wang, Chengzhi Hu, Paul Röttger, and Barbara Plank. Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation. arXiv preprint arXiv:2410.03415, 2024

  58. [58]

    The hidden dimensions of llm alignment: A multi-dimensional safety analysis

    Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional safety analysis. arXiv e-prints, pages arXiv–2502, 2025

  59. [59]

    Alphasteer: Learning refusal steering with principled null-space constraint

    Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. Alphasteer: Learning refusal steering with principled null-space constraint. arXiv preprint arXiv:2506.07022, 2025

  60. [60]

    Mitigating content effects on reasoning in language models through fine-grained activation steering

    Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, and André Freitas. Mitigating content effects on reasoning in language models through fine-grained activation steering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33314–33322, 2026

  61. [61]

    Steering away from harm: An adaptive approach to defending vision language model against jailbreaks

    Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29947–29957, 2025

  62. [62]

    Steering language model refusal with sparse autoencoders

    Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296, 2024

  63. [63]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42, 2025

  64. [64]

    Hypersteer: Activation steering at scale with hypernetworks

    Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Benjamin Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks. ArXiv, abs/2506.03292, 2025

  65. [65]

    Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering

    Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6832–684...

  66. [66]

    Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

    Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. ArXiv, abs/2310.06987, 2023

  67. [67]

    HarmBench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning, 2024

  68. [68]

    Stanford alpaca: An instruction-following llama model, 2023

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

  69. [69]

    SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

    Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024

  70. [70]

    Mutual effort for efficiency: A similarity-based token pruning for vision transformers in self-supervised learning

    Sheng Li, Qitao Tan, Yue Dai, Zhenglun Kong, Tianyu Wang, Jun Liu, Ao Li, Ninghao Liu, Yufei Ding, Xulong Tang, et al. Mutual effort for efficiency: A similarity-based token pruning for vision transformers in self-supervised learning. In The Thirteenth International Conference on Learning Representations, 2025

  71. [71]

    Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

    Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19803–19813, 2025