Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

Arash Akbari; Arman Akbari; Geng Yuan; Jin Lu; Lingzi Hong; Qitao Tan; Xiaoming Zhai; Xiaoying Song; Yanzhi Wang; Zhen Xiang

arxiv: 2605.24154 · v1 · pith:KCI4NETTnew · submitted 2026-05-22 · 💻 cs.AI · cs.SE


Pith reviewed 2026-06-30 15:59 UTC · model grok-4.3


keywords: safety alignment · refusal behavior · modular adaptation · parameter merging · large language models · controllable safety · professional domains · on-demand authorization


The pith

Palette identifies a refusal direction in LLMs and internalizes domain-specific relaxations through lightweight adaptation and parameter merging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that foundation models can move beyond a uniform refusal policy by selectively relaxing safety behavior only for authorized professional contexts. It does so by locating a refusal direction through multi-objective search, embedding the change via efficient adaptation, and then combining multiple such changes through parameter merging. A reader would care because this approach promises to unlock legitimate professional uses of models in fields such as medicine or law while retaining strict refusals for ordinary users. The experiments test the claim on four safety benchmarks, several model families, and both language and vision-language models. Success would mean models become practical for specialized settings without repeated full retraining or added inference cost.

Core claim

Palette shows that a refusal direction located by multi-objective search can be internalized into an LLM or VLM via lightweight adaptation and then composed across domains by parameter merging, producing precise on-demand safety relaxation for authorized target domains while leaving general safety and utility unchanged.
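
As a rough illustration of what a multi-objective search over candidate directions might look like, the sketch below filters candidates to a Pareto front and then applies a safety floor. The three scores, the candidate names, and the `min_disallowed_refusal` threshold are all hypothetical; the page does not describe the paper's actual objectives or search procedure.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical scores per candidate direction (higher is better for all three):
#   [0] target compliance: share of authorized-domain prompts answered
#   [1] disallowed refusal: share of non-target harmful prompts still refused
#   [2] utility: score on a general-capability benchmark
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    front = []
    for name, score in candidates.items():
        others = (s for n, s in candidates.items() if n != name)
        if not any(dominates(other, score) for other in others):
            front.append(name)
    return sorted(front)


def select(candidates: Dict[str, Scores],
           min_disallowed_refusal: float = 0.9) -> Optional[str]:
    """Among non-dominated candidates that keep refusing disallowed prompts,
    pick the one with the highest target compliance (an assumed tie-break)."""
    ok = [n for n in pareto_front(candidates)
          if candidates[n][1] >= min_disallowed_refusal]
    return max(ok, key=lambda n: candidates[n][0]) if ok else None


candidates = {
    "dir_a": (0.90, 0.95, 0.70),
    "dir_b": (0.60, 0.99, 0.71),
    "dir_c": (0.55, 0.90, 0.65),  # dominated by dir_a and dir_b
    "dir_d": (0.95, 0.40, 0.70),  # compliant, but drifts on disallowed prompts
}
print(pareto_front(candidates))  # ['dir_a', 'dir_b', 'dir_d']
print(select(candidates))        # dir_a
```

The floor on disallowed refusal is what rules out `dir_d` here: it sits on the Pareto front but relaxes safety outside the target domain.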

What carries the argument

Refusal direction identified by multi-objective search, internalized by lightweight adaptation, and composed by parameter merging.

If this is right

Domain-specific safety controls can be learned once and then activated or deactivated on demand without retraining the base model.
The same modular controls work across multiple model scales and both pure language and vision-language models.
General capabilities and performance on non-target safety tasks remain intact after adaptation and merging.
Foundation models can be configured for diverse professional requirements through composition rather than separate training runs.
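
The "learn once, activate on demand" idea can be sketched as adding stored per-domain weight deltas onto a shared base. This is a minimal sketch assuming plain additive (task-vector style) merging; the page does not state which merging rule the paper uses, and the parameter name `w` and the domain names are hypothetical.

```python
from typing import Dict, Iterable, List

Params = Dict[str, List[float]]  # parameter name -> flattened weights


def delta(adapted: Params, base: Params) -> Params:
    """Per-domain control: the change an adaptation made to the base weights."""
    return {k: [a - b for a, b in zip(adapted[k], base[k])] for k in base}


def merge(base: Params, deltas: Iterable[Params], scale: float = 1.0) -> Params:
    """Add the selected domain deltas onto a copy of the base weights."""
    merged = {k: list(v) for k, v in base.items()}
    for d in deltas:
        for k, values in d.items():
            merged[k] = [m + scale * x for m, x in zip(merged[k], values)]
    return merged


base = {"w": [1.0, 2.0, 3.0]}
medical = delta({"w": [1.5, 2.0, 3.0]}, base)  # hypothetical adapted weights
legal = delta({"w": [1.0, 2.0, 2.0]}, base)    # hypothetical adapted weights

print(merge(base, [medical, legal]))  # both domains authorized
print(merge(base, []))                # no domain authorized: base unchanged
```

Because the base is copied rather than mutated, deactivating a domain amounts to merging again without its delta.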

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The merging technique could be tested on alignment properties other than safety, such as output style or domain knowledge boundaries.
If merging scales cleanly, organizations might maintain one base model and swap small control sets instead of hosting many specialized instances.
Real-world deployment would need checks for whether merged controls remain stable when users combine them in unexpected ways.

Load-bearing premise

A direction found by search can be added to the model through adaptation and merging without creating interference or unintended safety loss in non-target domains.

What would settle it

Merging two or more domain-specific adaptations produces measurable refusal leakage or accuracy drop on a held-out general safety benchmark that was not observed before merging.
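
A falsification check of this kind could be as simple as comparing refusal rates on the same held-out prompts before and after merging. The sketch below assumes a keyword-based refusal detector, which is a stand-in: the marker list is hypothetical, and real evaluations often rely on a judge model instead.

```python
from typing import List

# Hypothetical refusal markers, used only for illustration.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")


def is_refusal(response: str) -> bool:
    """Crude check: does the response contain any refusal marker?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: List[str]) -> float:
    """Share of responses flagged as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in responses) / len(responses)


def leakage(before: List[str], after: List[str]) -> float:
    """Drop in refusal rate on the same held-out prompts.
    A positive value means the merged model refuses less than before."""
    return refusal_rate(before) - refusal_rate(after)


before = ["I can't help with that."] * 4
after = ["I can't help with that."] * 3 + ["Sure, here is how to do it."]
print(leakage(before, after))  # 0.25
```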

Figures

Figures reproduced from arXiv: 2605.24154 by Arash Akbari, Arman Akbari, Geng Yuan, Jin Lu, Lingzi Hong, Qitao Tan, Xiaoming Zhai, Xiaoying Song, Yanzhi Wang, Zhen Xiang.

**Figure 2.** Overview of PALETTE, which mainly consists of three steps: generating refusal direction candidates, multi-objective search over refusal directions, and internalized tuning via adaptation.

**Figure 3.** Refusal rate and utility of single-domain safety control on LLaMA2-7B-Chat.

**Figure 4.** Refusal rate and utility of single-domain safety control on LLaMA3.1-8B-Instruct.

**Figure 5.** Refusal rate and utility of multi-domain safety control on LLaMA2-7B-Chat.

**Figure 6.** Refusal rate and utility of multi-domain safety control on LLaMA3.1-8B-Instruct.

**Figure 7.** Visualization of safety control in activation space via t-SNE. The background color denotes …

**Figure 8.** Ablation study on OOD negative domain robustness for LLaMA3.1-8B-it. Values represent …

Abstract

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Palette describes a modular way to relax LLM safety refusals for authorized domains via multi-objective search and parameter merging, but the abstract supplies no numbers, baselines, or method details so the claims cannot be checked.


The main thing to know is that this paper introduces Palette as a framework to move past one-size-fits-all safety alignment. It finds a refusal direction through multi-objective search, internalizes it with lightweight adaptation, and then supports on-demand multi-domain use by learning controls separately and composing them with parameter merging.

The combination of those steps for controllable, efficient relaxation looks like the actual new piece. It targets a practical deployment issue where general refusals block legitimate professional requests while full retraining or inference steering each have their own costs.

The paper does a reasonable job stating the problem and sketching an engineering path that aims to keep general safety intact. The mention of testing on both LLMs and VLMs plus multiple model variants is a positive sign if the results are real.

The soft spot is the complete absence of quantitative evidence. The abstract claims success across four benchmarks without sacrificing utility, yet gives no numbers, no baselines, no error analysis, and no check on whether merged parameters interfere across domains. The key assumption that the searched direction can be internalized and merged cleanly is stated but not shown. Without those details the central claim cannot be verified from the abstract alone.

This is for readers working on applied alignment and LLM deployment in regulated settings. A practitioner looking for modular control ideas might find the framing useful even if the execution is not yet demonstrated.

It deserves peer review because the problem is relevant and the proposed direction is concrete; a referee can check whether the full experiments actually support the claims or whether the merging step introduces hidden costs.

Referee Report

0 major / 3 minor

Summary. The paper proposes Palette, a modular framework for selectively relaxing safety refusals in LLMs and VLMs on authorized target domains. It identifies a refusal direction through multi-objective search, internalizes it via lightweight adaptation, and enables on-demand multi-domain control through independent learning followed by parameter merging. Experiments across four safety benchmarks, multiple model variants, LLMs, and VLMs are claimed to demonstrate precise control without loss of general utility or unintended interference in non-target domains.

Significance. If the experimental claims hold, the work provides a practical, efficient alternative to full realignment or inference-time steering for context-specific safety policies. The modular composition aspect could enable scalable customization for professional use cases while preserving baseline safety, addressing a clear limitation of current one-size-fits-all alignment approaches. The paper reports experimental validation across multiple benchmarks and model types, which strengthens the assessment if the methods and results are reproducible.

minor comments (3)

[Abstract] The abstract asserts results across four benchmarks and multiple models but the provided text supplies no quantitative tables, baselines, or error bars; ensure §4 or §5 includes these with explicit comparison to steering and realignment baselines.
[§3] Clarify the precise definition and search objective for the 'refusal direction' (mentioned as an invented entity in the axiom ledger); add a short methods subsection or equation in §3 to make the multi-objective search reproducible.
[Experiments] The modular merging step is central to the on-demand claim; add a short ablation in the experiments section showing interference metrics when composing more than two domains.
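
One inexpensive proxy that such an ablation could report alongside behavioral results is the pairwise cosine overlap between domain deltas. This is only a sketch of a possible diagnostic, not a metric taken from the paper; the domain names and vectors are hypothetical, and low overlap does not by itself guarantee that merged controls behave independently.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pairwise_overlap(deltas: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of domain deltas."""
    return {(a, b): cosine(deltas[a], deltas[b])
            for a, b in combinations(sorted(deltas), 2)}


# Hypothetical flattened weight deltas for three domains.
deltas = {
    "medical": [1.0, 0.0, 0.0],
    "legal": [0.0, 1.0, 0.0],
    "security": [1.0, 1.0, 0.0],
}
for pair, value in pairwise_overlap(deltas).items():
    print(pair, round(value, 3))
```

Pairs with high overlap would be the natural first place to look for refusal leakage after merging.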

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the practical significance of Palette for modular safety customization, and the recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes an empirical method relying on multi-objective search to identify a refusal direction, followed by lightweight adaptation and parameter merging for modular composition. No equations, derivations, or self-citations appear in the provided text that reduce any claimed outcome to a fitted input or prior result by construction. Central claims rest on experimental validation across benchmarks and models rather than mathematical self-definition or imported uniqueness theorems. This is self-contained against external benchmarks, consistent with the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that refusal behavior corresponds to an isolatable direction in representation space that can be independently learned and merged per domain; no free parameters or invented entities with external evidence are specified in the abstract.

axioms (1)

Domain assumption: Refusal behavior in LLMs corresponds to an isolatable direction in internal representations that can be selectively modified without side effects.
Central to the multi-objective search and internalization steps described.
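
To make the assumption concrete, the sketch below estimates a direction as the normalized difference of mean activations between refused and complied prompts, then projects it out of a hidden state. Difference-of-means is a common estimator in prior work on refusal directions and is used here as an assumption; the page does not say how Palette generates its candidates, and the two-dimensional activations are toy values.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit(v: Vector) -> Vector:
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(refused: List[Vector], complied: List[Vector]) -> Vector:
    """Unit vector from the complied mean toward the refused mean."""
    return unit([a - b for a, b in zip(mean(refused), mean(complied))])


def ablate(hidden: Vector, direction: Vector) -> Vector:
    """Remove the component of a hidden state along a unit direction."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]


# Toy activations: refused prompts differ from complied ones only on axis 0.
refused = [[2.0, 1.0], [4.0, 1.0]]
complied = [[0.0, 1.0], [0.0, 1.0]]

direction = refusal_direction(refused, complied)
print(direction)                      # [1.0, 0.0]
print(ablate([5.0, 2.0], direction))  # [0.0, 2.0]
```

The "without side effects" part of the assumption corresponds to the second coordinate being untouched here; in a real model that separation is an empirical question.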

invented entities (1)

refusal direction (no independent evidence)
purpose: Represents the internal model component responsible for refusal that can be targeted for domain-specific relaxation.
Introduced as the key object of the multi-objective search; no independent evidence provided in abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval
cs.AI · 2026-06 · unverdicted · novelty 7.0

SGDR enables stepwise skill reuse in web agents via sliding-window extraction, dual text-code representations, and state-grounded retrieval, delivering roughly 10% relative gains over baselines on WebArena.

Reference graph

Works this paper leans on

71 extracted references · 40 canonical work pages · cited by 1 Pith paper · 17 internal anchors

[1]

Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. Vision-language models for vision tasks: A survey.IEEE transactions on pattern analysis and machine intelligence, 46(8):5625– 5644, 2024

2024
[2]

Mits: Enhanced tree search reasoning for llms via pointwise mutual information.arXiv preprint arXiv:2510.03632, 2025

Jiaxi Li, Yucheng Shi, Jin Lu, and Ninghao Liu. Mits: Enhanced tree search reasoning for llms via pointwise mutual information.arXiv preprint arXiv:2510.03632, 2025

work page arXiv 2025
[3]

Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning.arXiv preprint arXiv:2502.03304, 2025

Qitao Tan, Jun Liu, Zheng Zhan, Caiwei Ding, Yanzhi Wang, Xiaolong Ma, Jaewoo Lee, Jin Lu, and Geng Yuan. Harmony in divergence: Towards fast, accurate, and memory-efficient zeroth-order llm fine-tuning.arXiv preprint arXiv:2502.03304, 2025

work page arXiv 2025
[4]

Mitigating hallucination through theory-consistent symmetric multimodal preference optimization.arXiv preprint arXiv:2506.11712, 2025

Wenqi Liu, Xuemeng Song, Jiaxi Li, Yinwei Wei, Na Zheng, Jianhua Yin, and Liqiang Nie. Mit- igating hallucination through theory-consistent symmetric multimodal preference optimization. arXiv preprint arXiv:2506.11712, 2025

work page arXiv 2025
[5]

Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[6]

Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

work page arXiv 2023
[7]

Peft-as-an-attack! jail- breaking language models during federated parameter-efficient fine-tuning.arXiv preprint arXiv:2411.19335, 2024

Shenghui Li, Edith C-H Ngai, Fanghua Ye, and Thiemo V oigt. Peft-as-an-attack! jail- breaking language models during federated parameter-efficient fine-tuning.arXiv preprint arXiv:2411.19335, 2024

work page arXiv 2024
[8]

Emerging safety attack and defense in federated instruction tuning of large language models.arXiv preprint arXiv:2406.10630, 2024

Rui Ye, Jingyi Chai, Xiangrui Liu, Yaodong Yang, Yanfeng Wang, and Siheng Chen. Emerging safety attack and defense in federated instruction tuning of large language models.arXiv preprint arXiv:2406.10630, 2024

work page arXiv 2024
[9]

Q-realign: Piggybacking realignment on quantization for safe and efficient llm deployment.arXiv preprint arXiv:2601.08089, 2026

Qitao Tan, Xiaoying Song, Ningxi Cheng, Ninghao Liu, Xiaoming Zhai, Lingzi Hong, Yanzhi Wang, Zhen Xiang, and Geng Yuan. Q-realign: Piggybacking realignment on quantization for safe and efficient llm deployment.arXiv preprint arXiv:2601.08089, 2026

work page arXiv 2026
[10]

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

2022
[11]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Hybridflow: A flexible and efficient rlhf framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

2025
[13]

Controllable safety alignment: Inference-time adaptation to diverse safety requirements.arXiv preprint arXiv:2410.08968, 2024

Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, and Benjamin Van Durme. Controllable safety alignment: Inference-time adaptation to diverse safety requirements.arXiv preprint arXiv:2410.08968, 2024

work page arXiv 2024
[14]

Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training

Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, and Zhaopeng Tu. Refuse whenever you feel unsafe: Improving safety in llms via decoupled refusal training. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3149–3167, 2025

2025
[15]

Controllable preference optimization: Toward controllable multi-objective alignment

Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Zexu Sun, Bowen Sun, Huimin Chen, Ruobing Xie, Jie Zhou, Yankai Lin, et al. Controllable preference optimization: Toward controllable multi-objective alignment. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1437–1454, 2024. 10

2024
[16]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering.arXiv preprint arXiv:2409.05907, 2024

work page arXiv 2024
[17]

Beyond prompt engineering: Robust behavior control in llms via steering target atoms

Mengru Wang, Ziwen Xu, Shengyu Mao, Shumin Deng, Zhaopeng Tu, Huajun Chen, and Ningyu Zhang. Beyond prompt engineering: Robust behavior control in llms via steering target atoms. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23381–23399, 2025

2025
[18]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Steering Llama 2 via Contrastive Activation Addition

Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition.arXiv preprint arXiv:2312.06681, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

2024
[21]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks.Advances in Neural Information Processing Systems, 37:125416–125440, 2024

2024
[23]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset.arXiv preprint arXiv:2307.04657, 2023

work page arXiv 2023
[25]

Jailbreakbench: An open robustness benchmark for jailbreaking large language models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Advances in Neural Information Processing Systems, 37:55005–55029, 2024

2024
[26]

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, et al. The wmdp benchmark: Measuring and reducing malicious use with unlearning.arXiv preprint arXiv:2403.03218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Mm-safetybench: A benchmark for safety evaluation of multimodal large language models

Xin Liu, Yichen Zhu, Jindong Gu, Yunshi Lan, Chao Yang, and Yu Qiao. Mm-safetybench: A benchmark for safety evaluation of multimodal large language models. InEuropean Conference on Computer Vision, pages 386–403. Springer, 2024

2024
[28]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 11

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models.arXiv preprint arXiv:2310.04451, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[34]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[35]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

2024
[36]

Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? InEuropean conference on computer vision, pages 216–233. Springer, 2024

2024
[37]

The unlocking spell on base llms: Rethinking alignment via in-context learning.arXiv preprint arXiv:2312.01552, 2023

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. The unlocking spell on base llms: Rethinking alignment via in-context learning.arXiv preprint arXiv:2312.01552, 2023

work page arXiv 2023
[38]

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? InProceedings of the 2022 conference on empirical methods in natural language processing, pages 11048–11064, 2022

2022
[39]

Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles

Ryan Louie, Ananjan Nandi, William Fang, Cheng Chang, Emma Brunskill, and Diyi Yang. Roleplay-doh: Enabling domain-experts to create llm-simulated patients via eliciting and adhering to principles. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10570–10603, 2024

2024
[40]

From distributional to overton pluralism: Investi- gating large language model alignment

Thom Lake, Eunsol Choi, and Greg Durrett. From distributional to overton pluralism: Investi- gating large language model alignment. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6794–6814, 2025

2025
[41]

Pace: Parsimonious concept engineering for large language models

Jinqi Luo, Tianjiao Ding, Kwan H Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison- Burch, and René Vidal. Pace: Parsimonious concept engineering for large language models. Advances in Neural Information Processing Systems, 37:99347–99381, 2024

2024
[42]

The art of saying no: Contextual noncompliance in language models.Advances in Neural Information Processing Systems, 37:49706–49748, 2024

Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, et al. The art of saying no: Contextual noncompliance in language models.Advances in Neural Information Processing Systems, 37:49706–49748, 2024

2024
[43]

Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36:59008–59033, 2023

Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training.Advances in Neural Information Processing Systems, 36:59008–59033, 2023

2023
[44]

Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024. 12

2024
[45]

Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment

Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, and Tat-Seng Chua. Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11010–11031, 2025

2025
[46]

Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models

Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

2025
[47]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

2023
[48]

Personalized soups: Personalized large language model alignment via post-hoc parameter merging

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564, 2023

2023
[49]

Warm: On the benefits of weight averaged reward models

Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. Warm: On the benefits of weight averaged reward models. arXiv preprint arXiv:2401.12187, 2024

2024
[50]

Representation surgery for multi-task model merging

Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging. arXiv preprint arXiv:2402.02705, 2024

2024
[51]

Or-bench: An over-refusal benchmark for large language models

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models. arXiv preprint arXiv:2405.20947, 2024

2024
[52]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

2024
[53]

TrustLLM: Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

2024
[54]

Mitigating exaggerated safety in large language models

Ruchira Ray and Ruchi Bhalani. Mitigating exaggerated safety in large language models. arXiv preprint arXiv:2405.05418, 2024

2024
[55]

Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning

Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, and Ruoxi Jia. Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning. arXiv preprint arXiv:2507.04250, 2025

2025
[56]

Understanding and mitigating over-refusal for large language models via safety representation

Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, and Wen Jiang. Understanding and mitigating over-refusal for large language models via safety representation. arXiv preprint arXiv:2511.19009, 2025

2025
[57]

Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation

Xinpeng Wang, Chengzhi Hu, Paul Röttger, and Barbara Plank. Surgical, cheap, and flexible: Mitigating false refusal in language models via single vector ablation. arXiv preprint arXiv:2410.03415, 2024

2024
[58]

The hidden dimensions of llm alignment: A multi-dimensional safety analysis

Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional safety analysis. arXiv e-prints, pages arXiv–2502, 2025

2025
[59]

Alphasteer: Learning refusal steering with principled null-space constraint

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. Alphasteer: Learning refusal steering with principled null-space constraint. arXiv preprint arXiv:2506.07022, 2025

2025
[60]

Mitigating content effects on reasoning in language models through fine-grained activation steering

Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, and André Freitas. Mitigating content effects on reasoning in language models through fine-grained activation steering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33314–33322, 2026

2026
[61]

Steering away from harm: An adaptive approach to defending vision language model against jailbreaks

Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29947–29957, 2025

2025
[62]

Steering language model refusal with sparse autoencoders

Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296, 2024

2024
[63]

Jailbreaking black box large language models in twenty queries

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In 2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42, 2025

2025
[64]

Hypersteer: Activation steering at scale with hypernetworks

Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Benjamin Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks. ArXiv, abs/2506.03292, 2025

2025
[65]

Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering

Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6832–684...

2025
[66]

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source llms via exploiting generation. ArXiv, abs/2310.06987, 2023

2023
[67]

Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning, 2024

2024
[68]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

2023
[69]

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. Sparsevlm: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024

2024
[70]

Mutual effort for efficiency: A similarity-based token pruning for vision transformers in self-supervised learning

Sheng Li, Qitao Tan, Yue Dai, Zhenglun Kong, Tianyu Wang, Jun Liu, Ao Li, Ninghao Liu, Yufei Ding, Xulong Tang, et al. Mutual effort for efficiency: A similarity-based token pruning for vision transformers in self-supervised learning. In The Thirteenth International Conference on Learning Representations, 2025

2025
[71]

Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model

Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19803–19813, 2025

2025

[44] [44]

Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization

Zhanhui Zhou, Jie Liu, Jing Shao, Xiangyu Yue, Chao Yang, Wanli Ouyang, and Yu Qiao. Beyond one-preference-fits-all alignment: Multi-objective direct preference optimization. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10586–10613, 2024. 12

2024

[45] [45]

Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment

Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, and Tat-Seng Chua. Self-improvement towards pareto optimality: Mitigating preference conflicts in multi-objective alignment. InFindings of the Association for Computational Linguistics: ACL 2025, pages 11010–11031, 2025

2025

[46] [46]

Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models

Chengao Li, Hanyu Zhang, Yunkun Xu, Hongyan Xue, Xiang Ao, and Qing He. Gradient- adaptive policy optimization: Towards multi-objective alignment of large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11214–11232, 2025

2025

[47] [47]

Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

Alexandre Rame, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards.Advances in Neural Information Processing Systems, 36:71095–71134, 2023

2023

[48] [48]

Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Per- sonalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564, 2023

work page arXiv 2023

[49] [49]

Warm: On the benefits of weight averaged reward models.arXiv preprint arXiv:2401.12187, 2024

Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. Warm: On the benefits of weight averaged reward models.arXiv preprint arXiv:2401.12187, 2024

work page arXiv 2024

[50] [50]

Representation surgery for multi-task model merging.arXiv preprint arXiv:2402.02705, 2024

Enneng Yang, Li Shen, Zhenyi Wang, Guibing Guo, Xiaojun Chen, Xingwei Wang, and Dacheng Tao. Representation surgery for multi-task model merging.arXiv preprint arXiv:2402.02705, 2024

work page arXiv 2024

[51] [51]

arXiv preprint arXiv:2405.20947 , year=

Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models.arXiv preprint arXiv:2405.20947, 2024

work page arXiv 2024

[52] [52]

Xstest: A test suite for identifying exaggerated safety behaviours in large language models

Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

2024

[53] [53]

TrustLLM: Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[54] [54]

Mitigating exaggerated safety in large language models.arXiv preprint arXiv:2405.05418, 2024

Ruchira Ray and Ruchi Bhalani. Mitigating exaggerated safety in large language models.arXiv preprint arXiv:2405.05418, 2024

work page arXiv 2024

[55] [55]

Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning

Mahavir Dabas, Si Chen, Charles Fleming, Ming Jin, and Ruoxi Jia. Just enough shifts: Mitigating over-refusal in aligned language models with targeted representation fine-tuning. arXiv preprint arXiv:2507.04250, 2025

work page arXiv 2025

[56] [56]

Understanding and mitigating over-refusal for large language models via safety representation.arXiv preprint arXiv:2511.19009, 2025

Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, and Wen Jiang. Understanding and mitigating over-refusal for large language models via safety representation.arXiv preprint arXiv:2511.19009, 2025

work page arXiv 2025

[57] [57]

Surgical, cheap, and flexi- ble: Mitigating false refusal in language models via single vector ablation.arXiv preprint arXiv:2410.03415, 2024

Xinpeng Wang, Chengzhi Hu, Paul Röttger, and Barbara Plank. Surgical, cheap, and flexi- ble: Mitigating false refusal in language models via single vector ablation.arXiv preprint arXiv:2410.03415, 2024

work page arXiv 2024

[58] [58]

The hidden dimensions of llm alignment: A multi-dimensional safety analysis.arXiv e-prints, pages arXiv–2502, 2025

Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional safety analysis.arXiv e-prints, pages arXiv–2502, 2025

2025

[59] [59]

Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025

Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025. 13

work page arXiv 2025

[60] [60]

Mitigating content effects on reasoning in language models through fine-grained activation steering

Marco Valentino, Geonhee Kim, Dhairya Dalal, Zhixue Zhao, and André Freitas. Mitigating content effects on reasoning in language models through fine-grained activation steering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33314–33322, 2026

2026

[61] [61]

Steering away from harm: An adaptive approach to defending vision language model against jailbreaks

Han Wang, Gang Wang, and Huan Zhang. Steering away from harm: An adaptive approach to defending vision language model against jailbreaks. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29947–29957, 2025

2025

[62] [62]

Steering language model refusal with sparse autoencoders.arXiv preprint arXiv:2411.11296, 2024

Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders.arXiv preprint arXiv:2411.11296, 2024

work page arXiv 2024

[63] [63]

Pappas, and Eric Wong

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42, 2025

2025

[64] Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Benjamin Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks. arXiv preprint arXiv:2506.03292, 2025

[65] Xinyu Tang, Xiaolei Wang, Zhihao Lv, Yingqian Min, Wayne Xin Zhao, Binbin Hu, Ziqi Liu, and Zhiqiang Zhang. Unlocking general long chain-of-thought reasoning capabilities of large language models via representation engineering. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6832–684... 2025

[66] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987, 2023

[67] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. In International Conference on Machine Learning, 2024

[68] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023

[69] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual token sparsification for efficient vision-language model inference. arXiv preprint arXiv:2410.04417, 2024

[70] Sheng Li, Qitao Tan, Yue Dai, Zhenglun Kong, Tianyu Wang, Jun Liu, Ao Li, Ninghao Liu, Yufei Ding, Xulong Tang, et al. Mutual effort for efficiency: A similarity-based token pruning for vision transformers in self-supervised learning. In The Thirteenth International Conference on Learning Representations, 2025

[71] Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, et al. Topv: Compatible token pruning with inference time optimization for fast and low-memory multimodal vision language model. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19803–19813, 2025