arXiv: 2606.12342 · v1 · pith:JOOMBDGLnew · submitted 2026-06-10 · 💻 cs.CL · cs.AI · cs.ET · cs.LG

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

Pith reviewed 2026-06-27 10:00 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.ET · cs.LG
keywords safety alignment · inference-time methods · logit mixing · cross-vocabulary transfer · LLM defense · refusal enhancement · adversarial robustness

The pith

Safety alignment can be transferred between large language models at inference time even when they use different vocabularies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

ALIGNBEAM transfers safety from an anchor model to a target model by translating the anchor's logits into the target's vocabulary at each decoding step. Multiple candidate continuations are generated through this mixing process, and a small judge model selects the safest one. The method requires no weight changes or retraining on either model. Experiments across cross-vocabulary and same-vocabulary pairs show increased refusal rates on adversarial prompts while task accuracy remains largely intact. The safety-utility balance can be adjusted at deployment by varying the number of candidates.

Core claim

ALIGNBEAM enables inference-time transfer of safety alignment between models with incompatible vocabularies by translating anchor logits token-by-token into the target vocabulary at each decoding step and using a small LLM judge to select the safest among K candidate continuations, without modifying any model weights.

What carries the argument

Cross-vocabulary logit mixing, which converts anchor model logits into the target model's vocabulary token-by-token during decoding before judge selection among K beams.
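The mixing step described above can be pictured with a small sketch. The review notes that the exact translation procedure is not spelled out here, so everything below is an assumption for illustration rather than the paper's method: tokens are matched by identical surface string, target tokens with no anchor counterpart keep the target's own score, and `alpha` is a hypothetical mixing weight. The names `build_token_map` and `mix_logits` are likewise made up for this example.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax for a 1-D array."""
    shifted = x - np.max(x)
    return shifted - np.log(np.sum(np.exp(shifted)))

def build_token_map(anchor_vocab, target_vocab):
    """For each anchor token id, the target id with the same string, else -1."""
    target_index = {tok: i for i, tok in enumerate(target_vocab)}
    return np.array([target_index.get(tok, -1) for tok in anchor_vocab])

def mix_logits(target_logits, anchor_logits, token_map, alpha=0.5):
    """Return mixed log-probabilities over the target vocabulary.

    alpha is a hypothetical mixing weight: 0 ignores the anchor, 1 trusts it
    fully on every token that both vocabularies share.
    """
    target_logp = log_softmax(np.asarray(target_logits, dtype=float))
    anchor_logp = log_softmax(np.asarray(anchor_logits, dtype=float))
    mixed = target_logp.copy()  # unmatched target tokens keep the target's score
    matched = token_map >= 0
    tgt_ids = token_map[matched]
    mixed[tgt_ids] = (1 - alpha) * target_logp[tgt_ids] + alpha * anchor_logp[matched]
    return log_softmax(mixed)  # renormalise after mixing

# Toy vocabularies and logits, made up for illustration.
anchor_vocab = ["I", "cannot", "Sure", "help"]
target_vocab = ["Sure", "I", "can", "cannot"]
token_map = build_token_map(anchor_vocab, target_vocab)

target_logits = [3.0, 1.0, 0.5, 0.0]   # target alone prefers "Sure"
anchor_logits = [2.0, 3.0, -2.0, 0.0]  # anchor alone prefers "cannot"

mixed = mix_logits(target_logits, anchor_logits, token_map, alpha=0.8)
print(target_vocab[int(np.argmax(mixed))])
```

In this toy setup the target model alone would start with "Sure", while the mixed distribution at `alpha=0.8` puts the most mass on "cannot". Real tokenizers segment text differently, so exact string matching is only the simplest possible stand-in for a translation step.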

If this is right

  • Domain-fine-tuned models can regain safety without additional training.
  • The safety-utility trade-off becomes tunable at deployment time.
  • The approach works for both cross-vocabulary and same-vocabulary model pairs.
  • No permanent changes to model weights are required.
  • Inference overhead stays within practical limits for the tested setups.
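One way to see why the number of candidates could act as a deployment-time knob is a back-of-the-envelope calculation. The sketch below rests on two simplifying assumptions that do not come from the paper: each candidate is independently safe with the same probability `p_safe`, and the judge always picks a safe candidate when one exists. The function name is hypothetical.

```python
def prob_safe_candidate_available(p_safe: float, k: int) -> float:
    """Chance that at least one of k independent candidates is safe."""
    if not 0.0 <= p_safe <= 1.0:
        raise ValueError("p_safe must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_safe) ** k

# Illustrative value only: p_safe = 0.3 is not a measured quantity.
for k in (1, 2, 4, 8):
    print(k, round(prob_safe_candidate_available(0.3, k), 3))
```

Under these assumptions the benefit of extra candidates shows diminishing returns, while the number of candidates to generate and judge grows with K, which is the cost side of the trade-off.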

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mixing and selection process could potentially transfer other behavioral properties beyond safety.
  • Maintaining a small set of specialized safe anchor models might suffice for protecting many downstream specialists.
  • The vocabulary translation step may introduce subtle biases that affect long-horizon generation in ways not captured by current benchmarks.
  • Combining ALIGNBEAM with other inference-time interventions could produce stronger composite defenses.

Load-bearing premise

A small LLM judge can reliably identify the safest continuation among the K candidates generated via cross-vocabulary logit mixing at each decoding step.
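The selection step this premise depends on reduces to picking the highest-scoring candidate. The following minimal sketch uses hypothetical names (`select_safest`, `toy_judge`); the keyword heuristic merely stands in for a small LLM judge so that the example runs without a model, and it is not how the paper's judge works.

```python
from typing import Callable, Sequence

def select_safest(candidates: Sequence[str],
                  judge_score: Callable[[str], float]) -> str:
    """Return the candidate the judge scores as safest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=judge_score)

def toy_judge(text: str) -> float:
    """Placeholder judge: 1.0 if the text looks like a refusal, else 0.0."""
    refusal_markers = ("cannot", "can't", "unable")
    lowered = text.lower()
    return 1.0 if any(marker in lowered for marker in refusal_markers) else 0.0

candidates = [
    "Sure, here is how to do that.",
    "I cannot help with that request.",
]
print(select_safest(candidates, toy_judge))
```

Because the final output is whatever the judge prefers, any systematic error in `judge_score` passes straight through to the generated text, which is why the premise is load-bearing.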

What would settle it

Two outcomes would undermine the method: the judge consistently failing to select safe continuations on standard adversarial benchmarks, or task accuracy dropping well below that of the unmodified target model.

Original abstract

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ALIGNBEAM, a training-free inference-time method for transferring safety alignment from an anchor model to a target model across different vocabularies. It translates anchor logits token-by-token into the target vocabulary at each decoding step to generate K candidate continuations, then uses a small LLM judge to select the safest one. The method claims to raise refusal rates on adversarial benchmarks while preserving task accuracy, without modifying weights and with a safety-utility trade-off that is tunable at deployment.

Significance. If the empirical claims hold with proper validation, the approach would demonstrate that safety alignment can be transferred between model families at inference time without retraining or weight access, addressing degradation from domain fine-tuning. The cross-vocabulary capability and lack of free parameters in the core mixing step are potential strengths.

major comments (2)
  1. [Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.
  2. [Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.
minor comments (1)
  1. [Method] Clarify the exact token-by-token translation procedure with an equation or pseudocode to make the cross-vocabulary mixing reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the central claim that safety is transferred relies on the small LLM judge reliably selecting the aligned continuation, yet no accuracy metrics, inter-annotator agreement, comparison to human labels, or validation on adversarial inputs are reported for this selection step.

    Authors: We acknowledge that the manuscript does not report dedicated accuracy metrics or human validation specifically for the LLM judge's selection decisions. The paper's empirical claims rest on the end-to-end results of ALIGNBEAM (Section 4), where the judge-based selection is shown to contribute to higher refusal rates across benchmarks. To address this, we will add an appendix containing a validation study of the judge on a subset of adversarial inputs, including agreement with human labels and basic accuracy metrics. This will be incorporated in the revision.

    revision: yes

  2. Referee: [Abstract] Abstract: the assertion of 'substantially raises refusal on adversarial benchmarks while keeping task accuracy... within practical bounds' is presented without any quantitative results, tables, baselines, or error analysis, preventing assessment of whether the evidence supports the claim.

    Authors: The abstract is a high-level summary of the method and its outcomes. The supporting quantitative evidence (refusal rates on adversarial benchmarks, task accuracy preservation, baseline comparisons, and error analysis) is provided in full in Section 4 (Experiments), along with the associated tables and figures. These sections enable direct assessment of the claims. We can add one or two key quantitative highlights to the abstract if space permits, but we view the current structure as standard for the venue.

    revision: partial
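The judge validation study proposed in the first response would need some measure of agreement with human labels. Cohen's kappa is one common choice, though the rebuttal does not say which statistic the authors intend to use. The sketch below computes it from scratch; the function name and the label lists are made up for illustration and are not results from the paper.

```python
def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    judge_labels = list(judge_labels)
    human_labels = list(human_labels)
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Fraction of items on which the two raters agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Agreement expected by chance from each rater's label frequencies.
    categories = set(judge_labels) | set(human_labels)
    expected = sum(
        (judge_labels.count(c) / n) * (human_labels.count(c) / n)
        for c in categories
    )
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)

# Made-up labels: 1 = "safe continuation", 0 = "unsafe continuation".
judge = [1, 1, 0, 0, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(judge, human), 3))
```

Kappa corrects raw agreement for chance, which matters when most continuations fall into one class; here the two raters agree on 6 of 8 items but kappa is noticeably lower than 0.75.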

Circularity Check

0 steps flagged

No significant circularity: method is empirical and self-contained

full rationale

The paper introduces ALIGNBEAM as a training-free inference-time procedure using logit translation and an LLM judge. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. The central claim (safety transfer without weight updates) is evaluated on external benchmarks rather than being tautological. No self-citation load-bearing steps or ansatz smuggling appear in the provided text. This is the normal case of an independent empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; no details on fitting, background assumptions, or new postulated components are given.

pith-pipeline@v0.9.1-grok · 5698 in / 1085 out tokens · 29418 ms · 2026-06-27T10:00:22.695112+00:00 · methodology

