Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Fajri Koto; Ikhlasul Akmal Hanif; Rashad Aziz

arxiv: 2606.01196 · v1 · pith:IU4SUP62new · submitted 2026-05-31 · 💻 cs.CL · cs.AI


Rashad Aziz, Ikhlasul Akmal Hanif, Fajri Koto

Pith reviewed 2026-06-28 16:57 UTC · model grok-4.3


keywords: safety alignment, multilingual LLMs, low-resource languages, refusal behavior, harmfulness direction, decision calibration, steering methods


The pith

Low-resource safety failures are failures to act on present harmfulness representations, not missing representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a harmfulness direction extracted from high-resource model activations separates harmful from harmless prompts in low-resource languages nearly as effectively as in high-resource ones. Despite this preserved separation, refusal of harmful prompts falls from 87.9% in high-resource languages to 43.9% in low-resource ones, measured across three models and 23 languages. The authors conclude that the underlying representation transfers but the model's calibration of the safety decision does not. They repair the gap by resetting the decision threshold of a low-rank logistic readout gate trained on high-resource data, using only 1 to 4 low-resource examples per class, which raises refusal selectivity while preserving MMLU performance.

Core claim

The harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. The authors exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class.
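To make the distinction between "representation present" and "decision miscalibrated" concrete, here is a minimal sketch on synthetic data. It assumes a difference-of-means direction (the page does not state how the paper extracts its direction) and a toy language offset along that direction; `toy_activations`, `axis`, and all numeric settings are hypothetical stand-ins, not the authors' setup.

```python
import numpy as np

def harmfulness_direction(harmful, harmless):
    """Unit difference-of-means direction (assumed extraction method)."""
    v = harmful.mean(axis=0) - harmless.mean(axis=0)
    return v / np.linalg.norm(v)

def auc(pos_scores, neg_scores):
    """P(random harmful score > random harmless score); ties count half."""
    diff = pos_scores[:, None] - neg_scores[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

rng = np.random.default_rng(0)
dim = 16
axis = np.eye(dim)[0]  # hypothetical "true" harmfulness axis of the toy data

def toy_activations(n, class_shift, language_offset):
    # Synthetic stand-in for layer activations, not real model states.
    return rng.normal(size=(n, dim)) + (class_shift + language_offset) * axis

hr_harmful, hr_harmless = toy_activations(200, 3.0, 0.0), toy_activations(200, 0.0, 0.0)
# Low-resource toy data: same class gap, whole distribution shifted down the axis.
lr_harmful, lr_harmless = toy_activations(200, 3.0, -2.0), toy_activations(200, 0.0, -2.0)

v_hrl = harmfulness_direction(hr_harmful, hr_harmless)
# Threshold placed midway between the high-resource class means.
hr_threshold = 0.5 * ((hr_harmful @ v_hrl).mean() + (hr_harmless @ v_hrl).mean())

lr_auc = auc(lr_harmful @ v_hrl, lr_harmless @ v_hrl)
hr_flag_rate = float(((hr_harmful @ v_hrl) > hr_threshold).mean())
lr_flag_rate = float(((lr_harmful @ v_hrl) > hr_threshold).mean())
print(f"LR AUC={lr_auc:.3f}  HR flagged={hr_flag_rate:.2f}  LR flagged={lr_flag_rate:.2f}")
```

In this toy setting the low-resource AUC stays high while the high-resource threshold flags far fewer harmful low-resource prompts, which is the pattern the core claim describes. Whether real activations behave this way is exactly what the paper's experiments, not this sketch, have to show.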

What carries the argument

A low-rank logistic readout gate built on the high-resource harmfulness direction, with its decision threshold reset on minimal target-language examples to route between refusal steering and direction ablation.
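The routing idea can be sketched as follows. This is a rank-1 simplification under stated assumptions: the gate is a logistic function of the projection onto a single unit direction `v`, prompts judged harmful receive refusal steering (adding `lam * v`), and prompts judged harmless have their component along `v` removed. The branch assignment, the 0.5 cutoff, and the names `route`, `weight`, `bias`, and `lam` are illustrative choices, not details confirmed by the page.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def route(h, v, weight, bias, lam=4.0):
    """Route one activation vector through a 1-D logistic gate on s = h . v.

    Gate says harmful  -> add lam * v (refusal steering).
    Gate says harmless -> subtract the projection onto v (direction ablation).
    """
    s = float(h @ v)
    p_harmful = sigmoid(weight * s + bias)
    if p_harmful >= 0.5:
        return h + lam * v, "steer"
    return h - s * v, "ablate"

v = np.array([1.0, 0.0, 0.0])          # unit direction (toy)
h_harmful = np.array([2.0, 0.5, -1.0])  # projection 2.0 -> logit 2.0
h_harmless = np.array([0.2, 1.0, 0.3])  # projection 0.2 -> logit -1.6

steered, action_a = route(h_harmful, v, weight=2.0, bias=-2.0)
ablated, action_b = route(h_harmless, v, weight=2.0, bias=-2.0)
print(action_a, steered)
print(action_b, ablated)
```

In this form, recalibration amounts to changing `bias` (equivalently, the decision threshold on `s`) while leaving `v` and `weight` untouched.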

If this is right

Recalibrating the gate raises mean refusal selectivity to 54.5, up from 33.6 for the strongest adapted baseline, across the tested models.
The recalibrated gate preserves MMLU utility while improving cross-lingual refusal.
Adaptive steering methods such as AdaSteer and CAST inherit the same calibration failure and can be repaired by the same threshold reset.
Some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones.
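The selectivity figures in the list above follow the abstract's definition of $\Delta$ as harmful refusal minus harmless refusal. Written out, with $R_{\text{harm}}$ and $R_{\text{safe}}$ as notation assumed here for the two refusal rates in percent:

```latex
\Delta = R_{\text{harm}} - R_{\text{safe}},
\qquad
\Delta_{\text{gate}} - \Delta_{\text{baseline}} = 54.5 - 33.6 = 20.9 \text{ points}
```

A higher $\Delta$ means the model refuses harmful prompts more often without refusing harmless ones more often, so the metric penalizes a method that simply refuses everything.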

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same representation-versus-calibration split may appear in other alignment dimensions such as bias or truthfulness, suggesting minimal-example recalibration as a general repair strategy.
If the pattern holds, multilingual safety alignment could shift from expensive full retraining to lightweight threshold tuning on small target-language sets.
Testing whether the harmfulness direction continues to separate prompts after the model is further fine-tuned on low-resource data would clarify whether the representation remains stable.

Load-bearing premise

Linear separability of harmful and harmless prompts in low-resource activations shows the representation is present and could drive refusal if only the decision threshold were adjusted.

What would settle it

If resetting the decision threshold of the high-resource harmfulness direction on 1-4 low-resource examples per class produces no increase in harmful refusal rates relative to the uncalibrated baseline, the claim that the failure is one of calibration would be falsified.
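A toy version of this test can be sketched on one-dimensional scores. The reset rule used here (midpoint between the class means of the few target-language scores) is an assumption for illustration, as are the synthetic score distributions and the value of `hr_threshold`. The sketch measures gate decisions only; the actual falsification test concerns model refusal rates after the gate is applied.

```python
import numpy as np

def reset_threshold(harmful_scores, harmless_scores):
    """Midpoint between class means of the few target-language scores (assumed rule)."""
    return 0.5 * (np.mean(harmful_scores) + np.mean(harmless_scores))

def flag_rate(scores, threshold):
    """Fraction of scores the gate would flag as harmful."""
    return float(np.mean(np.asarray(scores) > threshold))

rng = np.random.default_rng(1)
hr_threshold = 1.5                               # toy high-resource threshold
lr_harmful = rng.normal(1.0, 1.0, size=500)      # toy low-resource harmful scores
lr_harmless = rng.normal(-2.0, 1.0, size=500)    # toy low-resource harmless scores

b = 4  # calibration examples per class; the rest are held out
new_threshold = reset_threshold(lr_harmful[:b], lr_harmless[:b])

baseline = flag_rate(lr_harmful[b:], hr_threshold)
recalibrated = flag_rate(lr_harmful[b:], new_threshold)
false_alarm = flag_rate(lr_harmless[b:], new_threshold)
print(f"baseline={baseline:.2f} recalibrated={recalibrated:.2f} false_alarm={false_alarm:.2f}")
```

If a comparison like this showed no gain over the uncalibrated baseline on real held-out prompts, that would count against the calibration diagnosis.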

Figures

Figures reproduced from arXiv: 2606.01196 by Fajri Koto, Ikhlasul Akmal Hanif, Rashad Aziz.

**Figure 1.** Refusal degrades with lower language resource tier. Macro refusal rates across Qwen2.5-7B-Instruct, Gemma-2-9B-it, and Llama-3.1-8B-Instruct show harmful refusal falling sharply in LRLs while harmless refusal stays low.

**Figure 2.** Harmful directions mediate refusal in middle layers. Directional ablations show that HRL harmfulness directions encode harmful refusal on tested HRL PolyRefuse prompts, peaking at layer 15 for Qwen, 20 for Gemma, and 10 for Llama.

**Figure 3.** Lower-resource prompts shift harmfulness-score distributions. Gemma-2-9B projections on $v_{\mathrm{HRL}}$ show harmful and harmless $s(x)$ distributions within each resource tier; harmful projections shift downward most clearly in the LRL panel.

**Figure 4.** Adding the HRL direction can recover refusal, but the needed amount differs by tier. Curves average Qwen2.5-7B harmful refusal within each resource tier after adding $\lambda v_{\mathrm{HRL}}$. Crosses mark tier-wise peaks.

**Figure 5.** Harmfulness remains decodable even when models fail to refuse. Cells report PolyRefuse test AUC from $s(x)$ under the pooled HRL direction. Yellow, orange, and red borders mark harmful-refusal rates of 50–75%, 25–50%, and below 25%, respectively. In highlighted cells, smaller numbers are actual harmful-refusal rates.

**Figure 6.** A few target-language examples recover LRL harmfulness gates. Each curve reports macro LRL test F1 after choosing a language-specific binary threshold with $b$ harmful and $b$ harmless target-language examples. The random-direction curve is a one-dimensional control whose threshold is calibrated with the same target-language examples.

**Figure 7.** Gemma few-shot latent gates on out-of-distribution safety benchmarks. Bars report held-out balanced accuracy on MultiJail and IndoSafety for Gemma-2-9B, averaged within each dataset family. Error bars show the standard error across target languages and calibration seeds.

**Figure 8.** Gemma LRL calibration often transfers across source–target pairs. Cells report test F1 for an HRL subspace gate trained with HRL data plus the source LRL and evaluated on the target LRL. Diagonal cells report held-out same-language F1.

**Figure 9.** OOD transfer by target language. One-shot $v_{\mathrm{HRL}}$ is strongest for Qwen and Gemma; Llama gains more from the HRL subspace gate.

**Figure 10.** All-model OOD transfer by dataset. Full version of …

**Figure 11.** All-model LRL calibration transfer across source–target pairs. Full version of …

**Figure 12.** IndoSafety risk areas. Information Hazards are the weakest category across models.

**Figure 13.** IndoSafety harm types. Privacy leakage, misleading information, and chatbot overreliance drive most residual OOD errors.

**Figure 14.** Human validation of the refusal judge. GPT-4o-mini labels closely match human labels on sampled LRL completions.

| Model | r1 | r2 | r3 | r5 | r8 | r10 |
|---|---|---|---|---|---|---|
| Gemma-2-9B | 0.866 | 0.869 | 0.879 | 0.879 | 0.896 | 0.895 |
| Qwen2.5-7B | 0.781 | 0.762 | 0.768 | 0.807 | 0.802 | 0.815 |
| Llama-3.1-8B | 0.680 | 0.696 | 0.690 | 0.713 | 0.722 | 0.720 |
| All models | 0.776 | 0.776 | 0.779 | 0.800 | 0.807 | 0.810 |

**Figure 15.** Low-rank HRL subspaces are enough for macro LRL classification. Test F1 saturates quickly as the HRL-only subspace rank increases.

**Figure 16.** Gemma HRL-subspace transfer by LRL. Gains are strongest at small calibration budgets and vary by language.

| Model used | Repository | License or terms | Verification source |
|---|---|---|---|
| Qwen2.5-7B-Instruct | Qwen/Qwen2.5-7B-Instruct | Apache License 2.0 | Hugging Face model card license tag and repository LICENSE file |
| gemma-2-9b-it | google/gemma-2-9b-it | Gemma Terms of Use | Hugging Face model card license tag and Google Gemma Terms … |

**Figure 17.** Qwen HRL-subspace transfer by LRL. Most languages improve sharply after one or two target examples.

**Figure 18.** Llama HRL-subspace transfer by LRL. Low-budget calibration matters most for Khmer, Sinhala, and Yoruba.

**Figure 19.** Tier-level score distributions for Qwen and Llama. Lower-resource harmful prompts shift toward lower $s(x)$ scores, matching the Gemma diagnostic in the main text.

**Figure 20.** Refusal-direction sweeps for Gemma and Llama. Adding $\lambda v_{\mathrm{HRL}}$ increases harmful refusal, with model- and tier-specific optima.

Original abstract

Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ($\Delta$ = harmful $-$ harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: https://github.com/rashadaziz/low-resource-safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows harmfulness directions transfer linearly across languages while refusal calibration does not, and offers a 1-4 example recalibration fix, but the claim that the representation is the operative one stays correlational.


The main takeaway is that harmfulness representations appear to transfer to low-resource languages via linear probes, yet refusal rates collapse, and the authors repair selectivity by resetting a threshold on a low-rank readout with just a handful of target examples.

Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the transferred direction separates harmful from harmless low-resource prompts nearly as well as high-resource ones. Harmful refusal drops from 87.9% to 43.9%, yet their recalibration lifts mean selectivity to 54.5, against 33.6 for the strongest adapted baseline, while holding MMLU steady. They release the code.

This split between representation presence and action failure is the clearest new angle, and the practical result is easy to test. The numbers are reported consistently enough to make the observation worth checking.

The soft spot is that linear separability alone does not show the direction participates in the model's actual low-resource refusal computation. The stress-test concern holds: without ablation or steering results on low-resource activations that match the high-resource effect sizes and signs, the separation could reflect prompt distribution artifacts rather than the internal mechanism. The recalibration also fits a logistic readout and threshold to the small target set, so part of the gain may come from that adaptation step rather than pure threshold reset on an existing gate.

The work is for people focused on cross-lingual alignment and low-resource safety fixes. It deserves a serious referee because the empirical pattern and released code are concrete enough to evaluate and build on, even if the causal interpretation needs tighter tests.

Referee Report

2 major / 2 minor

Summary. The paper claims that low-resource safety failures in LLMs (refusal rates dropping from 87.9% in high-resource to 43.9% in low-resource languages) are action/calibration failures rather than representation failures. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, a harmfulness direction extracted from high-resource activations linearly separates harmful vs. harmless low-resource prompts nearly as well as high-resource ones. The authors exploit this by recalibrating a low-rank logistic readout (with decision threshold reset on 1-4 target-language examples per class) to route between refusal steering and harmfulness-direction ablation, raising mean refusal selectivity from 33.6 (strongest adapted baseline) to 54.5 while preserving MMLU. Code is released.

Significance. If the diagnosis holds, the result indicates that safety representations can transfer cross-lingually even when refusal behavior does not, enabling efficient repair via recalibration of existing components rather than new training. Consistent patterns across three models and 23 languages, plus the public code release, are strengths that support verifiability and potential impact on multilingual safety alignment.

major comments (2)

[Abstract and §4 (linear separation experiments)] Abstract and experimental sections on linear separation: The central claim that 'the relevant representation is present' is grounded in the transferred harmfulness direction achieving high linear separation on low-resource activations. However, this remains a correlational readout result; the manuscript does not report whether causal interventions (activation steering or ablation along the same direction) in low-resource settings produce refusal-rate changes whose magnitude or sign match the high-resource case. Without this, the separation could be an incidental correlate of prompt distribution rather than the operative representation used by the model, which is load-bearing for the 'action failure, not representation failure' diagnosis.
[§5 (recalibration and selectivity results)] Results on recalibration (reported Δ from 33.6 to 54.5): The practical improvement relies on fitting a low-rank logistic readout and threshold to the small set of target-language examples. While the separation metric is presented as independent, the dependence of the selectivity gain on these fitted parameters (free parameters noted in the stress-test) should be quantified via ablation of the fitting procedure itself to isolate the contribution of recalibration.

minor comments (2)

[Methods and results sections] The exact criteria for selecting the 23 languages, the precise definitions of all baselines (including the 'strongest adapted baseline'), and whether error bars or statistical tests accompany the refusal rates and selectivity metrics are not fully specified in the text; adding these would improve replicability.
[§3 (methods)] Notation for the harmfulness direction and the low-rank logistic readout could be introduced with an equation or explicit definition early in the paper to aid readers in following the transfer and recalibration arguments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the positive assessment of the paper's significance, consistency across models, and code release. We address each major comment below, indicating planned revisions where appropriate.


Referee: [Abstract and §4 (linear separation experiments)] Abstract and experimental sections on linear separation: The central claim that 'the relevant representation is present' is grounded in the transferred harmfulness direction achieving high linear separation on low-resource activations. However, this remains a correlational readout result; the manuscript does not report whether causal interventions (activation steering or ablation along the same direction) in low-resource settings produce refusal-rate changes whose magnitude or sign match the high-resource case. Without this, the separation could be an incidental correlate of prompt distribution rather than the operative representation used by the model, which is load-bearing for the 'action failure, not representation failure' diagnosis.

Authors: We agree that linear separability alone is correlational and that explicit causal evidence would more directly support the claim that the transferred direction is the operative representation. The manuscript does apply the direction causally in §5 via the recalibrated gate (routing between refusal steering and harmfulness-direction ablation) and shows resulting gains in low-resource refusal selectivity. However, we did not report a direct comparison of intervention effect sizes (refusal-rate deltas) between high- and low-resource settings along this direction. We will add this analysis in the revision, including magnitude and sign comparisons, to address the concern.

Revision: yes

Referee: [§5 (recalibration and selectivity results)] Results on recalibration (reported Δ from 33.6 to 54.5): The practical improvement relies on fitting a low-rank logistic readout and threshold to the small set of target-language examples. While the separation metric is presented as independent, the dependence of the selectivity gain on these fitted parameters (free parameters noted in the stress-test) should be quantified via ablation of the fitting procedure itself to isolate the contribution of recalibration.

Authors: We agree that the selectivity improvement depends on the fitted readout and threshold, and that an ablation isolating this contribution would clarify the role of recalibration. We will add an ablation that fixes the logistic parameters and threshold to their high-resource values (no target-language fitting) and reports the resulting low-resource selectivity, thereby quantifying the incremental gain from the 1-4 examples.

Revision: yes

Circularity Check

0 steps flagged

No circularity: linear separability is an independent empirical measurement, not a fitted prediction or self-definition.


The paper's central diagnostic claim—that the harmfulness representation is present in low-resource activations because the transferred high-resource direction separates harmful vs. harmless prompts nearly as well as in high-resource—rests on a direct linear probe evaluation, which is a measurement rather than a derivation that reduces to the conclusion by construction. The subsequent recalibration method fits a low-rank logistic readout and threshold on 1-4 target examples to produce an improved gate, but this is presented as an applied fix, not as evidence for the representation-presence diagnosis itself. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the core separation result. The refusal-rate drop (87.9% to 43.9%) is reported as an observed failure mode independent of the probe. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation of linear separability plus the assumption that this separability is sufficient evidence of representational presence; the practical fix introduces fitted parameters for the readout and threshold.

free parameters (2)

safety decision threshold = reset using 1-4 examples per class
The threshold of the low-rank logistic readout is reset using 1-4 target-language examples per class.
low-rank logistic readout parameters
The readout itself is a low-rank logistic regression whose weights are derived from high-resource data before threshold adjustment.
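The two free parameters in the ledger above can be separated in code. The sketch below fits a low-rank logistic readout on synthetic high-resource data, freezes its weights, and then resets only the bias from four target-language examples per class. The PCA-based subspace, the gradient-descent fit, the midpoint reset rule, and all names (`fit_low_rank_readout`, `reset_bias`, `toy`) are assumptions for illustration; the page does not describe the paper's Algorithm 1 in enough detail to reproduce it.

```python
import numpy as np

def fit_low_rank_readout(x, y, rank=2, steps=500, lr=0.1):
    """Logistic regression on a rank-`rank` PCA projection of x (assumed construction)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    basis = vt[:rank].T                       # (dim, rank) projection basis
    z = (x - mean) @ basis
    w, b = np.zeros(rank), 0.0
    for _ in range(steps):                    # plain batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(z @ w + b)))
        w -= lr * z.T @ (p - y) / len(y)
        b -= lr * float(np.mean(p - y))
    return mean, basis, w, b

def logits(x, mean, basis, w):
    """Readout logits without the bias term."""
    return ((x - mean) @ basis) @ w

def reset_bias(harmful_logits, harmless_logits):
    """Weights stay frozen; only the boundary moves to the few-shot midpoint."""
    return -0.5 * (np.mean(harmful_logits) + np.mean(harmless_logits))

rng = np.random.default_rng(2)
dim = 8
e0 = np.eye(dim)[0]

def toy(n, class_shift, lang_offset):
    # Synthetic activations: noise plus class and language shifts along one axis.
    return rng.normal(size=(n, dim)) + (class_shift + lang_offset) * e0

hr_x = np.vstack([toy(300, 3.0, 0.0), toy(300, 0.0, 0.0)])
hr_y = np.concatenate([np.ones(300), np.zeros(300)])
lr_harmful, lr_harmless = toy(300, 3.0, -2.0), toy(300, 0.0, -2.0)

mean, basis, w, b_hr = fit_low_rank_readout(hr_x, hr_y)
lh = logits(lr_harmful, mean, basis, w)
ll = logits(lr_harmless, mean, basis, w)
b_lr = reset_bias(lh[:4], ll[:4])             # 4 target examples per class

def balanced_acc(bias):
    # Evaluate on the held-out low-resource examples only.
    return 0.5 * (np.mean(lh[4:] + bias > 0) + np.mean(ll[4:] + bias <= 0))

acc_frozen, acc_reset = balanced_acc(b_hr), balanced_acc(b_lr)
print(f"frozen bias: {acc_frozen:.2f}   reset bias: {acc_reset:.2f}")
```

Comparing `acc_frozen` with `acc_reset` mirrors the ablation the referee requests: the gap isolates what the few target-language examples contribute when nothing but the decision boundary is allowed to change.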

axioms (1)

Domain assumption: Linear separability of harmful versus harmless prompts in the extracted activation direction indicates that the underlying safety representation is present and can be acted upon once the decision threshold is properly calibrated.
Invoked directly when concluding that the representation transfers while the action does not, based on the reported separation performance.

pith-pipeline@v0.9.1-grok · 5810 in / 1493 out tokens · 35276 ms · 2026-06-28T16:57:34.970337+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 41 canonical work pages · 12 internal anchors

[1] Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35 (NeurIPS 2022). arXiv:2203.02155.
[2] Bai, Yuntao, et al. Constitutional AI: Harmlessness from AI Feedback.
[3] Touvron, Hugo, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models.
[4] The … 2024.
[5] Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Martinet, Xavier; Lachaux, Marie-Anne; et al. 2023.
[6] Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien. The … 2023. arXiv:2306.01116.
[7] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. Proceedings of the 40th International Conference on Machine Learning, 2023. arXiv:2304.01373.
[8] Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor. The … arXiv:2101.00027.
[9] Groeneveld, Dirk, et al. 2024. doi:10.18653/v1/2024.acl-long.841.
[10] Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.acl-long.840.
[11] Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. doi:10.18653/v1/2020.acl-main.747.
[12] Wenzek, Guillaume; Lachaux, Marie-Anne; Conneau, Alexis; Chaudhary, Vishrav; et al. Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020. arXiv:1911.00359.
[13] Representation Engineering: A Top-Down Approach to AI Transparency. 2025.
[14] Röttger … Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024. doi:10.18653/v1/2024.naacl-long.301.
[15] Multilingual Jailbreak Challenges in Large Language Models. The Twelfth International Conference on Learning Representations. arXiv:2310.06474.
[16] Azmi, Muhammad Falensi; Al Kautsar, Muhammad Dehan; Wicaksono, Alfan Farizki; Koto, Fajri. 2025. doi:10.18653/v1/2025.emnlp-main.465.
[17] Yong, Zheng-Xin; Menghini, Cristina; Bach, Stephen H. Low-Resource Languages Jailbreak GPT-4. 2023. arXiv:2310.02446.
[18] All Languages Matter: On the Multilingual Safety of Large Language Models. Findings of the Association for Computational Linguistics: ACL 2024, 2024. doi:10.18653/v1/2024.findings-acl.349.
[19] Shen, Lingfeng; Tan, Weiting; Chen, Sihao; Chen, Yunmo; Zhang, Jingyu; Xu, Haoran; Zheng, Boyuan; Koehn, Philipp; Khashabi, Daniel. The Language Barrier: Dissecting Safety Challenges of … 2024. doi:10.18653/v1/2024.findings-acl.156.
[20] Yong, Zheng Xin; Ermis, Beyza; Fadaee, Marzieh; Bach, Stephen; Kreutzer, Julia. The State of Multilingual … 2025. doi:10.18653/v1/2025.emnlp-main.800.
[21] Kumar, Priyanshu; Jain, Devansh; Yerukola, Akhila; Jiang, Liwei; Beniwal, Himanshu; Hartvigsen, Thomas; Sap, Maarten. 2025.
[22] Verma, Sahil; Hines, Keegan; Bilmes, Jeff; Siska, Charlotte; Zettlemoyer, Luke; Gonen, Hila; Singh, Chandan. 2025. doi:10.18653/v1/2025.emnlp-main.819.
[23] Zhang, Zekai; Guo, Yiduo; Lin, Jiuheng; Quan, Shanghaoran; Zhang, Huishuai; Zhao, Dongyan. 2025. doi:10.18653/v1/2025.findings-emnlp.62.
[24] Bu, Yuyan; Liu, Xiaohao; Ren, Zhaoxing; Yang, Yaodong; Dai, Juntao. Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for … 2026. arXiv:2602.16660.
[25] Yang, Junxiao; Liu, Haoran; Tu, Jinzhe; Cheng, Jiale; Zhang, Zhexin; Cui, Shiyao; Weng, Jiaqi; Tao, Jialing; Xue, Hui; Wang, Hongning; Qiu, Han; Huang, Minlie. LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety. arXiv:2604.12710.
[26] Multilingual Safety Alignment Via Sparse Weight Editing. 2026.
[27] Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment. Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. doi:10.18653/v1/2025.findings-emnlp.497.
[28] Zhao, Weixiang; Hu, Yulin; Deng, Yang; Wu, Tongtong; Zhang, Wenxuan; Guo, Jiahe; Zhang, An; Zhao, Yanyan; Qin, Bing; Chua, Tat-Seng; Liu, Ting. 2025. doi:10.18653/v1/2025.acl-long.1149.
[29] Programming Refusal with Conditional Activation Steering. The Thirteenth International Conference on Learning Representations. arXiv:2409.05907.
[30] Refusal in Language Models Is Mediated by a Single Direction. Advances in Neural Information Processing Systems 37 (NeurIPS 2024). arXiv:2406.11717.
[31] Zhao, Jiachen; Huang, Jing; Wu, Zhengxuan; Bau, David; Shi, Weiyan. 2025. arXiv:2507.11878.
[32] Zhao, Weixiang; Guo, Jiahe; Hu, Yulin; Deng, Yang; Zhang, An; Sui, Xingyu; Han, Xinyang; Zhao, Yanyan; Qin, Bing; Chua, Tat-Seng; Liu, Ting. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, November 2025. doi:10.18653/v1/2025.emnlp-main.1248.
[33] Marshall, Thomas; Scherlis, Adam; Belrose, Nora. Refusal in … arXiv:2411.09003.
[34] The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence. Proceedings of the 42nd International Conference on Machine Learning, 2025. arXiv:2502.17420.
[35] Pan, Wenbo; Liu, Zhichao; Chen, Qiguang; Zhou, Xiangyang; Yu, Haining; Jia, Xiaohua. The Hidden Dimensions of … 2025. arXiv:2502.09674.
[36]

2026 , eprint =

There Is More to Refusal in Large Language Models than a Single Direction , author =. 2026 , eprint =

2026
[37]

Advances in Neural Information Processing Systems 38 (NeurIPS 2025) , year =

Refusal Direction Is Universal Across Safety-Aligned Languages , author =. Advances in Neural Information Processing Systems 38 (NeurIPS 2025) , year =. 2505.17306 , archiveprefix =

work page arXiv 2025
[38]

Singh, Shivalika and Romanou, Angelika and Fourrier, Cl. Global. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2025 , url =. doi:10.18653/v1/2025.acl-long.919 , eprint =

work page doi:10.18653/v1/2025.acl-long.919 2025
[39]

2023 , eprint =

Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. 2023 , eprint =

2023
[40]

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi , booktitle =. Catastrophic Jailbreak of Open-source. 2024 , url =. 2310.06987 , archiveprefix =

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

2023 , url =

Mazeika, Mantas and Zou, Andy and Mu, Norman and Phan, Long and Wang, Zifan and Yu, Chunru and Khoja, Adam and Jiang, Fengqing and O'Gara, Aidan and Sakhaee, Ellie and Xiang, Zhen and Rajabi, Arezoo and Hendrycks, Dan and Poovendran, Radha and Li, Bo and Forsyth, David , booktitle =. 2023 , url =

2023
[42]

, year =

Taori, Rohan and Gulrajani, Ishaan and Zhang, Tianyi and Dubois, Yann and Li, Xuechen and Guestrin, Carlos and Liang, Percy and Hashimoto, Tatsunori B. , year =. GitHub repository , howpublished =
[43]

The Hidden Space of Safety: Understanding Preference-Tuned

Verma, Nikhil and Bharadwaj, Manasa , year =. The Hidden Space of Safety: Understanding Preference-Tuned. 2504.02708 , archiveprefix =

work page arXiv
[44]

Do Llamas Work in E nglish? On the Latent Language of Multilingual Transformers

Wendler, Chris and Veselovsky, Veniamin and Monea, Giovanni and West, Robert , booktitle =. Do Llamas Work in. 2024 , url =. doi:10.18653/v1/2024.acl-long.820 , eprint =

work page doi:10.18653/v1/2024.acl-long.820 2024
[45]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Separating Tongue from Thought: Activation Patching Reveals Language-Agnostic Concept Representations in Transformers , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =. 2025 , url =. doi:10.18653/v1/2025.acl-long.1536 , eprint =

work page doi:10.18653/v1/2025.acl-long.1536 2025
[46]

The Thirteenth International Conference on Learning Representations , year =

The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities , author =. The Thirteenth International Conference on Learning Representations , year =. 2411.04986 , archiveprefix =

work page arXiv
[47]

2026 , url =

On the Non-Identifiability of Steering Vectors in Large Language Models , author =. 2026 , url =. 2602.06801 , archiveprefix =

work page arXiv 2026
[48]

2026 , eprint =

The Truthfulness Spectrum Hypothesis , author =. 2026 , eprint =

2026
[49]

2026 , eprint =

The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models , author =. 2026 , eprint =

2026
[50]

Transformer Circuits Thread , year =

A Mathematical Framework for Transformer Circuits , author =. Transformer Circuits Thread , year =
[51]

G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Liu, Yang and Iter, Dan and Xu, Yichong and Wang, Shuohang and Xu, Ruochen and Zhu, Chenguang , booktitle =. 2023 , address =. doi:10.18653/v1/2023.emnlp-main.153 , pages =. 2303.16634 , archiveprefix =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2023.emnlp-main.153 2023
[52]

Zheng, Lianmin and Chiang, Wei-Lin and Sheng, Ying and Zhuang, Siyuan and Wu, Zhanghao and Zhuang, Yonghao and Lin, Zi and Li, Zhuohan and Li, Dacheng and Xing, Eric and Zhang, Hao and Gonzalez, Joseph and Stoica, Ion , booktitle =. Judging. 2023 , url =. 2306.05685 , archiveprefix =

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

Reimers, Nils and Gurevych, Iryna. Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1410

work page doi:10.18653/v1/d19-1410 2019
