Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Daniil Gurgurov; Josef van Genabith; Patrick Schramowski; Simon Ostermann; Tanja Baeumel; Yusser Al Ghussin

arxiv: 2605.23036 · v1 · pith:DGFHKRB5new · submitted 2026-05-21 · 💻 cs.CL

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Yusser Al Ghussin , Daniil Gurgurov , Tanja Baeumel , Josef van Genabith , Patrick Schramowski , Simon Ostermann This is my paper

Pith reviewed 2026-05-25 05:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords sparse autoencodersmultilingual steeringactivation steeringlayer selectionmachine translationcross-lingual summarizationmechanistic interpretability

0 comments

The pith

Training sparse autoencoders on multilingual data strengthens cross-lingual representations and enables reliable language control by selecting layers via the intersection of alignment and separability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether sparse autoencoders trained on mixed-language data improve control over the language of model outputs compared to English-only training. It reports that multilingual training produces more consistent language steering while preserving generation quality across different model families and tasks. The authors introduce a selection rule that identifies promising intervention layers ahead of time by combining measures of how languages align and how they separate at each layer. This rule avoids the need to test every layer individually. Experiments on translation and summarization tasks with two models show the combined approach balances language accuracy and output quality more stably than prior methods.

Core claim

Training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. An a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability predicts effective intervention depths without exhaustive layerwise search.

What carries the argument

Multilingual sparse autoencoders trained on mixed-language data, together with an intersection metric of multilingual alignment and language separability used to select intervention layers.

Load-bearing premise

The intersection of multilingual alignment and language separability at a given layer reliably predicts which layers will work best for steering without needing post-hoc checks on new models or tasks.

What would settle it

A demonstration that layers chosen by the intersection metric produce no better steering results than randomly chosen or heuristically chosen layers on a new model family or task would falsify the predictive rule.

Figures

Figures reproduced from arXiv: 2605.23036 by Daniil Gurgurov, Josef van Genabith, Patrick Schramowski, Simon Ostermann, Tanja Baeumel, Yusser Al Ghussin.

**Figure 2.** Figure 2: Correlation matrices of per-language contrast [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Performance deltas relative to Scope baselines for [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Layer-selection curves showing the balance [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 6.** Figure 6: Layerwise ∆COMET and ∆LaSE trends for LLaMA-3.1-8B averaged across SAEs under two steering regimes. Top: steer_lang ̸= target_lang. Bottom: steer_lang = target_lang. monotonic trend, peaking near the layers identified by our multilinguality–separability intersection. This divergence highlights the role of representational balance: deeper layers benefit samelanguage reinforcement, whereas effective cros… view at source ↗

**Figure 7.** Figure 7: further reproduces the early–late dynamics of multilingual representations previously reported for LLaMA-3.1-8B (Gurgurov et al., 2025; Tan et al., 2024): shared cross-lingual structure is strongest in early-to-mid layers, while language separability increases toward later depths. Notably, LLaMA-Scope exhibits substantially lower separability than even the dense residual stream across all layers, which l… view at source ↗

**Figure 8.** Figure 8: Example prompt and outputs for cross-lingual summarization (CrossSum). The model is prompted in [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Example prompt and outputs for cross-lingual summarization (CrossSum). The model is prompted in [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Example prompt and outputs for machine translation. The model is prompted in Chinese and steered to [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Example prompt and outputs for machine translation. The model is prompted in German and steered to [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗

**Figure 12.** Figure 12: Performance deltas relative to Scope baselines for [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Performance deltas relative to Scope baselines for [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗

**Figure 14.** Figure 14: Layerwise heatmaps of performance deltas relative to [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

**Figure 15.** Figure 15: Layerwise heatmaps of performance deltas relative to [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗

**Figure 16.** Figure 16: Performance deltas relative to Scope baselines for [PITH_FULL_IMAGE:figures/full_fig_p021_16.png] view at source ↗

**Figure 17.** Figure 17: Performance deltas relative to Scope baselines for [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗

**Figure 18.** Figure 18: Layerwise heatmaps of performance deltas relative to [PITH_FULL_IMAGE:figures/full_fig_p022_18.png] view at source ↗

**Figure 19.** Figure 19: Layerwise heatmaps of performance deltas relative to [PITH_FULL_IMAGE:figures/full_fig_p023_19.png] view at source ↗

**Figure 20.** Figure 20: Comparison of LLama-3.1-8B model representation space using residual stream vectors, LLama-Scope [PITH_FULL_IMAGE:figures/full_fig_p024_20.png] view at source ↗

**Figure 21.** Figure 21: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p027_21.png] view at source ↗

**Figure 22.** Figure 22: Per-language, per-layer deltas for Gemma-2-9B on FLORES under matched steering and target languages (tgt_i = steer_j). The heatmaps show the impact of SAE variants on language identification and translation quality across model depth [PITH_FULL_IMAGE:figures/full_fig_p028_22.png] view at source ↗

**Figure 23.** Figure 23: Per-language, per-layer COMET score deltas for [PITH_FULL_IMAGE:figures/full_fig_p029_23.png] view at source ↗

**Figure 24.** Figure 24: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p030_24.png] view at source ↗

**Figure 25.** Figure 25: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗

**Figure 26.** Figure 26: Per-language, per-layer COMET score deltas for [PITH_FULL_IMAGE:figures/full_fig_p032_26.png] view at source ↗

**Figure 27.** Figure 27: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p033_27.png] view at source ↗

**Figure 28.** Figure 28: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p034_28.png] view at source ↗

**Figure 29.** Figure 29: Per-language, per-layer COMET score deltas for [PITH_FULL_IMAGE:figures/full_fig_p035_29.png] view at source ↗

**Figure 30.** Figure 30: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p036_30.png] view at source ↗

**Figure 31.** Figure 31: Per-language, per-layer performance deltas for [PITH_FULL_IMAGE:figures/full_fig_p037_31.png] view at source ↗

**Figure 32.** Figure 32: Per-language, per-layer COMET score deltas for [PITH_FULL_IMAGE:figures/full_fig_p038_32.png] view at source ↗

read the original abstract

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an \emph{a priori} steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Multilingual SAEs plus an intersection rule for layer selection is a practical incremental step for steering reliability.

read the letter

The paper's main advance is training sparse autoencoders on multilingual data instead of English-only corpora, plus an a priori rule that picks steering layers by intersecting multilingual alignment and language separability scores. This replaces heuristic layer choice and aims to make activation steering more reliable across languages without exhaustive search. They evaluate on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization, reporting better stability in the trade-off between language identification and output quality on SpBLEU, COMET, ROUGE-L, and LaSE. That combination is new enough in the SAE steering literature to be worth noting. The multilingual training result looks plausible and the layer rule is a concrete, testable idea that could save compute. The soft spots are modest but real. The abstract and reported results give limited detail on how the intersection metric is exactly computed, what the ablation baselines were, or whether the gains survive statistical checks and more languages or models. It is not obvious yet whether the rule generalizes without post-hoc adjustment or if the improvements are large enough to change practice. The work is aimed at researchers already doing mechanistic interpretability and activation steering in multilingual settings. It has enough structure and empirical grounding to deserve a serious referee, though the methods and generalization sections would need tightening. I would send it to review.

Referee Report

1 major / 1 minor

Summary. The paper claims that training sparse autoencoders (SAEs) on multilingual data strengthens cross-lingual representations and enables more reliable, quality-preserving language control across layers and model families. It introduces an a priori layer-selection rule based on the intersection of multilingual alignment and language separability metrics to predict effective steering depths without exhaustive search. Evaluations on LLaMA-3.1-8B and Gemma-2-9B for machine translation and cross-lingual summarization (CrossSumm) using SpBLEU, ROUGE-L, COMET, and LaSE are said to show that multilingual SAEs with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality.

Significance. If the empirical claims hold, the work advances mechanistic interpretability by providing a representation-level account of multilingual SAE steering that reduces heuristic choices and English-centric biases. The a priori selection rule, if validated, would be a notable contribution for scalable intervention in multilingual settings.

major comments (1)

Abstract: the central claim that the intersection metric of multilingual alignment and language separability is a reliable a priori predictor of steering effectiveness is load-bearing, yet the abstract provides no definition, computation details, or cross-validation evidence for this metric, leaving its generalization beyond the two tested models and tasks unaddressed.

minor comments (1)

Abstract: no quantitative results, effect sizes, or specific metric improvements (e.g., changes in SpBLEU or COMET) are reported despite claims of stabilized trade-offs, which hinders assessment of practical impact.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract's presentation of our central contribution. We address the concern point by point below.

read point-by-point responses

Referee: Abstract: the central claim that the intersection metric of multilingual alignment and language separability is a reliable a priori predictor of steering effectiveness is load-bearing, yet the abstract provides no definition, computation details, or cross-validation evidence for this metric, leaving its generalization beyond the two tested models and tasks unaddressed.

Authors: We agree the abstract would benefit from greater precision on this point. The manuscript body defines multilingual alignment as the cosine similarity between language-specific mean SAE activations and language separability as the accuracy of a linear probe on SAE latents; the intersection rule selects layers where both exceed English-derived thresholds, computed on a held-out multilingual calibration set. Cross-validation evidence appears in the layerwise steering results for MT and CrossSumm on LLaMA-3.1-8B and Gemma-2-9B (Tables 3–5, Figures 4–6). We will revise the abstract to include a concise parenthetical definition of the metrics and a reference to the empirical validation. The paper evaluates the rule on the two models and two tasks reported and does not claim generalization beyond this scope. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is empirically grounded and self-contained

full rationale

The paper's central claims rest on two empirical demonstrations: (1) multilingual SAE training improves cross-lingual steering reliability, and (2) an independently computed intersection metric of multilingual alignment and language separability at each layer predicts effective steering depths. Neither result is obtained by fitting parameters to the target steering outcomes and then relabeling those fits as predictions; the layer-selection rule is presented as a priori and is validated post-hoc on held-out tasks and models. No self-citation chain, self-definitional equations, or ansatz smuggling is described in the abstract or reader's summary. The derivation chain therefore remains non-circular and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no free parameters, axioms, or invented entities are specified in the text.

pith-pipeline@v0.9.0 · 5759 in / 1140 out tokens · 19828 ms · 2026-05-25T05:33:30.966917+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1]

Improving steering vectors by target- ing sparse autoencoder features.arXiv preprint arXiv:2411.02193. Tyler A. Chang, Zhuowen Tu, and Benjamin K. Bergen

work page arXiv
[2]

InProceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing

The geometry of multilingual language model representations. InProceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing. Association for Computational Linguis- tics. Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, and Sean O’Brien

work page 2022
[3]

InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop

Causal language control in multilingual trans- formers via sparse feature steering. InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop. Association for Computational Linguistics. Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle- moyer, and Veselin Stoyanov. 2020. Emerging cross- lingual ...

work page 2020
[4]

No Language Left Behind: Scaling Human-Centered Machine Translation

Association for Computational Linguistics. Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672. Hoagy Cunningham, Aidan Ewart, Logan Rig...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, and Simon Oster- mann

Clas-bench: A cross-lingual alignment and steering benchmark.Preprint, arXiv:2601.08331. Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, and Simon Oster- mann. 2025. Language arithmetics: Towards system- atic language neuron identification and manipulation. InProceedings of the 14th International Joint Con- ferenc...

work page arXiv 2025
[6]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov

Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov

work page arXiv
[7]

Fasttext.zip: Compressing text classification models.arXiv preprint arXiv:1612.03651. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ay- onrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. 2025. SAEBench: A comprehensive benchmark...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

Interpretable steering of large language mod- els with feature guided activation additions.arXiv preprint arXiv:2501.09929. Shaomu Tan, Di Wu, and Christof Monz. 2024. Neuron specialization: Leveraging intrinsic task modularity for multilingual machine translation. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing,...

work page arXiv 2024
[9]

Encode the activation into sparse space: zℓ(x) = Encoderℓ(hℓ(x))

work page
[10]

We use fixed steering coefficients for all test examples within each model setting, with α= 5.0 for LLaMA and α= 100.0 for Gemma

Apply the steering vector: z′ ℓ(x) =z ℓ(x) +α w DiffMean(ℓ), where α controls steering strength. We use fixed steering coefficients for all test examples within each model setting, with α= 5.0 for LLaMA and α= 100.0 for Gemma. These values were chosen in preliminary experi- ments as conservative values that improved target-language identification, and wer...

work page
[11]

Decode back to dense space: ˆh′ ℓ(x) = Decoderℓ(z′ ℓ(x))

work page
[12]

The corrected activation ˜hℓ(x) is then passed to subsequent layers

Correct for reconstruction error by adding the residual: ˜hℓ(x) = ˆh′ ℓ(x)+ hℓ(x)−Decoder ℓ(zℓ(x)) . The corrected activation ˜hℓ(x) is then passed to subsequent layers. This procedure preserves the original activation outside the SAE subspace while applying a targeted intervention along the language direction. D Language Correlation and Intersection-Base...

work page

[1] [1]

Improving steering vectors by target- ing sparse autoencoder features.arXiv preprint arXiv:2411.02193. Tyler A. Chang, Zhuowen Tu, and Benjamin K. Bergen

work page arXiv

[2] [2]

InProceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing

The geometry of multilingual language model representations. InProceedings of the 2022 Con- ference on Empirical Methods in Natural Language Processing. Association for Computational Linguis- tics. Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, and Sean O’Brien

work page 2022

[3] [3]

InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop

Causal language control in multilingual trans- formers via sparse feature steering. InProceedings of the 63rd Annual Meeting of the Association for Com- putational Linguistics: Student Research Workshop. Association for Computational Linguistics. Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle- moyer, and Veselin Stoyanov. 2020. Emerging cross- lingual ...

work page 2020

[4] [4]

No Language Left Behind: Scaling Human-Centered Machine Translation

Association for Computational Linguistics. Marta R Costa-Jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, and 1 others. 2022. No language left behind: Scaling human-centered machine translation.arXiv preprint arXiv:2207.04672. Hoagy Cunningham, Aidan Ewart, Logan Rig...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, and Simon Oster- mann

Clas-bench: A cross-lingual alignment and steering benchmark.Preprint, arXiv:2601.08331. Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, and Simon Oster- mann. 2025. Language arithmetics: Towards system- atic language neuron identification and manipulation. InProceedings of the 14th International Joint Con- ferenc...

work page arXiv 2025

[6] [6]

Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov

Llama scope: Extracting millions of features from llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526. Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, and Tomas Mikolov

work page arXiv

[7] [7]

Fasttext.zip: Compressing text classification models.arXiv preprint arXiv:1612.03651. Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu-Tong Lau, Eoin Farrell, Callum Stuart Mcdougall, Kola Ay- onrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda. 2025. SAEBench: A comprehensive benchmark...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

Interpretable steering of large language models with feature guided activation additions.arXiv preprint arXiv:2501.09929,

Interpretable steering of large language mod- els with feature guided activation additions.arXiv preprint arXiv:2501.09929. Shaomu Tan, Di Wu, and Christof Monz. 2024. Neuron specialization: Leveraging intrinsic task modularity for multilingual machine translation. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing,...

work page arXiv 2024

[9] [9]

Encode the activation into sparse space: zℓ(x) = Encoderℓ(hℓ(x))

work page

[10] [10]

We use fixed steering coefficients for all test examples within each model setting, with α= 5.0 for LLaMA and α= 100.0 for Gemma

Apply the steering vector: z′ ℓ(x) =z ℓ(x) +α w DiffMean(ℓ), where α controls steering strength. We use fixed steering coefficients for all test examples within each model setting, with α= 5.0 for LLaMA and α= 100.0 for Gemma. These values were chosen in preliminary experi- ments as conservative values that improved target-language identification, and wer...

work page

[11] [11]

Decode back to dense space: ˆh′ ℓ(x) = Decoderℓ(z′ ℓ(x))

work page

[12] [12]

The corrected activation ˜hℓ(x) is then passed to subsequent layers

Correct for reconstruction error by adding the residual: ˜hℓ(x) = ˆh′ ℓ(x)+ hℓ(x)−Decoder ℓ(zℓ(x)) . The corrected activation ˜hℓ(x) is then passed to subsequent layers. This procedure preserves the original activation outside the SAE subspace while applying a targeted intervention along the language direction. D Language Correlation and Intersection-Base...

work page