Sense Representations Are Inducible Interfaces

Alham Fikri Aji; Jan Christian Blaise Cruz

arxiv: 2605.28669 · v1 · pith:4NXFMTSWnew · submitted 2026-05-27 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 12:32 UTC · model grok-4.3

keywords: sense representations, inducible interfaces, gated residual addition, word-sense disambiguation, lexical steering, cross-lingual adaptation, pretrained language models, ACROS

The pith

A gated residual addition induces sense representations as an explicit interface on top of a frozen pretrained decoder language model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ACROS, which adds an explicit sense pathway to a frozen pretrained decoder language model through a gated residual addition. The resulting sense variables support zero-shot word-sense disambiguation, lexical steering, and cross-lingual adaptation while preserving the base model's original performance. A sympathetic reader would care because sense representations no longer have to be baked in during pretraining; they can be induced on top of ordinary models. The paper demonstrates this on SmolLM2-360M across the three tasks without task-specific changes.

Core claim

ACROS induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94).
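The "low-KL lexical steering" part of the claim can be made concrete with a small sketch. It assumes one plausible reading of the scoring: the shift is the change in frequency-weighted probability mass on human substitutes, and the drift is KL(base || steered) over next-token distributions. The function names, the KL direction, and all numbers below are illustrative assumptions, not taken from the paper.

```python
import math

def substitute_mass(probs, substitute_counts):
    """Frequency-weighted probability mass placed on human substitutes."""
    total = sum(substitute_counts.values())
    if total == 0:
        return 0.0
    return sum((count / total) * probs.get(token, 0.0)
               for token, count in substitute_counts.items())

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid log(0)."""
    return sum(pv * math.log(pv / max(q.get(token, 0.0), eps))
               for token, pv in p.items() if pv > 0.0)

# Toy next-token distributions (hypothetical numbers, not from the paper).
base = {"bright": 0.10, "smart": 0.05, "the": 0.85}
steered = {"bright": 0.12, "smart": 0.06, "the": 0.82}
# Hypothetical annotator counts for each substitute.
substitutes = {"bright": 3, "smart": 1}

# Positive delta means the edit moved mass toward the human substitutes.
delta = substitute_mass(steered, substitutes) - substitute_mass(base, substitutes)
# Small drift means the rest of the distribution was left largely intact.
drift = kl_divergence(base, steered)
```

Under this reading, a steering edit counts as a success when `delta` is positive while `drift` stays small.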

What carries the argument

The gated residual addition that injects controlled sense variables into the hidden states of the frozen decoder LM to create an explicit sense pathway.
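A minimal sketch of what a gated residual addition looks like, assuming a single scalar gate passed through a sigmoid. The review does not give the paper's exact gate parameterization, so `gate_logit`, the scalar form, and the toy vectors are assumptions for illustration only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gated_residual_add(base_hidden, sense_mix, gate_logit):
    """Return base_hidden + g * sense_mix, with g = sigmoid(gate_logit)."""
    if len(base_hidden) != len(sense_mix):
        raise ValueError("hidden state and sense mix must have equal length")
    g = sigmoid(gate_logit)
    return [h + g * s for h, s in zip(base_hidden, sense_mix)]

base = [0.5, -1.0, 2.0]    # stand-in for a frozen backbone hidden state
sense = [0.2, 0.4, -0.6]   # stand-in for the mixed sense vector

# gate_logit = 0 gives g = 0.5, so half of the sense vector is added.
half_open = gated_residual_add(base, sense, gate_logit=0.0)
# A very negative logit closes the gate and leaves the base state unchanged
# up to floating-point precision.
nearly_closed = gated_residual_add(base, sense, gate_logit=-30.0)
```

The design point this illustrates is that a closed gate recovers the frozen model's hidden state, which is one way an added pathway could avoid degrading base quality.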

If this is right

Sense representations become available as an add-on for any existing pretrained decoder LM without retraining the base model.
The same induced sense variables support multiple downstream uses without requiring oracle information or architecture changes per task.
Base model quality on its original tasks remains intact after the addition.
Sense structure can be induced rather than requiring models to be pretrained with it baked in.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might scale to larger models to test whether the gated addition remains stable at higher parameter counts.
Similar gated additions could potentially induce other modular pathways such as syntactic or entity representations.
This suggests sense representations function as a general interface layer rather than a pretraining-only feature.
The approach opens the possibility of retrofitting sense capabilities to already deployed language models.

Load-bearing premise

That a gated residual addition trained on top of a frozen model produces sense variables faithful to the original model's knowledge and useful across disambiguation, steering, and adaptation tasks.

What would settle it

The induced sense variables show no improvement on zero-shot word-sense disambiguation beyond the base model or the WordNet first-sense heuristic when evaluated on the Raganato ALL benchmark.

Figures

Figures reproduced from arXiv: 2605.28669 by Alham Fikri Aji, Jan Christian Blaise Cruz.

**Figure 1.** ACROS unifies measurement, intervention, and alignment in one residual sense interface designed for pretrained decoder LMs. Existing sense-based capabilities were previously tied to different architectures, with no single interface that supports all three. ACROS inserts a residual sense pathway so the same induced variables can be read, steered, and aligned.

**Figure 2.** Conversion fails because practical sense counts cannot span the hidden-state space the LM already uses. On SmolLM2-360M final-layer states, practical sense counts (K=8, 32, 64) retain far less variance than the ranks needed for 95% or 99% coverage, explaining why Backpack conversion harms prediction quality.

**Figure 3.** ACROS keeps the original LM intact and learns a gated residual sense path beside it. Token embeddings feed both the frozen backbone (producing base hidden state Bq) and a trainable sense MLP (producing K sense vectors per token). A contextualization layer computes per-sense Q/K attention weights from the backbone's hidden states, which are multiplied with the sense vectors and summed to form the sense mix…

**Figure 4.** For measurement, ACROS matches contextual target activations to WordNet gloss activations. Each candidate WordNet sense is encoded as a "lemma: gloss" prompt and compared against the contextual target activation in the same interface that later supports steering and alignment. Methodology: we evaluate on Raganato ALL (Raganato et al., 2017), using WordNet only at evaluation time. ACROS scores each candidat…

**Figure 5.** For intervention, ACROS tests whether a sense edit raises human substitute mass while preserving the base distribution. We score interventions by how much they increase frequency-weighted probability mass on human CoInCo substitutes while keeping KL to the original next-token distribution low.

| Selector | Delta | Succ. (%) | KL |
| --- | --- | --- | --- |
| target-best† | +1.64e−3 | 100.0 | 5.69e−7 |
| self top-k | +1.11e−3 | 89.8 | 7.12e−7 |
| contrib.-norm | +… | | |

**Figure 6.** ACROS-induced senses provide strong alignment anchors, allowing target-language summarization after SENSIA adaptation. Base SmolLM2 fails to generate coherent summaries before adaptation. ACROS+SENSIA produces a coherent target-language summary with low source copying (Copy-4g 0.107) after SENSIA adaptation. This contrasts with Gemma, which is pretrained with a multilingual mix but copies almost entirely…
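The measurement setup in the Figure 4 caption, where a contextual target activation is matched against gloss activations for each candidate sense, can be sketched as a nearest-neighbour lookup. The caption is truncated before the scoring rule, so cosine similarity is an assumption here, and the sense keys and vectors are made-up toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_sense(target_activation, gloss_activations):
    """Return the sense key whose gloss activation best matches the target."""
    return max(gloss_activations,
               key=lambda key: cosine(target_activation, gloss_activations[key]))

# Hypothetical activations for two candidate senses of "bank".
gloss_activations = {
    "bank%river": [0.9, 0.1, 0.0],
    "bank%finance": [0.1, 0.9, 0.2],
}
# Hypothetical activation of "bank" in a sentence about deposits.
target_activation = [0.2, 0.8, 0.1]

predicted = pick_sense(target_activation, gloss_activations)
```

In this toy case the financial sense wins because its gloss vector points in nearly the same direction as the target activation.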

Original abstract

Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). ACROS makes sense representations an inducible interface for ordinary pretrained LMs.
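For readers unfamiliar with the R@1 figure quoted in the abstract, the sketch below shows a common way to compute recall at 1 for paired retrieval. It assumes index-aligned source and target vectors and cosine similarity; the paper's actual retrieval protocol is not described in this review, and the vectors are toy values.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recall_at_1(source_vecs, target_vecs):
    """Fraction of sources whose most similar target shares their index."""
    hits = 0
    for i, src in enumerate(source_vecs):
        best = max(range(len(target_vecs)),
                   key=lambda j: cosine(src, target_vecs[j]))
        hits += int(best == i)
    return hits / len(source_vecs)

sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
# Swapping the last two targets breaks two of the three pairings.
shuffled_targets = [[1.0, 0.0], [0.2, 1.0], [0.0, 1.0]]

r1_aligned = recall_at_1(sources, aligned_targets)
r1_shuffled = recall_at_1(sources, shuffled_targets)
```

A mean R@1 near 1.0 would mean that almost every source item retrieves its own counterpart as the top match.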

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ACROS adds a gated residual to a frozen decoder LM to create usable sense variables for WSD, steering, and cross-lingual work while keeping base quality intact.

The core claim is that a single gated residual addition, trained once on top of a frozen SmolLM2-360M, produces sense variables that serve three tasks: zero-shot disambiguation (64.95 F1 on Raganato ALL), lexical steering on CoInCo (a non-oracle proxy recovers about 90% of positive shifts), and SENSIA cross-lingual adaptation (mean R@1 0.988, target FLORES PPL 7.94). Base LM quality is reported as preserved.

What is new is the post-pretraining induction step itself. Earlier sense work usually baked the structure into the original training objective; this treats sense as an add-on interface. The paper shows the same induced variables handle disambiguation, steering, and adaptation without further architectural changes or oracle data.

The results are presented cleanly against simple baselines like the WordNet first-sense heuristic. That the method runs on an ordinary decoder and reports no degradation in the base model is the practical point.

The soft spot is the lack of detail on the gate's training objective and on diagnostics that would show the sense variables are actually decomposing the original representations rather than learning a correlated auxiliary pathway. The abstract numbers are specific, but without the methods section it is difficult to judge whether the three-task success is robust or tied to particular choices in data and optimization.

This is for people who want lightweight control over existing decoder LMs rather than retraining from scratch. A reader working on steering or disambiguation would find the setup worth examining.

It deserves peer review because the central idea is testable and the reported numbers are concrete enough to check.

Referee Report

2 major / 2 minor

Summary. The paper introduces ACROS, a method that induces explicit sense representations into a frozen pretrained decoder LM (SmolLM2-360M) via a single gated residual addition. The induced sense variables are shown to support three tasks without task-specific changes or oracle information: zero-shot WSD (64.95 F1 on Raganato ALL), lexical steering (∼90% recovery of positive shifts on 5,161 CoInCo cases via a non-oracle proxy), and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94), while preserving base LM quality.

Significance. If the central claim holds, ACROS would demonstrate that sense representations can be added as an inducible interface to ordinary pretrained LMs, enabling faithful and reusable sense variables for disambiguation, steering, and alignment without retraining or architectural overhaul. This would be a notable contribution to interpretability and controllable generation.

major comments (2)

[Abstract] Abstract: the reported metrics (64.95 F1, 90% recovery, R@1 0.988) are presented without any description of the training objective for the gate, the precise definition of the sense variables, data splits, or controls that would distinguish a faithful decomposition from an auxiliary pathway that merely correlates with the evaluation metrics.
[Abstract] Abstract and § (methods): the claim that the gated residual addition produces sense variables that are both faithful to the base model's knowledge and directly usable zero-shot across tasks rests on the untested assumption that a single trained gate suffices; no diagnostic (e.g., ablation of the gate, comparison to random residuals, or internal activation analysis) is referenced to support this.
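One of the diagnostics named above, comparison to random residuals, could be set up as follows. This is a sketch of a norm-matched random control, assuming the learned sense residual is available as a plain vector; the helper names are hypothetical and nothing here reflects an experiment reported in the paper.

```python
import math
import random

def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def norm_matched_random_residual(sense_residual, rng):
    """Gaussian direction rescaled to the same L2 norm as sense_residual."""
    noise = [rng.gauss(0.0, 1.0) for _ in sense_residual]
    scale = l2_norm(sense_residual) / l2_norm(noise)
    return [scale * x for x in noise]

rng = random.Random(0)                    # fixed seed for repeatability
sense_residual = [0.3, -0.1, 0.25, 0.05]  # stand-in for a learned residual
control = norm_matched_random_residual(sense_residual, rng)
```

Adding `control` instead of the learned residual, with the same gate value, separates the effect of the residual's direction from the effect of its magnitude: if downstream metrics were unchanged under the control, the learned direction would not be doing the work.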

minor comments (2)

[Abstract] The abstract states results on SmolLM2-360M but refers to 'ordinary pretrained LMs'; clarify the scope of generalization and whether the method was tested on additional model families or sizes.
[Abstract] Minor notation: the term 'gated residual addition' is used without an equation or diagram in the provided abstract; a formal definition would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where the abstract and methods could be strengthened with additional detail and diagnostics. We respond to each point below and will incorporate revisions to improve clarity without altering the core claims.

Referee: [Abstract] Abstract: the reported metrics (64.95 F1, 90% recovery, R@1 0.988) are presented without any description of the training objective for the gate, the precise definition of the sense variables, data splits, or controls that would distinguish a faithful decomposition from an auxiliary pathway that merely correlates with the evaluation metrics.

Authors: The abstract is space-constrained and therefore omits these specifics, which are instead provided in the methods section. The gate is trained with a supervised objective on sense-annotated data to minimize divergence from gold sense labels while keeping the base LM frozen. Sense variables are explicitly the per-token residual vectors added via the gate. Data splits follow the standard Raganato ALL for WSD, the full CoInCo set for steering, and SENSIA for cross-lingual evaluation. Controls include direct comparison against the unmodified base model on all three tasks. We will revise the abstract to include a one-sentence summary of the objective and variable definition, and expand the methods with an explicit statement of the controls. revision: yes
Referee: [Abstract] Abstract and § (methods): the claim that the gated residual addition produces sense variables that are both faithful to the base model's knowledge and directly usable zero-shot across tasks rests on the untested assumption that a single trained gate suffices; no diagnostic (e.g., ablation of the gate, comparison to random residuals, or internal activation analysis) is referenced to support this.

Authors: The multi-task results (zero-shot WSD, lexical steering, and cross-lingual adaptation) using the identical induced variables without any task-specific retraining provide indirect support for the claim. Nevertheless, the referee is correct that explicit diagnostics such as gate ablation, random-residual baselines, and activation analysis are not currently reported. We will add these experiments in the revision to directly test whether the gate produces a faithful decomposition rather than an auxiliary correlation. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes ACROS as an external gated residual addition trained on a frozen pretrained decoder LM to induce sense variables. No equations, fitted parameters, or predictions are presented that reduce reported performance (WSD F1, lexical steering KL, cross-lingual R@1) to a self-definition or re-expression of the base model's internals. The method is framed as an additive interface rather than a decomposition derived from the original representations. No self-citation chains or ansatzes are invoked as load-bearing uniqueness theorems. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of free parameters, axioms, or invented entities; the gated residual addition is presented as the core mechanism but its internal parameterization is not detailed.

pith-pipeline@v0.9.1-grok · 5684 in / 1092 out tokens · 29100 ms · 2026-06-29T12:32:37.653930+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 2 internal anchors

[1]

Chalnev, S., Siu, M., and Conmy, A

Improving steering vectors by targeting sparse autoencoder features. Preprint, arXiv:2411.02193. Jan Christian Blaise Cruz, David Ifeoluwa Adelani, and Alham Fikri Aji. 2026. Multilinguality as sense adaptation. Preprint, arXiv:2601.10310. Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar

work page arXiv 2026
[2]

Gemma 3 Technical Report

Steering off course: Reliability challenges in steering language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19856–19882, Vienna, Austria. Association for Computational Linguistics. Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Ja...

work page internal anchor Pith review Pith/arXiv arXiv 2020
[3]

all-words

Model editing with canonical examples. Preprint, arXiv:2402.06155. John Hewitt, John Thickstun, Christopher Manning, and Percy Liang. 2023. Backpack language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9103–9125, Toronto, Canada. Association for Computational Linguisti...

work page arXiv 2023
[4]

Steering Language Models With Activation Engineering

Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Daniel Loureiro, Alípio Mário Jorge, and Jose Camacho-Collados. 2022. Lmms reloaded: Transformer-based se...

work page internal anchor Pith review Pith/arXiv arXiv 2022
