arXiv preprint arXiv:2510.06477

Enrique Queipo-de Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv · 2025 · arXiv 2510.06477

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 1 support 1

representative citing papers

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Sinks are equivalent to hard attention switches that zero out outputs and are cheaper than diagonal patterns when self-communication is allowed, closing the gap between oversmoothing prevention needs and what sinks provide.

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

RouteHijack is a routing-aware jailbreak that identifies safety-critical experts via activation contrast and optimizes suffixes to suppress them, reaching 69.3% average attack success rate on seven MoE LLMs with strong transfer to variants and VLMs.

A Mechanistic Analysis of Looped Reasoning Language Models

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Looped LLMs converge to distinct cyclic fixed points per layer, repeating feedforward-style inference stages across recurrences.

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

DynamicPTQ uses new metrics of residual-stream dynamics to apply 8-bit activation precision only to quantization-sensitive layers in W4A4KV4 LLM inference, improving perplexity and QA performance over static smoothing baselines.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 6.0

DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.

Uncovering the Latent Potential of Deep Intermediate Representations

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Introduces LOES, a constructive spectral method to select task-discriminative subspaces from intermediate layer embeddings, and GeoReg for enforcing simplicial class geometry during fine-tuning, with reported gains increasing with model depth across modalities.

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.

SLASH the Sink: Sharpening Structural Attention Inside LLMs

cs.AI · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

SLASH is a plug-and-play attention redistribution technique that counters attention sinks to enhance LLMs' intrinsic graph topology reconstruction without any training or fine-tuning.

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.

When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

cs.CL · 2026-04-13 · unverdicted · novelty 5.0

Non-closing truth recursion prompts destabilize LLM attention matrices with large effect sizes, unlike grounded self-reference or factual controls, and increase contradictory model outputs.

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01

citing papers explorer

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · none · ref 14 · 2 links

Sink vs. diagonal patterns as mechanisms for attention switch and oversmoothing prevention

cs.LG · 2026-05-08 · unverdicted · none · ref 28

RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

cs.LG · 2026-05-01 · unverdicted · none · ref 46

A Mechanistic Analysis of Looped Reasoning Language Models

cs.LG · 2026-04-13 · unverdicted · none · ref 24

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

cs.LG · 2026-06-10 · unverdicted · none · ref 22

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

cs.CV · 2026-06-06 · unverdicted · none · ref 1

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

cs.LG · 2026-05-29 · unverdicted · none · ref 83

Uncovering the Latent Potential of Deep Intermediate Representations

cs.LG · 2026-05-21 · unverdicted · none · ref 5

When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

cs.AI · 2026-05-13 · unverdicted · none · ref 35

SLASH the Sink: Sharpening Structural Attention Inside LLMs

cs.AI · 2026-05-11 · unverdicted · none · ref 22 · 3 links

Do Transformers Use their Depth Adaptively? Evidence from a Relational Reasoning Task

cs.LG · 2026-04-14 · unverdicted · none · ref 20

When Self-Reference Fails to Close: Matrix-Level Dynamics in Large Language Models

cs.CL · 2026-04-13 · unverdicted · none · ref 15

Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

cs.CL · 2026-02-01 · unreviewed · ref 10
