hub

arXiv preprint arXiv:2402.17762 (2024) 18 J

Sun, M · 2024 · arXiv 2402.17762

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

cs.CL · 2026-05-08 · conditional · novelty 7.0 · 2 refs

Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.

When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

cs.CV · 2026-04-01 · unverdicted · novelty 7.0

Attention sinks in LVLM create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

cs.MM · 2026-05-11 · unverdicted · novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Attention Sinks in Diffusion Transformers: A Causal Analysis

cs.CV · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.

Taming Outlier Tokens in Diffusion Transformers

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.

Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing

cs.CR · 2026-04-27 · unverdicted · novelty 6.0

TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.

Graph-Guided Adaptive Channel Elimination for KV Cache Compression

eess.SP · 2026-04-18 · unverdicted · novelty 6.0

GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

cs.CL · 2025-05-10 · conditional · novelty 6.0

Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

cs.CL · 2024-06-04 · conditional · novelty 6.0

PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay

cs.CV · 2026-05-02 · unverdicted · novelty 5.0

Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.

OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering up to 1.78x speedup over W8A8 baselines.

Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation

cs.CL · 2026-04-03 · unverdicted · novelty 5.0

Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.

MiMo-V2-Flash Technical Report

cs.CL · 2026-01-06 · unverdicted · novelty 5.0

MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.

DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

citing papers explorer

Showing 17 of 17 citing papers.

A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models cs.CL · 2026-05-08 · conditional · none · ref 23 · 2 links
Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs cs.LG · 2026-04-22 · unverdicted · none · ref 39
Sink-Token-aware Pruning (SToP) suppresses semantically uninformative sink tokens during visual token pruning in Video LLMs, boosting fine-grained performance even at 90% pruning rates across hallucination, reasoning, and MCQA benchmarks.
When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models cs.CV · 2026-04-01 · unverdicted · none · ref 36
Attention sinks in LVLM create a global-vs-local trade-off that a layer-wise gating module can balance to improve multimodal benchmark performance.
Scaling and evaluating sparse autoencoders cs.LG · 2024-06-06 · unverdicted · none · ref 60
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination cs.MM · 2026-05-11 · unverdicted · none · ref 74
LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
Attention Sinks in Diffusion Transformers: A Causal Analysis cs.CV · 2026-05-10 · unverdicted · none · ref 12 · 2 links
Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.
Taming Outlier Tokens in Diffusion Transformers cs.CV · 2026-05-06 · unverdicted · none · ref 30
Outlier tokens in DiTs are addressed with Dual-Stage Registers, which reduce artifacts and improve image generation on ImageNet and text-to-image tasks.
Defusing the Trigger: Plug-and-Play Defense for Backdoored LLMs via Tail-Risk Intrinsic Geometric Smoothing cs.CR · 2026-04-27 · unverdicted · none · ref 38
TIGS detects backdoor-induced attention collapse in LLMs and applies content-aware tail-risk screening plus intrinsic geometric smoothing to suppress attacks while preserving normal performance.
Graph-Guided Adaptive Channel Elimination for KV Cache Compression eess.SP · 2026-04-18 · unverdicted · none · ref 25
GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free cs.CL · 2025-05-10 · conditional · none · ref 25
Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling cs.CL · 2024-06-04 · conditional · none · ref 19
PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory cs.AI · 2026-05-07 · unverdicted · none · ref 36
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
Colinearity Decay: Training Quantization-Friendly ViTs with Outlier Decay cs.CV · 2026-05-02 · unverdicted · none · ref 18
Colinearity-Decay regularizer trains ViTs that maintain or improve full-precision accuracy while delivering higher accuracy after low-bit quantization on ImageNet and COCO tasks.
OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension cs.LG · 2026-04-14 · unverdicted · none · ref 18
OSC separates token-persistent outlier channels in activations into a compact high-precision tensor for dual-path 4-bit GEMM computation, limiting accuracy loss to roughly 1-2 points on Qwen3 models while delivering up to 1.78x speedup over W8A8 baselines.
Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation cs.CL · 2026-04-03 · unverdicted · none · ref 12
Residual-stream noise injection raises narrative diversity in Arabic educational stories while preserving reading-grade level, outperforming high-temperature sampling across five 7-9B models.
MiMo-V2-Flash Technical Report cs.CL · 2026-01-06 · unverdicted · none · ref 44
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurposed MTP layers.
DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization cs.CV · 2026-04-20 · unverdicted · none · ref 17
DuQuant++ adapts outlier-aware fine-grained rotation to MXFP4 by matching block size to the 32-element microscaling group, enabling a single rotation that smooths distributions and achieves SOTA performance on LLaMA-3 with lower cost.

arXiv preprint arXiv:2402.17762 (2024) 18 J

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer