pith. machine review for the scientific record.

arxiv: 2605.05206 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Taming Outlier Tokens in Diffusion Transformers

Chen Wei, Liang-Chieh Chen, Tsu-Jui Fu, Xiaoyu Wu, Yifei Wang, Zhe Gan

Pith reviewed 2026-05-08 17:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords diffusion transformers · outlier tokens · dual-stage registers · image generation · vision transformers · token semantics · generative models · attention mechanisms

The pith

Dual-stage registers tame outlier tokens in diffusion transformers to reduce artifacts and improve image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how a small number of high-norm tokens in Diffusion Transformers attract excessive attention yet carry corrupted local patch information, appearing in both the pretrained encoder and the internal layers of the denoiser. Simple masking of these tokens fails to help, showing the issue lies in semantic corruption rather than norm magnitude alone. The authors introduce Dual-Stage Registers as a targeted fix: learned registers for the encoder when available, recursive test-time registers otherwise, and dedicated diffusion registers for the denoiser. This intervention consistently lowers outlier artifacts and raises output quality on ImageNet-scale and large text-to-image tasks. A reader would care because cleaner token dynamics could make transformer-based generators more stable without added compute.

Core claim

Outlier tokens emerge in both the encoder and denoiser of RAE-DiT pipelines; they reflect corrupted local patch semantics that masking does not resolve. Dual-Stage Registers correct this by inserting register tokens at the two stages—trained registers for the encoder when possible, recursive test-time registers when not, and diffusion-specific registers for the denoiser—yielding measurable reductions in artifacts and gains in generation quality.

What carries the argument

Dual-Stage Registers (DSR), a register-based intervention that supplies dedicated tokens to both the encoder and denoiser to restore local semantics.
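
To make the mechanism concrete, here is a minimal sketch of a register-augmented block stack, assuming a standard PyTorch transformer; names such as BlocksWithRegisters and num_registers are illustrative, not the authors' implementation. Learned register tokens are prepended to the patch sequence, participate in attention through every block, and are discarded before the output is consumed downstream.

```python
# Illustrative sketch only (assumed PyTorch shapes), not the paper's code.
import torch
import torch.nn as nn

class BlocksWithRegisters(nn.Module):
    def __init__(self, blocks: nn.ModuleList, dim: int, num_registers: int = 4):
        super().__init__()
        self.blocks = blocks  # the existing transformer blocks, unchanged
        # learned register tokens, shared across the batch
        self.registers = nn.Parameter(torch.randn(1, num_registers, dim) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_patches, dim)
        regs = self.registers.expand(tokens.shape[0], -1, -1)
        x = torch.cat([regs, tokens], dim=1)  # registers attend alongside patches
        for block in self.blocks:
            x = block(x)
        # discard register outputs: only patch tokens feed the next stage,
        # mirroring DSR's choice to drop encoder-side register outputs
        return x[:, self.registers.shape[1]:, :]
```

The dual-stage idea applies this pattern twice: once around the encoder (trained registers when the encoder can be tuned, recursive test-time registers when it cannot) and once inside the denoiser with its own diffusion registers.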

If this is right

  • Outlier artifacts decrease in the final generated images.
  • Generation quality rises on ImageNet benchmarks.
  • The same gains appear in large-scale text-to-image settings.
  • The fix applies uniformly to both pretrained ViT encoders and internal DiT layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attention layers in transformers may generally benefit from explicit register tokens to quarantine semantically damaged patches.
  • Similar register stages could be tested in non-diffusion transformer generators or in vision-language models.
  • If outlier control proves scalable, it might reduce the need for ever-larger DiT models to achieve clean outputs.

Load-bearing premise

That the performance drag comes from corrupted local patch semantics rather than raw norm extremes, and that registers can restore semantics without introducing new distortions or degrading other generation aspects.

What would settle it

Running the same DiT pipelines with Dual-Stage Registers applied and finding no change or a drop in standard image quality metrics such as FID, or discovering that high-norm masking suddenly improves results.
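
As a concrete version of this test, the sketch below scores baseline and DSR outputs with FID; it assumes the torchmetrics library and uses placeholder uint8 tensors, since the exact evaluation harness is not specified here.

```python
# Hedged sketch of the FID comparison (assumes torchmetrics is installed);
# the random tensors are placeholders for real references and model samples.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# (N, 3, H, W) uint8 images; substitute real ImageNet images and generations
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
generated = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated, real=False)
print(f"FID: {fid.compute().item():.2f}")  # compare baseline vs. DSR runs
```

No FID movement under DSR, or a sudden gain from plain norm masking, would undercut the corrupted-semantics story.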

Figures

Figures reproduced from arXiv: 2605.05206 by Chen Wei, Liang-Chieh Chen, Tsu-Jui Fu, Xiaoyu Wu, Yifei Wang, Zhe Gan.

Figure 1
Figure 1: Outlier tokens in ViT-based autoencoders. We visualize token-norm maps across the layers of the SigLIP2-B encoder. Severe high-norm tokens emerge in the last few layers: the penultimate layer shows the strongest outlier pattern, while the final output becomes somewhat more stable, potentially due to the reconstruction-related training objective in SigLIP2.
Figure 2
Figure 2: Outlier Tokens in Transformer-based Generators. We visualize token-norm maps of RAE-DiT with a SigLIP2-B encoder, across different diffusion noise scales and encoder layers. We find that high-norm outliers concentrate in the intermediate layers, while their severity decreases as the diffusion noise level increases. This pattern differs from prior observations in standard ViTs, where artifact tokens are typ…
Figure 3
Figure 3: Framework of our Dual-Stage Registers (DSR) method. Our DSR method patches both the vision encoder and the diffusion model with register tokens. The encoder uses a test-time register token, which is inserted only at inference time, while the diffusion model uses 36 trained register tokens, which are learned during diffusion training. During training, we discard the encoder-side register-token outputs befor…
Figure 4
Figure 4: Norm map comparison for validating trained encoder registers. We compare RAE-DiT (DINOv2-B) w/o and w/ trained encoder registers, at a fixed timestep t = 0.5. We find that introducing trained registers consistently suppresses high-norm token outliers and improves the quality of patch-level representations, which in turn leads to stronger downstream generation.
Figure 5
Figure 5: Norm map comparison across variants. We compare the baseline with two register-token configurations: adding test-time registers in the encoder only, and further adding trained registers in the diffusion model. We find that outliers in the norm map are suppressed only when both sources of outliers are addressed, i.e., when registers are applied to both the encoder and the diffusion model.
Figure 6
Figure 6: PCA map comparison across variants. We observe that adding test-time registers in the encoder yields a strong and visible improvement in the PCA map. Further adding trained registers in the diffusion model brings some additional improvements.
Figure 7
Figure 7: FID vs. epochs on IN-1K 256².
Figure 8
Figure 8: Two outlier sources in the norm distribution of SIGLIP2-So400. We compute the ℓ2 norm of SIGLIP2-So400 output features on 10k randomly selected images from the ImageNet-1K validation set. Left: the original norm distribution shows two separated outlier groups. Middle: applying our filtering once removes only one group, leaving the other largely intact. Right: we therefore use a recursive procedure that re-…
Figure 9
Figure 9: Quantitative measurements of outliers. We report the fraction of outlier tokens across layers under different setups. An outlier is defined as a token whose ℓ2 norm exceeds 2× the median token norm. To complement the qualitative visualization, we also provide a quantitative view of outlier behavior across layers. (This metric, together with the recursive filtering from Figure 8, is sketched in code after the figure list.)
Figure 10
Figure 10: Training loss comparison between Scale-RAE and DSR. Top: total loss. Bottom left: …
Figure 11
Figure 11: Step-to-step comparison between Scale-RAE and DSR on GenEval and DPG-Bench.
Figure 12
Figure 12: Scale-RAE baseline (top) vs. DSR (bottom) on DPG-Bench [12].
Figure 13
Figure 13: Scale-RAE baseline (top) vs. DSR (bottom) on DPG-Bench [12].
Figure 14
Figure 14: Scale-RAE baseline (top) vs. DSR (bottom) on DPG-Bench [12].
Figure 15
Figure 15: Scale-RAE baseline (top) vs. DSR (bottom) on DPG-Bench [12].
Figure 16
Figure 16: Scale-RAE baseline (top) vs. DSR (bottom) on GenEval [9].
Figure 17
Figure 17: Scale-RAE baseline (top) vs. DSR (bottom) on GenEval [9].
Figure 18
Figure 18: Scale-RAE baseline (top) vs. DSR (bottom) on GenEval [9].
Figure 19
Figure 19: Scale-RAE baseline (top) vs. DSR (bottom) on GenEval [9].
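
Figures 8 and 9 pin down two operational details worth restating in code: the outlier criterion (ℓ2 norm above 2× the median token norm) and the recursive filtering needed because a single pass removes only one of the two outlier groups. The sketch below is an editorial reconstruction from the captions; the function names and the reuse of the 2× threshold for filtering are assumptions.

```python
# Reconstructed from the Figure 8/9 captions; not the authors' code.
import torch

def outlier_fraction(tokens: torch.Tensor, factor: float = 2.0) -> float:
    """Fraction of tokens whose L2 norm exceeds factor x the median norm (Fig. 9)."""
    norms = tokens.norm(dim=-1)  # (..., num_tokens)
    return (norms > factor * norms.median()).float().mean().item()

def recursive_keep_mask(norms: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Repeatedly drop high-norm tokens and recompute the median over the
    survivors until none exceeds the threshold (cf. Fig. 8, where one pass
    removes only one of the two outlier groups)."""
    keep = torch.ones_like(norms, dtype=torch.bool)
    while True:
        median = norms[keep].median()
        new_outliers = keep & (norms > factor * median)
        if not new_outliers.any():
            return keep
        keep &= ~new_outliers
```
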
read the original abstract

We study outlier tokens in Diffusion Transformers (DiTs) for image generation. Prior work has shown that Vision Transformers (ViTs) can produce a small number of high-norm tokens that attract disproportionate attention while carrying limited local information, but their role in generative models remains underexplored. We show that this phenomenon appears in both the encoder and denoiser of modern Representation Autoencoder (RAE)-DiT pipelines: pretrained ViT encoders can produce outlier representations, and DiTs themselves can develop internal outlier tokens, especially in intermediate layers. Moreover, simply masking high-norm tokens does not improve performance, indicating that the problem is not only caused by a few extreme values, but is more closely related to corrupted local patch semantics. To address this issue, we introduce Dual-Stage Registers (DSR), a register-based intervention for both components: trained registers when available, recursive test-time registers otherwise, and diffusion registers for the denoiser. Across ImageNet and large-scale text-to-image generation, these interventions consistently reduce outlier artifacts and improve generation quality. Our results highlight outlier-token control as an important ingredient in building stronger DiTs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines outlier tokens (high-norm tokens attracting disproportionate attention with limited local information) in Diffusion Transformers (DiTs) for image generation. It observes their presence in both pretrained ViT encoders and internal DiT layers, argues that the issue stems from corrupted local patch semantics rather than merely extreme norm values (because simply masking high-norm tokens does not improve performance), and proposes Dual-Stage Registers (DSR) consisting of trained registers, recursive test-time registers, and diffusion registers. The central claim is that these interventions reduce outlier artifacts and improve generation quality across ImageNet and large-scale text-to-image tasks.

Significance. If the empirical claims hold with proper controls and metrics, the work could offer a practical, register-based technique for stabilizing DiT training and inference, addressing a previously underexplored artifact in generative transformers. The multi-stage application (encoder and denoiser) is a reasonable extension of prior register ideas from ViTs.

major comments (2)
  1. [Abstract] The claim that 'simply masking high-norm tokens does not improve performance' is used to conclude that the root cause is corrupted local patch semantics rather than extreme norms. However, masking removes the token's full contribution to attention and residuals, which can disrupt global context independently of norm magnitude; without a control that preserves token presence while clamping or normalizing norms, this does not cleanly isolate semantics from magnitude effects (the two probes are sketched in code below).
  2. [Abstract] The assertion of 'consistent' improvements in outlier reduction and generation quality on ImageNet and text-to-image tasks lacks any quantitative support (FID, CLIP scores, artifact counts, baselines, error bars, or ablation tables). This leaves the central empirical claim, which is load-bearing for the paper's contribution, unverifiable from the provided text.
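
To make the requested control concrete, here is a sketch of the two probes with hypothetical helper names: hard masking removes an outlier token's entire contribution, while clamping keeps the token's presence and direction and only caps its magnitude.

```python
# Hypothetical probes (not from the paper): masking vs. norm clamping.
import torch

def mask_outliers(tokens: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Zero out high-norm tokens entirely (the probe the paper reports)."""
    norms = tokens.norm(dim=-1, keepdim=True)
    return torch.where(norms > factor * norms.median(),
                       torch.zeros_like(tokens), tokens)

def clamp_outliers(tokens: torch.Tensor, factor: float = 2.0) -> torch.Tensor:
    """Rescale high-norm tokens down to the threshold, preserving their
    direction (the control the referee asks for)."""
    norms = tokens.norm(dim=-1, keepdim=True)
    cap = factor * norms.median()
    return tokens * (cap / (norms + 1e-6)).clamp(max=1.0)
```

If clamping recovers performance where masking does not, magnitude is the culprit; if neither helps, the corrupted-semantics reading gains support.
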
minor comments (2)
  1. [Abstract] The acronym DSR is introduced in the abstract without prior expansion.
  2. Notation for 'registers' (trained vs. test-time vs. diffusion) is used without a clear upfront definition or diagram of their placement in the pipeline.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on our manuscript. We address each major comment below with clarifications from the full paper and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: The claim that 'simply masking high-norm tokens does not improve performance' is used to conclude that the root cause is corrupted local patch semantics rather than extreme norms. However, masking removes the token's full contribution to attention and residuals, which can disrupt global context independently of norm magnitude; without a control that preserves token presence while clamping or normalizing norms, this does not cleanly isolate semantics from magnitude effects.

    Authors: We appreciate this observation on the limitations of the masking probe. The experiment was designed to show that removing high-norm tokens entirely fails to resolve the observed artifacts and performance issues, which we interpret as evidence that the problem extends beyond isolated extreme values to the underlying corrupted patch semantics. We agree that a cleaner control—such as clamping or normalizing the norms of these tokens while preserving their presence, attention contributions, and residual connections—would better isolate magnitude from semantic effects. In the revised manuscript we will add this ablation experiment (with quantitative results on both ImageNet and text-to-image settings) to strengthen the causal argument. revision: yes

  2. Referee: The assertion of 'consistent' improvements in outlier reduction and generation quality on ImageNet and text-to-image tasks lacks any quantitative support (FID, CLIP scores, artifact counts, baselines, error bars, or ablation tables). This leaves the central empirical claim, which is load-bearing for the paper's contribution, unverifiable from the provided text.

    Authors: The abstract is intentionally concise and summarizes results whose details appear in the full manuscript. Sections 4 and 5 present the supporting quantitative evidence, including FID scores on ImageNet, CLIP scores on text-to-image benchmarks, artifact counts, comparisons against baselines, ablation tables, and error bars across multiple runs. To make the central claims more immediately verifiable, we will revise the abstract to include representative quantitative highlights (e.g., key FID and CLIP deltas) while respecting length constraints. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical observations and interventions

full rationale

The paper reports observations of high-norm tokens in ViT encoders and DiT denoisers, states that masking them fails to improve performance (directly measured), and introduces DSR registers as an empirical fix that is shown to reduce artifacts on ImageNet and text-to-image tasks. No equations, fitted parameters, predictions, or derivations appear. The interpretive claim linking outliers to 'corrupted local patch semantics' follows from the masking result but does not reduce any quantity to itself by construction. No self-citations are visible or load-bearing for the central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the named method itself; the full paper would be needed to audit these.

invented entities (1)
  • Dual-Stage Registers (DSR) no independent evidence
    purpose: Intervention to tame outlier tokens in encoder and denoiser
    New technique introduced to address the diagnosed problem; no independent evidence or falsifiable prediction supplied in abstract.

pith-pipeline@v0.9.0 · 5511 in / 1071 out tokens · 53088 ms · 2026-05-08T17:15:56.181249+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 18 canonical work pages · 10 internal anchors

  1. [1]

    Latent forcing: Reordering the diffusion trajectory for pixel-space image generation

    Alan Baade, Eric Ryan Chan, Kyle Sargent, Changan Chen, Justin Johnson, Ehsan Adeli, and Li Fei-Fei. Latent forcing: Reordering the diffusion trajectory for pixel-space image generation. arXiv preprint arXiv:2602.11401, 2026

  2. [2]

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024. GitHub repository

  3. [3]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  4. [4]

    When vision transformers outperform resnets without pre-training or strong data augmentations

    Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In International Conference on Learning Representations, 2022

  5. [5]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  6. [6]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  8. [8]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  9. [9]

    GenEval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023

  10. [10]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  11. [11]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  12. [12]

    ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

  13. [13]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip

  14. [14]

    Vision transformers don’t need trained registers

    Nick Jiang, Amil Dravid, Alexei Efros, and Yossi Gandelsman. Vision transformers don’t need trained registers. arXiv preprint arXiv:2506.08010, 2025

  15. [15]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  16. [16]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025

  17. [17]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

  18. [18]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  19. [19]

    Transfer between modalities with metaqueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  20. [20]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  21. [21]

    A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training

    Zihan Qiu, Zeyu Huang, Kaiyue Wen, Peng Jin, Bo Zheng, Yuxin Zhou, Haofeng Huang, Zekun Wang, Xiao Li, Huaqing Zhang, et al. A unified view of attention and residual sinks: Outlier-driven rescaling is essential for transformer training. arXiv preprint arXiv:2601.22966, 2026

  22. [22]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  23. [23]

    Do vision transformers see like convolutional neural networks?

    Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in neural information processing systems, 34:12116–12128, 2021

  24. [24]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  25. [25]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025

  26. [26]

    What matters for representation alignment: Global information or spatial structure?

    Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025

  27. [27]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015

  28. [28]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019

  29. [29]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020

  30. [30]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

  31. [31]

    Scaling text-to-image diffusion transformers with representation autoencoders

    Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026

  32. [32]

    Training data-efficient image transformers & distillation through attention

    Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021

  33. [33]

    DeiT III: Revenge of the ViT

    Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit iii: Revenge of the vit. In European conference on computer vision, pages 516–533. Springer, 2022

  34. [34]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  35. [35]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024

  36. [36]

    Interpreting the repeated token phenomenon in large language models

    Itay Yona, Ilia Shumailov, Jamie Hayes, Federico Barbero, and Yossi Gandelsman. Interpreting the repeated token phenomenon in large language models. arXiv preprint arXiv:2503.08908, 2025

  37. [37]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024

  38. [38]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  39. [39]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025