pith. sign in

arxiv: 2605.15193 · v1 · pith:EEWLEKNSnew · submitted 2026-05-14 · 💻 cs.CV

Aligning Latent Geometry for Spherical Flow Matching in Image Generation

Pith reviewed 2026-06-30 20:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords flow matchinglatent geometryspherical interpolationimage generationVAE latentsradial projectionImageNet FID
0
0 comments X

The pith

Projecting VAE latents to a fixed radius and replacing linear paths with spherical interpolation improves class-conditional image generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that latent tokens from image VAEs lie on thin spherical shells where direction encodes most perceptual and semantic information while radius contributes little. It therefore projects both real latents and noise to a common radius, substitutes spherical linear interpolation for Euclidean paths, and keeps the diffusion network unchanged. The resulting training trajectories stay on the sphere with purely angular velocity targets. This produces measurable FID gains on ImageNet-256 across tokenizers with no extra encoders or alignment losses. A reader cares because the change is local to the interpolation schedule and exploits the geometry already present in standard latent spaces.

Core claim

Decomposing each latent token into radial and angular parts reveals through component-swap probes that decoded content is carried predominantly by direction. Projecting data latents onto a fixed token radius, taking the radial projection of Gaussian noise as the spherical prior, finetuning the decoder with the encoder frozen, and switching to spherical linear interpolation keeps every point on the sphere and yields purely angular velocity targets. Matched training then improves class-conditional ImageNet-256 FID across multiple image tokenizers without altering the diffusion architecture or introducing auxiliary objectives.

What carries the argument

Radial-angular decomposition of latent tokens, followed by fixed-radius projection and spherical linear interpolation.

If this is right

  • Geodesic paths remain on the sphere at every timestep by construction.
  • Velocity targets become purely angular, removing any radial component from the learning signal.
  • FID scores improve on class-conditional ImageNet-256 for multiple tokenizers under matched training.
  • No changes to the diffusion architecture or addition of auxiliary losses are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If angular dominance holds in other autoencoder spaces, the same projection-plus-slerp recipe could be tested on video or audio generation models.
  • One could measure whether permitting controlled radius jitter at inference time increases sample diversity without harming the observed FID gains.
  • Direction-only regularization derived from the same decomposition might simplify contrastive or reconstruction objectives in representation learning.

Load-bearing premise

The component-swap probes show that direction, not radius, carries nearly all perceptual and semantic content in the decoded images.

What would settle it

A controlled experiment that swaps only the radii of paired latent tokens while preserving directions and then measures large drops in semantic consistency or perceptual quality of the reconstructions would falsify the premise.

Figures

Figures reproduced from arXiv: 2605.15193 by Adil Kaan Akan, Hidir Yesiltepe, Kaan Oktay, Pinar Yanardag, Tuna Han Salih Meral.

Figure 1
Figure 1. Figure 1: Latent flow matching ignores the geometry of VAE latents. (a) Linear latent flow matching connects Gaussian noise to a VAE latent with a straight line. Although both endpoints concentrate on thin shells, the line passes through interior radii rarely occupied by either endpoint. (b) We project data latents and sample noise on a shared fixed-radius sphere, then train along the spherical arc, or slerp. (c) By… view at source ↗
Figure 2
Figure 2. Figure 2: Linear paths can dip away from the endpoint shells; shell paths interpolate token radii; [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Angular/radial decoder sensitivity. Swapping radius with a same-class neighbor preserves [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Radial share of the flow-matching velocity target, computed in each tokenizer’s flow [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Shell-decomposed path: radius and direction inter￾polated separately. Shell path. Without requiring the spherical VAE constraint, each endpoint is decomposed into a direction and a magnitude, z = r zˆ with zˆ = z/∥z∥ and r = ∥z∥, and the two are interpolated separately ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: FID-50K at CFG= 1.0 for the latent-support and transport-path ablation on SiT-B/2 with the FLUX.2 VAE. Spherical-slerp reaches FID= 30 in about 2.2× fewer train￾ing steps than vanilla-linear and continues to improve. The matched-protocol comparison in Sec. 4 fixes the training budget at 80 epochs to control compute across tok￾enizers and scales. To confirm that the spherical-slerp advantage is not spe￾cifi… view at source ↗
Figure 7
Figure 7. Figure 7: Angular/radial decoder sensitivity, population-mean substitute. For each tokenizer, markers [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-sample direction vs. radius sensitivity, population-mean substitute. Each point is one [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Per-sample direction vs. radius sensitivity, same-class partner substitute. Axes as in Fig. 8; [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Class-conditional samples from spherical-slerp SiT-XL/2 with the FLUX.2 tokenizer. [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: FLUX.2 same-class component swaps. Columns: anchor (Original 1), keep-direction [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: VA-VAE same-class component swaps; columns as in Fig. 11. [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: REPA-E FLUX.1 same-class component swaps; columns as in Fig. 11. [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: FLUX.2 reconstructions. Columns: original image, original (Vanilla) decoder, matched [PITH_FULL_IMAGE:figures/full_fig_p024_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: VA-VAE reconstructions; columns and rows as in Fig. 14. [PITH_FULL_IMAGE:figures/full_fig_p025_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: REPA-E FLUX.1 reconstructions; columns and rows as in Fig. 14. [PITH_FULL_IMAGE:figures/full_fig_p026_16.png] view at source ↗
read the original abstract

Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes aligning latent geometry for spherical flow matching in image generation. It decomposes each latent token into radial and angular components and uses component-swap probes to argue that decoded perceptual and semantic content is carried predominantly by direction (with radius contributing much less). Based on this, the authors project data latents onto a fixed token radius, adopt the radial projection of Gaussian noise as a spherical prior, finetune the decoder (encoder frozen), and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths remain on the sphere at every timestep with purely angular velocity targets. Under matched training, the method is reported to improve class-conditional ImageNet-256 FID across different image tokenizers while leaving the diffusion architecture unchanged and requiring no auxiliary encoder or representation-alignment objective.

Significance. If the reported FID gains hold under the stated controls and the geometric justification is robust, the work offers a lightweight, architecture-preserving modification to latent flow matching that respects the spherical concentration of VAE latents. The absence of auxiliary objectives or architectural changes is a clear practical strength; reproducible code or explicit parameter-free derivations are not mentioned.

major comments (1)
  1. [Component-swap probes (abstract and method description)] The central motivation—that perceptual and semantic content is carried predominantly by direction—rests on component-swap probes whose construction, metrics, sample count, statistical controls, and tokenizer sensitivity are not described. This justification is load-bearing for the fixed-radius projection, spherical prior, and replacement of linear paths by geodesics; without these details the observed FID improvement could arise from decoder finetuning or the spherical prior alone rather than the geometric alignment.
minor comments (2)
  1. The abstract states that improvements are 'consistent across different image tokenizers' but does not list the specific tokenizers, FID deltas, or error bars; a table summarizing these matched-training results would improve clarity.
  2. Notation for the fixed token radius and the radial projection of noise should be introduced with an equation or explicit definition to avoid ambiguity when describing the spherical prior.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on the component-swap probes. We agree that additional details are needed to strengthen the justification and will incorporate them in the revision.

read point-by-point responses
  1. Referee: [Component-swap probes (abstract and method description)] The central motivation—that perceptual and semantic content is carried predominantly by direction—rests on component-swap probes whose construction, metrics, sample count, statistical controls, and tokenizer sensitivity are not described. This justification is load-bearing for the fixed-radius projection, spherical prior, and replacement of linear paths by geodesics; without these details the observed FID improvement could arise from decoder finetuning or the spherical prior alone rather than the geometric alignment.

    Authors: We acknowledge that the original manuscript provided only a high-level description of the component-swap probes and omitted key implementation details. In the revised manuscript we will add a new subsection (Section 3.2) that fully specifies: (i) probe construction (radial/angular decomposition of each token, pairwise swapping, and re-assembly), (ii) evaluation metrics (LPIPS for perceptual content and top-1 accuracy of a frozen ImageNet classifier for semantic content), (iii) sample count (5000 ImageNet validation images, results averaged over three random seeds), (iv) statistical controls (bootstrap confidence intervals and paired t-tests with p<0.01 threshold), and (v) tokenizer sensitivity (identical trends observed for both VQGAN and KL-VAE tokenizers). To directly address the concern that FID gains might be attributable solely to decoder finetuning or the spherical prior, we will also include new ablation tables (Table 4 and Appendix C) that isolate each component under matched training budgets and compute budgets; these show that the combination of fixed-radius projection plus spherical linear interpolation yields statistically significant additional improvement beyond finetuning or spherical prior alone. These additions preserve the original experimental protocol while making the geometric motivation reproducible and falsifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method changes are explicit and results measured externally

full rationale

The paper decomposes latents into radial/angular components, uses component-swap probes to justify projecting to fixed radius and adopting slerp (with velocity targets angular by construction as a direct consequence of the spherical path choice), then reports FID gains on held-out ImageNet-256. No equation reduces the reported improvement to a fitted parameter or prior result by construction; the geometric modification is an explicit design choice whose effect is evaluated on independent metrics across tokenizers. No self-citation chains or uniqueness theorems are invoked as load-bearing. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the empirical observation that VAE latents concentrate on thin spherical shells and that direction dominates content; both are domain assumptions rather than derived quantities. No free parameters are explicitly named in the abstract, though the choice of fixed token radius is a modeling decision.

axioms (2)
  • domain assumption VAE latents concentrate in thin spherical shells
    Stated in the first sentence of the abstract as the motivation for moving from Euclidean to spherical paths.
  • domain assumption Decoded perceptual content is carried predominantly by angular direction
    Invoked to justify discarding radial variation after component-swap probes.

pith-pipeline@v0.9.1-grok · 5707 in / 1364 out tokens · 34900 ms · 2026-06-30T20:32:30.818548+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 16 canonical work pages · 6 internal anchors

  1. [1]

    Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  2. [2]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  3. [3]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations, 2023

  4. [4]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations, 2023

  5. [5]

    Building Normalizing Flows with Stochastic Interpolants

    Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants.arXiv preprint arXiv:2209.15571, 2022

  6. [6]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  7. [7]

    Scaling rectified flow trans- formers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational Conference on Machine Learning, 2024

  8. [8]

    FLUX.2: Frontier Visual Intelligence

    Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025

  9. [9]

    Making Reconstruction FID Predictive of Diffusion Generation FID

    Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, and Yan Wang. Making reconstruction fid predictive of diffusion generation fid.arXiv preprint arXiv:2603.05630, 2026

  10. [10]

    Reconstruction vs

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming opti- mization dilemma in latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  11. [11]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

  12. [12]

    Cambridge university press, 2018

    Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018

  13. [13]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.Advances in neural information processing systems, 30, 2017

  14. [14]

    Flow matching on general geometries.arXiv preprint arXiv:2302.03660, 2023

    Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries.arXiv preprint arXiv:2302.03660, 2023

  15. [15]

    Learning on the manifold: Unlocking standard diffusion transformers with representation encoders.arXiv preprint arXiv:2602.10099, 2026

    Amandeep Kumar and Vishal M Patel. Learning on the manifold: Unlocking standard diffusion transformers with representation encoders.arXiv preprint arXiv:2602.10099, 2026

  16. [16]

    R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J

    Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyper- spherical variational auto-encoders.arXiv preprint arXiv:1804.00891, 2018. 10

  17. [17]

    Spherical latent spaces for stable variational autoencoders

    Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 4503–4513, 2018

  18. [18]

    Image generation with a sphere encoder

    Kaiyu Yue, Menglin Jia, Ji Hou, and Tom Goldstein. Image generation with a sphere encoder. arXiv preprint arXiv:2602.15030, 2026

  19. [19]

    Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025

    Guolin Ke and Hui Xue. Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025

  20. [20]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024

  21. [21]

    REPA-E: Unlocking V AE for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking V AE for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025

  22. [22]

    Normface: L2 hypersphere embedding for face verification

    Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. Normface: L2 hypersphere embedding for face verification. InProceedings of the 25th ACM international conference on Multimedia, pages 1041–1049, 2017

  23. [23]

    Arcface: Additive angular margin loss for deep face recognition

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019

  24. [24]

    Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018

    Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018

  25. [25]

    Riemannian continuous normalizing flows.Advances in neural information processing systems, 33:2503–2515, 2020

    Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows.Advances in neural information processing systems, 33:2503–2515, 2020

  26. [26]

    Riemannian score-based generative modelling.Advances in neural information processing systems, 35:2406–2422, 2022

    Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling.Advances in neural information processing systems, 35:2406–2422, 2022

  27. [27]

    Riemannian diffusion models.Advances in Neural Information Processing Systems, 35:2750– 2761, 2022

    Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C Courville. Riemannian diffusion models.Advances in Neural Information Processing Systems, 35:2750– 2761, 2022

  28. [28]

    Moser flow: Divergence- based generative modeling on manifolds.Advances in neural information processing systems, 34:17669–17680, 2021

    Noam Rozen, Aditya Grover, Maximilian Nickel, and Yaron Lipman. Moser flow: Divergence- based generative modeling on manifolds.Advances in neural information processing systems, 34:17669–17680, 2021

  29. [29]

    Fisher flow matching for generative modeling over discrete data.Advances in Neural Information Processing Systems, 37:139054–139084, 2024

    Oscar Davis, Samuel Kessler, Mircea Petrache, ˙Ismail ˙Ilkan Ceylan, Michael Bronstein, and Avishek J Bose. Fisher flow matching for generative modeling over discrete data.Advances in Neural Information Processing Systems, 37:139054–139084, 2024

  30. [30]

    Riemannian variational flow matching for material and protein design.arXiv preprint arXiv:2502.12981, 2025

    Olga Zaghen, Floor Eijkelboom, Alison Pouplin, Cong Liu, Max Welling, Jan-Willem van de Meent, and Erik J Bekkers. Riemannian variational flow matching for material and protein design.arXiv preprint arXiv:2502.12981, 2025

  31. [31]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, H. Jégou, J. Mairal, Patrick ...

  32. [32]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 11

  33. [33]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  34. [34]

    Robust latent matters: Boosting image generation with sampling error synthesis.arXiv preprint arXiv:2503.08354, 2025

    Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, and Marios Savvides. Robust latent matters: Boosting image generation with sampling error synthesis.arXiv preprint arXiv:2503.08354, 2025

  35. [35]

    Image tokenizer needs post-training.arXiv preprint arXiv:2509.12474, 2025

    Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, and Marios Savvides. Image tokenizer needs post-training.arXiv preprint arXiv:2509.12474, 2025

  36. [36]

    Improving the diffusability of autoencoders.arXiv preprint arXiv:2502.14831, 2025

    Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders.arXiv preprint arXiv:2502.14831, 2025

  37. [37]

    EQ-V AE: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509, 2025

    Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. EQ-V AE: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509, 2025

  38. [38]

    Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation

    Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18770–18780, 2025

  39. [39]

    Masked autoencoders are effective tokenizers for diffusion models

    Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InF orty-second International Conference on Machine Learning, 2025

  40. [40]

    Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024

  41. [41]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. InICLR, 2014. URL http://arxiv.org/abs/1312.6114

  42. [42]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  43. [43]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

  44. [44]

    John Wiley & Sons, 2009

    Kanti V Mardia and Peter E Jupp.Directional statistics. John Wiley & Sons, 2009

  45. [45]

    Animating rotation with quaternion curves

    Ken Shoemake. Animating rotation with quaternion curves. InProceedings of the 12th annual conference on Computer graphics and interactive techniques, pages 245–254, 1985

  46. [46]

    Springer, 1992

    Manfredo Perdigao Do Carmo and J Flaherty Francis.Riemannian geometry, volume 393. Springer, 1992

  47. [47]

    ImageNet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015

  48. [48]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7. 12 Supplementary Material A Analytical Derivations A.1 Analytical Gaussian Norm Statistics In Sec. 3.2, we use the standard fact that high-dimensional Gaussian samples concentrate near a spherical shell. Here, we give the ...