Aligning Latent Geometry for Spherical Flow Matching in Image Generation
Pith reviewed 2026-06-30 20:32 UTC · model grok-4.3
The pith
Projecting VAE latents to a fixed radius and replacing linear paths with spherical interpolation improves class-conditional image generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Decomposing each latent token into radial and angular parts reveals through component-swap probes that decoded content is carried predominantly by direction. Projecting data latents onto a fixed token radius, taking the radial projection of Gaussian noise as the spherical prior, finetuning the decoder with the encoder frozen, and switching to spherical linear interpolation keeps every point on the sphere and yields purely angular velocity targets. Matched training then improves class-conditional ImageNet-256 FID across multiple image tokenizers without altering the diffusion architecture or introducing auxiliary objectives.
What carries the argument
Radial-angular decomposition of latent tokens, followed by fixed-radius projection and spherical linear interpolation.
If this is right
- Geodesic paths remain on the sphere at every timestep by construction.
- Velocity targets become purely angular, removing any radial component from the learning signal.
- FID scores improve on class-conditional ImageNet-256 for multiple tokenizers under matched training.
- No changes to the diffusion architecture or addition of auxiliary losses are required.
Where Pith is reading between the lines
- If angular dominance holds in other autoencoder spaces, the same projection-plus-slerp recipe could be tested on video or audio generation models.
- One could measure whether permitting controlled radius jitter at inference time increases sample diversity without harming the observed FID gains.
- Direction-only regularization derived from the same decomposition might simplify contrastive or reconstruction objectives in representation learning.
Load-bearing premise
The component-swap probes show that direction, not radius, carries nearly all perceptual and semantic content in the decoded images.
What would settle it
A controlled experiment that swaps only the radii of paired latent tokens while preserving directions and then measures large drops in semantic consistency or perceptual quality of the reconstructions would falsify the premise.
Figures
read the original abstract
Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes aligning latent geometry for spherical flow matching in image generation. It decomposes each latent token into radial and angular components and uses component-swap probes to argue that decoded perceptual and semantic content is carried predominantly by direction (with radius contributing much less). Based on this, the authors project data latents onto a fixed token radius, adopt the radial projection of Gaussian noise as a spherical prior, finetune the decoder (encoder frozen), and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths remain on the sphere at every timestep with purely angular velocity targets. Under matched training, the method is reported to improve class-conditional ImageNet-256 FID across different image tokenizers while leaving the diffusion architecture unchanged and requiring no auxiliary encoder or representation-alignment objective.
Significance. If the reported FID gains hold under the stated controls and the geometric justification is robust, the work offers a lightweight, architecture-preserving modification to latent flow matching that respects the spherical concentration of VAE latents. The absence of auxiliary objectives or architectural changes is a clear practical strength; reproducible code or explicit parameter-free derivations are not mentioned.
major comments (1)
- [Component-swap probes (abstract and method description)] The central motivation—that perceptual and semantic content is carried predominantly by direction—rests on component-swap probes whose construction, metrics, sample count, statistical controls, and tokenizer sensitivity are not described. This justification is load-bearing for the fixed-radius projection, spherical prior, and replacement of linear paths by geodesics; without these details the observed FID improvement could arise from decoder finetuning or the spherical prior alone rather than the geometric alignment.
minor comments (2)
- The abstract states that improvements are 'consistent across different image tokenizers' but does not list the specific tokenizers, FID deltas, or error bars; a table summarizing these matched-training results would improve clarity.
- Notation for the fixed token radius and the radial projection of noise should be introduced with an equation or explicit definition to avoid ambiguity when describing the spherical prior.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on the component-swap probes. We agree that additional details are needed to strengthen the justification and will incorporate them in the revision.
read point-by-point responses
-
Referee: [Component-swap probes (abstract and method description)] The central motivation—that perceptual and semantic content is carried predominantly by direction—rests on component-swap probes whose construction, metrics, sample count, statistical controls, and tokenizer sensitivity are not described. This justification is load-bearing for the fixed-radius projection, spherical prior, and replacement of linear paths by geodesics; without these details the observed FID improvement could arise from decoder finetuning or the spherical prior alone rather than the geometric alignment.
Authors: We acknowledge that the original manuscript provided only a high-level description of the component-swap probes and omitted key implementation details. In the revised manuscript we will add a new subsection (Section 3.2) that fully specifies: (i) probe construction (radial/angular decomposition of each token, pairwise swapping, and re-assembly), (ii) evaluation metrics (LPIPS for perceptual content and top-1 accuracy of a frozen ImageNet classifier for semantic content), (iii) sample count (5000 ImageNet validation images, results averaged over three random seeds), (iv) statistical controls (bootstrap confidence intervals and paired t-tests with p<0.01 threshold), and (v) tokenizer sensitivity (identical trends observed for both VQGAN and KL-VAE tokenizers). To directly address the concern that FID gains might be attributable solely to decoder finetuning or the spherical prior, we will also include new ablation tables (Table 4 and Appendix C) that isolate each component under matched training budgets and compute budgets; these show that the combination of fixed-radius projection plus spherical linear interpolation yields statistically significant additional improvement beyond finetuning or spherical prior alone. These additions preserve the original experimental protocol while making the geometric motivation reproducible and falsifiable. revision: yes
Circularity Check
No significant circularity; method changes are explicit and results measured externally
full rationale
The paper decomposes latents into radial/angular components, uses component-swap probes to justify projecting to fixed radius and adopting slerp (with velocity targets angular by construction as a direct consequence of the spherical path choice), then reports FID gains on held-out ImageNet-256. No equation reduces the reported improvement to a fitted parameter or prior result by construction; the geometric modification is an explicit design choice whose effect is evaluated on independent metrics across tokenizers. No self-citation chains or uniqueness theorems are invoked as load-bearing. The derivation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption VAE latents concentrate in thin spherical shells
- domain assumption Decoded perceptual content is carried predominantly by angular direction
Reference graph
Works this paper leans on
-
[1]
Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020
2020
-
[2]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021
2021
-
[3]
Flow matching for generative modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. InInternational Conference on Learning Representations, 2023
2023
-
[4]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations, 2023
2023
-
[5]
Building Normalizing Flows with Stochastic Interpolants
Michael S Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants.arXiv preprint arXiv:2209.15571, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[6]
High- resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
2022
-
[7]
Scaling rectified flow trans- formers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow trans- formers for high-resolution image synthesis. InInternational Conference on Machine Learning, 2024
2024
-
[8]
FLUX.2: Frontier Visual Intelligence
Black Forest Labs. FLUX.2: Frontier Visual Intelligence. https://bfl.ai/blog/flux-2, 2025
2025
-
[9]
Making Reconstruction FID Predictive of Diffusion Generation FID
Tongda Xu, Mingwei He, Shady Abu-Hussein, Jose Miguel Hernandez-Lobato, Haotian Zhang, Kai Zhao, Chao Zhou, Ya-Qin Zhang, and Yan Wang. Making reconstruction fid predictive of diffusion generation fid.arXiv preprint arXiv:2603.05630, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
Reconstruction vs
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming opti- mization dilemma in latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[11]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024
2024
-
[12]
Cambridge university press, 2018
Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018
2018
-
[13]
GANs trained by a two time-scale update rule converge to a local Nash equilibrium.Advances in neural information processing systems, 30, 2017
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.Advances in neural information processing systems, 30, 2017
2017
-
[14]
Flow matching on general geometries.arXiv preprint arXiv:2302.03660, 2023
Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries.arXiv preprint arXiv:2302.03660, 2023
-
[15]
Amandeep Kumar and Vishal M Patel. Learning on the manifold: Unlocking standard diffusion transformers with representation encoders.arXiv preprint arXiv:2602.10099, 2026
-
[16]
R., Falorsi, L., De Cao, N., Kipf, T., and Tomczak, J
Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak. Hyper- spherical variational auto-encoders.arXiv preprint arXiv:1804.00891, 2018. 10
-
[17]
Spherical latent spaces for stable variational autoencoders
Jiacheng Xu and Greg Durrett. Spherical latent spaces for stable variational autoencoders. In Proceedings of the 2018 conference on empirical methods in natural language processing, pages 4503–4513, 2018
2018
-
[18]
Image generation with a sphere encoder
Kaiyu Yue, Menglin Jia, Ji Hou, and Tom Goldstein. Image generation with a sphere encoder. arXiv preprint arXiv:2602.15030, 2026
-
[19]
Guolin Ke and Hui Xue. Hyperspherical latents improve continuous-token autoregressive generation.arXiv preprint arXiv:2509.24335, 2025
-
[20]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[21]
REPA-E: Unlocking V AE for end-to-end tuning of latent diffusion transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. REPA-E: Unlocking V AE for end-to-end tuning of latent diffusion transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18262–18272, 2025
2025
-
[22]
Normface: L2 hypersphere embedding for face verification
Feng Wang, Xiang Xiang, Jian Cheng, and Alan Loddon Yuille. Normface: L2 hypersphere embedding for face verification. InProceedings of the 25th ACM international conference on Multimedia, pages 1041–1049, 2017
2017
-
[23]
Arcface: Additive angular margin loss for deep face recognition
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2019
2019
-
[24]
Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018
Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations.Advances in neural information processing systems, 31, 2018
2018
-
[25]
Riemannian continuous normalizing flows.Advances in neural information processing systems, 33:2503–2515, 2020
Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows.Advances in neural information processing systems, 33:2503–2515, 2020
2020
-
[26]
Riemannian score-based generative modelling.Advances in neural information processing systems, 35:2406–2422, 2022
Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling.Advances in neural information processing systems, 35:2406–2422, 2022
2022
-
[27]
Riemannian diffusion models.Advances in Neural Information Processing Systems, 35:2750– 2761, 2022
Chin-Wei Huang, Milad Aghajohari, Joey Bose, Prakash Panangaden, and Aaron C Courville. Riemannian diffusion models.Advances in Neural Information Processing Systems, 35:2750– 2761, 2022
2022
-
[28]
Moser flow: Divergence- based generative modeling on manifolds.Advances in neural information processing systems, 34:17669–17680, 2021
Noam Rozen, Aditya Grover, Maximilian Nickel, and Yaron Lipman. Moser flow: Divergence- based generative modeling on manifolds.Advances in neural information processing systems, 34:17669–17680, 2021
2021
-
[29]
Fisher flow matching for generative modeling over discrete data.Advances in Neural Information Processing Systems, 37:139054–139084, 2024
Oscar Davis, Samuel Kessler, Mircea Petrache, ˙Ismail ˙Ilkan Ceylan, Michael Bronstein, and Avishek J Bose. Fisher flow matching for generative modeling over discrete data.Advances in Neural Information Processing Systems, 37:139054–139084, 2024
2024
-
[30]
Olga Zaghen, Floor Eijkelboom, Alison Pouplin, Cong Liu, Max Welling, Jan-Willem van de Meent, and Erik J Bekkers. Riemannian variational flow matching for material and protein design.arXiv preprint arXiv:2502.12981, 2025
-
[31]
DINOv2: Learning Robust Visual Features without Supervision
M. Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russ Howes, Po-Yao (Bernie) Huang, Shang-Wen Li, Ishan Misra, Michael G. Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, H. Jégou, J. Mairal, Patrick ...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.07193 2023
-
[32]
Diffusion Transformers with Representation Autoencoders
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders.arXiv preprint arXiv:2510.11690, 2025. 11
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021
2021
-
[34]
Kai Qiu, Xiang Li, Jason Kuen, Hao Chen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, and Marios Savvides. Robust latent matters: Boosting image generation with sampling error synthesis.arXiv preprint arXiv:2503.08354, 2025
-
[35]
Image tokenizer needs post-training.arXiv preprint arXiv:2509.12474, 2025
Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, and Marios Savvides. Image tokenizer needs post-training.arXiv preprint arXiv:2509.12474, 2025
-
[36]
Improving the diffusability of autoencoders.arXiv preprint arXiv:2502.14831, 2025
Ivan Skorokhodov, Sharath Girish, Benran Hu, Willi Menapace, Yanyu Li, Rameen Abdal, Sergey Tulyakov, and Aliaksandr Siarohin. Improving the diffusability of autoencoders.arXiv preprint arXiv:2502.14831, 2025
-
[37]
Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. EQ-V AE: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509, 2025
-
[38]
Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation
Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, and Xihui Liu. Gigatok: Scaling visual tokenizers to 3 billion parameters for autoregressive image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18770–18780, 2025
2025
-
[39]
Masked autoencoders are effective tokenizers for diffusion models
Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InF orty-second International Conference on Machine Learning, 2025
2025
-
[40]
Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization.Advances in Neural Information Processing Systems, 37:56424–56445, 2024
2024
-
[41]
Auto-Encoding Variational Bayes
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. InICLR, 2014. URL http://arxiv.org/abs/1312.6114
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[42]
The unrea- sonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
2018
-
[43]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017
2017
-
[44]
John Wiley & Sons, 2009
Kanti V Mardia and Peter E Jupp.Directional statistics. John Wiley & Sons, 2009
2009
-
[45]
Animating rotation with quaternion curves
Ken Shoemake. Animating rotation with quaternion curves. InProceedings of the 12th annual conference on Computer graphics and interactive techniques, pages 245–254, 1985
1985
-
[46]
Springer, 1992
Manfredo Perdigao Do Carmo and J Flaherty Francis.Riemannian geometry, volume 393. Springer, 1992
1992
-
[47]
ImageNet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge.International journal of computer vision, 115(3):211–252, 2015
2015
-
[48]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7. 12 Supplementary Material A Analytical Derivations A.1 Analytical Gaussian Norm Statistics In Sec. 3.2, we use the standard fact that high-dimensional Gaussian samples concentrate near a spherical shell. Here, we give the ...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.