pith. machine review for the scientific record.

arxiv: 2605.07693 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: no theorem link

Toward Better Geometric Representations for Molecule Generative Models

Cai Zhou, Kai Liu, Muhan Zhang, Qiaojing Huang, Shaoheng Yan, Zian Li


Pith reviewed 2026-05-11 03:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords molecule generation · geometric representations · representation-conditioned generation · LENSEs · REPA loss · pretrained encoders · UniMol · 3D molecule stability

The pith

LENSEs improves molecule generation by pairing a jointly trained representation head with a perceptual loss and node-level alignment to smooth pretrained encoder outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that representation-conditioned molecule generation is bottlenecked by non-smooth pretrained encoders such as UniMol. LENSEs addresses this with three mechanisms that extract multi-level features, optimize the generator in semantic space, and explicitly align generator states to encoder representations. If these changes work as described, generated molecules become more valid and stable without altering the underlying generator architecture. A reader would care because smoother representations could make 3D molecule models more practical for chemistry and drug-design workflows.

Core claim

Pretrained molecular encoders produce non-smooth representations that limit generation quality. LENSEs counters this by simultaneously training a representation head for multi-level features, applying a molecule perceptual loss in the semantic space, and using a node-level representation alignment (REPA) loss to close the gap between pretraining and generation. On the GEOM-DRUG dataset this yields 97.28 percent validity and 98.51 percent molecule stability while also lowering the Lipschitz constant by a factor of 4.6 and improving performance on QM9 probing tasks, indicating that generative training with alignment objectives can serve as a pretraining paradigm for molecular encoders.

What carries the argument

The LENSEs framework: a jointly trained representation head, a molecule perceptual loss, and a REPA alignment loss that together refine pretrained geometric representations during generator training.
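The three mechanisms can be sketched as one combined training objective. This is a minimal NumPy sketch under assumed loss forms (MSE for the perceptual term, mean cosine similarity for REPA) and an assumed linear projection into encoder space; the paper's exact formulation, shapes, and weightings may differ.

```python
import numpy as np

def cosine_sim(a, b):
    # Row-wise cosine similarity between two (N, d) arrays.
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den

def lenses_losses(gen_loss, gen_hidden, W_proj,
                  mol_pred_repr, mol_true_repr, enc_node_repr,
                  w_perc=0.5, w_repa=0.5):
    """Combine the base generative loss with the two auxiliary terms.

    gen_loss:      scalar generative objective (e.g. a diffusion loss)
    gen_hidden:    (N, d_gen) generator node hidden states
    W_proj:        (d_gen, d_enc) projection into the encoder space
    mol_pred_repr: (B, d) encoder embedding of the generated molecule
    mol_true_repr: (B, d) encoder embedding of the reference molecule
    enc_node_repr: (N, d_enc) frozen-encoder node representations
    """
    # Molecule perceptual loss: compare molecules in the semantic
    # representation space rather than in raw coordinate space.
    perceptual = np.mean((mol_pred_repr - mol_true_repr) ** 2)
    # Node-level REPA loss: push projected generator states toward
    # the frozen encoder's node representations.
    proj = gen_hidden @ W_proj
    repa = 1.0 - cosine_sim(proj, enc_node_repr).mean()
    return gen_loss + w_perc * perceptual + w_repa * repa
```

In actual training, `gen_loss` would be the diffusion or flow objective and `W_proj` would be a learned projection head rather than a fixed matrix; both names are placeholders.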

If this is right

  • Molecule generators conditioned on refined representations achieve higher validity and stability on challenging datasets such as GEOM-DRUG.
  • The Lipschitz constant of the learned mapping decreases substantially, producing smoother representation spaces.
  • Probing tasks on QM9 show that the refined representations carry more semantic information for downstream molecular tasks.
  • Generative training combined with alignment losses can function as an effective pretraining objective for molecular encoders.
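The QM9 probing bullet is the kind of claim usually checked with a linear probe: fit a linear map on frozen representations and score held-out targets. A least-squares sketch, where `reps` and `targets` are hypothetical stand-ins for encoder outputs and QM9 properties:

```python
import numpy as np

def linear_probe_r2(reps, targets, train_frac=0.8, seed=0):
    """Fit a least-squares linear probe on frozen representations
    and report held-out R^2 as the probing score."""
    rng = np.random.default_rng(seed)
    n = len(reps)
    idx = rng.permutation(n)
    cut = int(train_frac * n)
    tr, te = idx[:cut], idx[cut:]
    # Append a bias column and solve the least-squares problem.
    X = np.hstack([reps, np.ones((n, 1))])
    w, *_ = np.linalg.lstsq(X[tr], targets[tr], rcond=None)
    pred = X[te] @ w
    ss_res = np.sum((targets[te] - pred) ** 2)
    ss_tot = np.sum((targets[te] - targets[te].mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A higher held-out R^2 after refinement would support the bullet's claim that the representations carry more semantic information.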

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment pattern could be tested on other pretrained encoders beyond UniMol to check whether non-smoothness is a general limitation.
  • If the perceptual and alignment losses prove robust, they might reduce reliance on extensive hyperparameter searches when adapting new encoders to generation.
  • The approach implicitly suggests that future encoder pretraining objectives should incorporate generative consistency as a regularizer.

Load-bearing premise

The observed gains in validity, stability, and smoothness arise primarily from the three proposed mechanisms rather than from baseline implementation differences or dataset-specific tuning choices.

What would settle it

Retraining the same generator on GEOM-DRUG after ablating any one of the three mechanisms (representation head, perceptual loss, or REPA) and finding that validity or stability falls below the previous state-of-the-art would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.07693 by Cai Zhou, Kai Liu, Muhan Zhang, Qiaojing Huang, Shaoheng Yan, Zian Li.

Figure 1. Layer-wise representation analysis (Frad on QM9). Left: cross-layer cosine similarity heatmap (max similarity 0.815). Right: linear probe accuracy across all 9 layers; early layers consistently outperform late layers across all functional groups.
Figure 2. Cosine similarity between the generator's node representations and the pretrained encoder's layer-wise representations. Higher similarity (warmer colors) indicates stronger representational alignment.
Figure 3. The framework of LENSEs. Phase I: the molecule generator and the representation head are trained based on a pretrained encoder. The representation head takes as input a learnable pooling of representations from different encoder layers, while the molecule generator performs conditional generation using the reparameterized latent produced by the representation head. Phase II: the RDM is trained to model the…
Figure 4. Latent space visualization. The two rows show…
Figure 5. Cross-layer cosine similarity heatmap. Cosine similarity between mean layer representations reveals two distinct blocks: L0–L5 (feature extraction, r = 0.66–0.95) and L6–L8 (semantic aggregation, r = 0.61–0.96). The sharpest transition occurs at L5→L6.
Figure 6. Learnable layer pooling in the representation head (GEOM-DRUG).
Figure 7. Distribution of pairwise Euclidean distances between molecular representations in the raw UniMol encoder space versus the representation head's latent space. The raw encoder representations cluster into an extremely tight region (mean distance: 0.59, std: 0.49). The representation head expands this space by 287× (mean: 169.12, std: 42.49), producing a well-structured latent manifold with meaning…
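The learnable layer pooling shown in Figures 3 and 6 can be sketched as a softmax-weighted sum over per-layer encoder outputs; the shapes and the assumption that the logits are trained jointly with the representation head are inferred from the captions, not stated implementation details.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array of logits.
    z = np.exp(x - x.max())
    return z / z.sum()

def layer_pool(layer_reps, logits):
    """Softmax-weighted pooling over encoder layers.

    layer_reps: (L, N, d) per-layer node representations
    logits:     (L,) pooling logits, learned with the head
    """
    w = softmax(logits)                         # (L,) weights
    return np.tensordot(w, layer_reps, axes=1)  # (N, d) pooled reps
```

With uniform logits this reduces to averaging the layers; trained logits let the head emphasize the early, more predictive layers that Figure 1 highlights.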
Original abstract

Geometric representation-conditioned molecule generation provides an effective paradigm that decouples molecule representation modeling from structure generation. By decoupling molecule generation into two stages (first generating a meaningful molecule representation, and then generating a 3D molecule conditioned on this representation), the efficiency and quality of the generation process can be significantly enhanced. However, its effectiveness is fundamentally limited by the quality of the representation space: pretrained molecular encoders, such as UniMol, produce representations that are non-smooth and not fully exploited during the generative training process. In this work, we propose LENSEs, a framework that better exploits the potential of molecule representations in representation-conditioned generation methods. In particular, LENSEs introduces three complementary mechanisms: (1) a representation head, simultaneously trained during generative tasks, that extracts multi-level representations from the pretrained encoder; (2) a molecule perceptual loss that optimizes the generator in a semantic-informative representation space; and (3) a node-level representation alignment (REPA) loss that explicitly aligns the generator's hidden states with encoder representations, reducing the semantic gap between pretraining and generation. We demonstrate the effectiveness of these improvements through extensive molecule generation tasks. Specifically, on the challenging molecule generation dataset GEOM-DRUG, LENSEs achieves 97.28% validity and 98.51% molecule stability, surpassing existing advanced methods. Further analyses through Lipschitz constant reduction (4.6x) and QM9 probing tasks also demonstrate the smoother, more informative refined representations, establishing generative training with alignment objectives as a potential pretraining paradigm for molecular encoders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LENSEs, a framework for improving geometric representations in molecule generative models by decoupling representation learning from structure generation. It proposes three mechanisms—a jointly trained representation head extracting multi-level features from pretrained encoders like UniMol, a molecule perceptual loss operating in semantic representation space, and a node-level REPA alignment loss to reduce the pretraining-generation semantic gap—and reports that these yield smoother, more informative representations. On the GEOM-DRUG dataset, LENSEs achieves 97.28% validity and 98.51% stability, outperforming prior methods, with supporting evidence from a 4.6x Lipschitz constant reduction and QM9 probing tasks; the work also suggests generative alignment as a potential pretraining paradigm for molecular encoders.

Significance. If the reported gains on GEOM-DRUG and the Lipschitz/QM9 analyses are shown to be causally attributable to the three proposed mechanisms rather than uncontrolled factors, the work would meaningfully advance representation-conditioned molecule generation by demonstrating how to better exploit pretrained encoders. The concrete metrics, explicit Lipschitz analysis, and suggestion of generative training as pretraining provide falsifiable claims that could influence follow-up work on smoother molecular latent spaces.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the headline claim that the three mechanisms (representation head, perceptual loss, REPA) are the primary drivers of the 97.28% validity / 98.51% stability on GEOM-DRUG is not supported by ablation studies that hold encoder weights, optimizer schedule, total compute, and data splits fixed while varying only the new components; without such controls, the attribution to smoother representations remains unverified.
  2. [Abstract] Abstract: the 4.6x Lipschitz constant reduction is presented as evidence of improved representation smoothness, but the manuscript does not specify the exact computation (e.g., which layers, finite-difference method, or sampling strategy) or whether the reduction persists when the generator is trained without the perceptual and REPA losses, weakening the link to the proposed mechanisms.
  3. [Experiments] Experiments section: baseline comparisons to prior advanced methods do not clarify whether those baselines were re-implemented with the identical UniMol encoder and training budget or taken directly from original publications; any mismatch in implementation details could account for the reported gains independently of the new losses.
minor comments (2)
  1. [Method] The equations defining the perceptual loss and REPA alignment loss would benefit from explicit notation for the weighting hyperparameters and the precise form of the alignment (e.g., cosine similarity or MSE on node embeddings) to aid reproducibility.
  2. [Experiments] Table reporting GEOM-DRUG metrics should include standard deviations over multiple runs and a column indicating whether each baseline uses the same pretrained encoder backbone.
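On major comment 2: one standard way to make the Lipschitz computation explicit is an empirical lower bound from finite differences over sampled input pairs. This sketch uses a placeholder `encode` function; the paper's actual choice of layers, metric, and sampling strategy is exactly what the comment asks the authors to specify.

```python
import numpy as np

def lipschitz_estimate(encode, inputs, n_pairs=1000, seed=0):
    """Empirical lower bound on the Lipschitz constant of `encode`:
    the max over sampled input pairs of ||f(x)-f(y)|| / ||x-y||.
    `encode` maps one input array to one representation array."""
    rng = np.random.default_rng(seed)
    n = len(inputs)
    best = 0.0
    for _ in range(n_pairs):
        i, j = rng.integers(0, n, size=2)
        if i == j:
            continue
        dx = np.linalg.norm(inputs[i] - inputs[j])
        if dx == 0:
            continue  # skip degenerate pairs
        df = np.linalg.norm(encode(inputs[i]) - encode(inputs[j]))
        best = max(best, df / dx)
    return best
```

Reporting this estimate for the refined versus raw representation map, under a fixed pairing protocol, would be one way to substantiate the claimed 4.6x reduction.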

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications based on the manuscript and committing to revisions where the concerns identify gaps in detail or controls.

Point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the headline claim that the three mechanisms (representation head, perceptual loss, REPA) are the primary drivers of the 97.28% validity / 98.51% stability on GEOM-DRUG is not supported by ablation studies that hold encoder weights, optimizer schedule, total compute, and data splits fixed while varying only the new components; without such controls, the attribution to smoother representations remains unverified.

    Authors: We agree that isolating the causal contribution of the representation head, perceptual loss, and REPA requires ablations with all other factors fixed. The manuscript reports full LENSEs results against prior methods plus supporting Lipschitz and QM9 analyses, but does not include the exact controlled ablations described. In the revision we will add these experiments: the pretrained UniMol encoder weights will be held fixed, the optimizer schedule, total compute, and data splits will be identical, and we will vary only the presence of each new component (and combinations thereof). This will directly test attribution to the proposed mechanisms. revision: yes

  2. Referee: [Abstract] Abstract: the 4.6x Lipschitz constant reduction is presented as evidence of improved representation smoothness, but the manuscript does not specify the exact computation (e.g., which layers, finite-difference method, or sampling strategy) or whether the reduction persists when the generator is trained without the perceptual and REPA losses, weakening the link to the proposed mechanisms.

    Authors: We acknowledge the need for greater transparency on the Lipschitz analysis. The revised manuscript will specify the exact layers evaluated, the finite-difference procedure, and the sampling strategy over representations. We will also report the Lipschitz constant obtained when the generator is trained without the perceptual and REPA losses (while keeping all other settings identical). This will clarify whether the reported reduction is attributable to the alignment objectives. revision: yes

  3. Referee: [Experiments] Experiments section: baseline comparisons to prior advanced methods do not clarify whether those baselines were re-implemented with the identical UniMol encoder and training budget or taken directly from original publications; any mismatch in implementation details could account for the reported gains independently of the new losses.

    Authors: All baselines were re-implemented from scratch using the identical UniMol encoder, training budget, optimizer schedule, and data splits as LENSEs. The revised Experiments section will explicitly state this re-implementation protocol and list the shared hyperparameters to eliminate any ambiguity about implementation differences. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proposal with external benchmarks and no self-referential reductions.

full rationale

The paper introduces LENSEs as a framework with three mechanisms (representation head, perceptual loss, REPA alignment) trained jointly on top of a fixed pretrained encoder (UniMol). Performance claims are supported by direct evaluation on GEOM-DRUG (validity/stability metrics) and auxiliary analyses (Lipschitz constants, QM9 probing), none of which reduce by construction to fitted inputs or self-citations. No equations are presented that equate a 'prediction' to a parameter fit, and no uniqueness theorems or ansatzes are imported from prior author work. The derivation chain consists of standard generative modeling steps plus new loss terms whose effects are measured externally rather than defined into existence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that pretrained encoders provide a useful but imperfect starting point whose limitations can be mitigated by joint training with generative objectives; no new physical entities are postulated.

free parameters (1)
  • loss weighting hyperparameters
    Weights balancing the perceptual loss and REPA loss against the main generative objective are likely tuned on validation data.
axioms (1)
  • domain assumption
    Pretrained molecular encoders such as UniMol produce representations that are non-smooth and under-exploited in generative training.
    Stated directly in the motivation section of the abstract as the core limitation being addressed.

pith-pipeline@v0.9.0 · 5590 in / 1300 out tokens · 51318 ms · 2026-05-11T03:11:21.683860+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 4 internal anchors

  1. [1]

    Deep generative molecular design reshapes drug discovery.Cell Reports Medicine, 3(12), 2022

    Xiangxiang Zeng, Fei Wang, Yuan Luo, Seung-gu Kang, Jian Tang, Felice C Lightstone, Evandro F Fang, Wendy Cornell, Ruth Nussinov, and Feixiong Cheng. Deep generative molecular design reshapes drug discovery.Cell Reports Medicine, 3(12), 2022

  2. [2]

    Molecular design in drug discovery: a comprehensive review of deep generative models.Briefings in bioinformatics, 22(6):bbab344, 2021

    Yu Cheng, Yongshun Gong, Yuansheng Liu, Bosheng Song, and Quan Zou. Molecular design in drug discovery: a comprehensive review of deep generative models.Briefings in bioinformatics, 22(6):bbab344, 2021

  3. [3]

    A survey of generative ai for de novo drug design: new frontiers in molecule and protein generation.Briefings in Bioinformatics, 25(4):bbae338, 2024

    Xiangru Tang, Howard Dai, Elizabeth Knight, Fang Wu, Yunyang Li, Tianxiao Li, and Mark Gerstein. A survey of generative ai for de novo drug design: new frontiers in molecule and protein generation.Briefings in Bioinformatics, 25(4):bbae338, 2024

  4. [4]

    Smiles, a chemical language and information system

    David Weininger. Smiles, a chemical language and information system. 1. introduction to methodology and encoding rules.Journal of Chemical Information and Computer Sciences, 28 (1):31–36, 1988

  5. [5]

    Self-referencing embedded strings (selfies): A 100% robust molecular string representation

    Mario Krenn, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. Self-referencing embedded strings (selfies): A 100% robust molecular string representation. Machine Learning: Science and Technology, 1(4):045024, 2020

  6. [6]

    Group selfies: a robust fragment-based molecular string representation.Digital Discovery, 2(3):748–758, 2023

    Austin H Cheng, Andy Cai, Santiago Miret, Gustavo Malkomes, Mariano Phielipp, and Alán Aspuru-Guzik. Group selfies: a robust fragment-based molecular string representation.Digital Discovery, 2(3):748–758, 2023

  7. [7]

    Equivariant diffusion for molecule generation in 3d.International Conference on Machine Learning, pages 9087–9102, 2022

    Emiel Hoogeboom, Víctor Garcia Satorras, Clement Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d.International Conference on Machine Learning, pages 9087–9102, 2022

  8. [8]

    arXiv preprint arXiv:2203.02923 , year=

    Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geo- metric diffusion model for molecular conformation generation.arXiv preprint arXiv:2203.02923, 2022

  9. [9]

    Molvision: Molecular property prediction with vision language models.arXiv preprint arXiv:2507.03283, 2025

    Deepan Adak, Yogesh Singh Rawat, and Shruti Vyas. Molvision: Molecular property prediction with vision language models.arXiv preprint arXiv:2507.03283, 2025

  10. [10]

    Zhanfeng Wang, Wenhao Zhang, Minghong Jiang, Yicheng Chen, Zhenyu Zhu, Wenjie Yan, Jianming Wu, and Xin Xu. X2-gnn: A physical message passing neural network with natural generalization ability to large and complex molecules.The Journal of Physical Chemistry Letters, 15(51):12501–12512, 2024

  11. [11]

    Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  12. [12]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  13. [13]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022

  14. [14]

    De novo design of protein structure and function with rfdiffusion.Nature, 620(7976):1089–1100, 2023

    Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion.Nature, 620(7976):1089–1100, 2023

  15. [15]

    Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

    Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion steps, twists, and turns for molecular docking.arXiv preprint arXiv:2210.01776, 2022

  16. [16]

    Applications of deep learning in molecule generation and molecular property prediction.Accounts of chemical research, 54(2):263–270, 2020

    W Patrick Walters and Regina Barzilay. Applications of deep learning in molecule generation and molecular property prediction.Accounts of chemical research, 54(2):263–270, 2020

  17. [17]

    Self-driving laboratories for chemistry and materials science.Chemical Reviews, 124(16):9633–9732, 2024

    Gary Tom, Stefan P Schmid, Sterling G Baird, Yang Cao, Kourosh Darvish, Han Hao, Stanley Lo, Sergio Pablo-García, Ella M Rajaonson, Marta Skreta, et al. Self-driving laboratories for chemistry and materials science.Chemical Reviews, 124(16):9633–9732, 2024. 11

  18. [18]

    Advances and challenges in deep generative models for de novo molecule generation

    Dongyu Xue, Yukang Gong, Zhaoyi Yang, Guohui Chuai, Sheng Qu, Aizong Shen, Jing Yu, and Qi Liu. Advances and challenges in deep generative models for de novo molecule generation. Wiley Interdisciplinary Reviews: Computational Molecular Science, 9(3):e1395, 2019

  19. [19]

    Flowmol3: flow matching for 3d de novo small-molecule generation.Digital Discovery, 2026

    Ian Dunn and David R Koes. Flowmol3: flow matching for 3d de novo small-molecule generation.Digital Discovery, 2026

  20. [20]

    Propmolflow: property-guided molecule generation with geometry-complete flow matching.Nature Computa- tional Science, pages 1–10, 2026

    Cheng Zeng, Jirui Jin, Connor Ambrose, George Karypis, Mark Transtrum, Ellad B Tadmor, Richard G Hennig, Adrian Roitberg, Stefano Martiniani, and Mingjie Liu. Propmolflow: property-guided molecule generation with geometry-complete flow matching.Nature Computa- tional Science, pages 1–10, 2026

  21. [21]

    Applications of modular co-design for de novo 3d molecule generation.Digital Discovery, 5(2):754–768, 2026

    Danny Reidenbach, Filipp Nikitin, Olexandr Isayev, and Saee Gopal Paliwal. Applications of modular co-design for de novo 3d molecule generation.Digital Discovery, 5(2):754–768, 2026

  22. [22]

    3d molecule generation from rigid motifs via se (3) flows.arXiv preprint arXiv:2601.16955, 2026

    Roman Poletukhin, Marcel Kollovieh, Eike Eberhard, and Stephan Günnemann. 3d molecule generation from rigid motifs via se (3) flows.arXiv preprint arXiv:2601.16955, 2026

  23. [23]

    Bidirectional molecule generation with recurrent neural networks.Journal of chemical information and modeling, 60(3):1175–1183, 2020

    Francesca Grisoni, Michael Moret, Robin Lingwood, and Gisbert Schneider. Bidirectional molecule generation with recurrent neural networks.Journal of chemical information and modeling, 60(3):1175–1183, 2020

  24. [24]

    Geometric latent diffusion models for 3d molecule generation

    Minkai Xu, Alexander S Powers, Ron O Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3d molecule generation. InInternational Conference on Machine Learning, pages 38592–38610. PMLR, 2023

  25. [25]

    Geometric representation condition improves equivariant molecule generation.arXiv preprint arXiv:2410.03655, 2024

    Zian Li, Cai Zhou, Xiyuan Wang, Xingang Peng, and Muhan Zhang. Geometric representation condition improves equivariant molecule generation.arXiv preprint arXiv:2410.03655, 2024

  26. [26]

    Diffusion model as representation learner

    Xingyi Yang and Xinchao Wang. Diffusion model as representation learner. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18938–18949, 2023

  27. [27]

    Uni-mol: A universal 3d molecular representation learning framework

    Gengmo Zhou, Zhifeng Gao, Qiankun Ding, Hang Zheng, Hongteng Xu, Zhewei Wei, Linfeng Zhang, and Guolin Ke. Uni-mol: A universal 3d molecular representation learning framework. International Conference on Learning Representations, 2023

  28. [28]

    Fractional denoising for 3d molecular pre-training.International Conference on Machine Learning, 2023

    Shikun Feng, Zhi Ni, Yanyan Lan, Zhi-Ming Ma, and Wei-Ying Ma. Fractional denoising for 3d molecular pre-training.International Conference on Machine Learning, 2023

  29. [29]

    Return of unconditional generation: A self- supervised representation generation method.Advances in Neural Information Processing Systems, 37:125441–125468, 2024

    Tianhong Li, Dina Katabi, and Kaiming He. Return of unconditional generation: A self- supervised representation generation method.Advances in Neural Information Processing Systems, 37:125441–125468, 2024

  30. [30]

    Georecon: Graph-level representation learning for 3d molecules via reconstruction-based pretraining.arXiv preprint arXiv:2506.13174, 2025

    Shaoheng Yan, Zian Li, and Muhan Zhang. Georecon: Graph-level representation learning for 3d molecules via reconstruction-based pretraining.arXiv preprint arXiv:2506.13174, 2025

  31. [31]

    Multi-modal molecular representation learning via structure awareness.IEEE Transactions on Image Processing, 2025

    Rong Yin, Ruyue Liu, Xiaoshuai Hao, Xingrui Zhou, Yong Liu, Can Ma, and Weiping Wang. Multi-modal molecular representation learning via structure awareness.IEEE Transactions on Image Processing, 2025

  32. [32]

    Image style transfer using convolutional neural networks

    Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016

  33. [33]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

  34. [34]

    Perceptual losses for real-time style transfer and super-resolution.European Conference on Computer Vision, pages 694–711, 2016

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution.European Conference on Computer Vision, pages 694–711, 2016

  35. [35]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think.arXiv preprint arXiv:2410.06940, 2024. 12

  36. [36]

    fDOLwOcAA+yedyTcERK5GoVcJuo=

    Cai Zhou, Zijie Chen, Zian Li, Jike Wang, Kaiyi Jiang, Pan Li, Rose Yu, Muhan Zhang, Stephen Bates, and Tommi Jaakkola. Rethinking diffusion models with symmetries through canoni- calization with applications to molecular graph generation.arXiv preprint arXiv:2602.15022, 2026

  37. [37]

    Geometry-complete diffusion for 3D molecule gen- eration and optimization.Communications Chemistry, 7(1):150, 2024

    Alex Morehead and Jianlin Cheng. Geometry-complete diffusion for 3D molecule gen- eration and optimization.Communications Chemistry, 7(1):150, 2024. doi: 10.1038/ s42004-024-01233-z

  38. [38]

    Midi: Mixed graph and 3d denoising diffusion for molecule generation.European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023

    Clement Vignac, Nagham Osman, Laura Toni, and Pascal Frossard. Midi: Mixed graph and 3d denoising diffusion for molecule generation.European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2023

  39. [39]

    Equivariant flow matching with hybrid probability transport for 3d molecule generation.Advances in Neural Information Processing Systems, 36:549–568, 2023

    Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou, and Wei-Ying Ma. Equivariant flow matching with hybrid probability transport for 3d molecule generation.Advances in Neural Information Processing Systems, 36:549–568, 2023

  40. [40]

    Accelerating 3d molecule generation via jointly geometric optimal transport.arXiv preprint arXiv:2405.15252, 2024

    Haokai Hong, Wanyu Lin, and Kay Chen Tan. Accelerating 3d molecule generation via jointly geometric optimal transport.arXiv preprint arXiv:2405.15252, 2024

  41. [41]

    arXiv preprint arXiv:2303.03543 , year=

    Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, and Jianzhu Ma. 3d equivariant diffusion for target-aware molecule generation and affinity prediction.arXiv preprint arXiv:2303.03543, 2023

  42. [42]

    Structure-based drug design with equivariant diffusion models.Nature Computational Science, 4(12):899–909, 2024

    Arne Schneuing, Charles Harris, Yuanqi Du, Kieran Didi, Arian Jamasb, Ilia Igashov, Weitao Du, Carla Gomes, Tom L Blundell, Pietro Lio, et al. Structure-based drug design with equivariant diffusion models.Nature Computational Science, 4(12):899–909, 2024

  43. [43]

    High- resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  44. [44]

    Geometric latent diffusion models for 3D molecule generation

    Minkai Xu, Alexander Powers, Ron Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3D molecule generation. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 38592–38610. PMLR, 2023

  45. [45]

    The unrea- sonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unrea- sonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  46. [46]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023

  47. [47]

    Spin-nerf: Multiview segmen- tation and perceptual inpainting with neural radiance fields

    Ashkan Mirzaei, Tristan Aumentado-Armstrong, Konstantinos G Derpanis, Jonathan Kelly, Marcus A Brubaker, Igor Gilitschenski, and Alex Levinshtein. Spin-nerf: Multiview segmen- tation and perceptual inpainting with neural radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20669–20679, 2023

  48. [48]

Chenyu Wang, Cai Zhou, Sharut Gupta, Zongyu Lin, Stefanie Jegelka, Stephen Bates, and Tommi Jaakkola. Learning diffusion models with flexible representation guidance. In ICML 2025 Generative AI and Biology (GenBio) Workshop, 2025. URL https://openreview.net/forum?id=o2W4FTtBVJ.

  49. [49]

Sheheryar Zaidi, Michael Schaarschmidt, James Martens, Hyunjik Kim, Yee Whye Teh, Alvaro Sanchez-Gonzalez, Peter Battaglia, Razvan Pascanu, and Jonathan Godwin. Pre-training via denoising for molecular property prediction. arXiv preprint arXiv:2206.00133, 2022.

  50. [50]

Tynan Perez and Rafael Gomez-Bombarelli. Self-conditioned denoising for atomistic representation learning. arXiv preprint arXiv:2603.17196, 2026.

  51. [51]

Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.

  52. [52]

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.

  53. [53]

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

  54. [54]

Simon Axelrod and Rafael Gomez-Bombarelli. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):185, 2022.

  55. [55]

Ross Irwin, Alessandro Tibo, Jon Paul Janet, and Simon Olsson. SemlaFlow: Efficient 3D molecular generation with latent attention and equivariant flow matching. arXiv preprint arXiv:2406.07266, 2024.

  56. [56]

Ian Dunn and David Ryan Koes. Mixed continuous and categorical flow matching for 3D de novo molecule generation. arXiv preprint arXiv:2404, 2024.

  57. [57]

Han Huang, Leilei Sun, Bowen Du, and Weifeng Lv. Learning joint 2D & 3D diffusion models for complete molecule generation. arXiv preprint arXiv:2305.12347, 2023.

  58. [58]

Tuan Le, Julian Cremer, Frank Noé, Djork-Arné Clevert, and Kristof Schütt. Navigating the design space of equivariant diffusion-based generative models for de novo 3D molecule generation. arXiv preprint arXiv:2309.17296, 2023.

  59. [59]

Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pages 606–610. IEEE, 2007.

Appendix Contents

A Methods and Algorithms
A.1 Additional Implementation Details
A.2 Training Algorithm ...