pith. sign in

arxiv: 2412.10437 · v3 · submitted 2024-12-11 · 💻 cs.CV · cs.GR· cs.LG

SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation

Pith reviewed 2026-05-23 07:19 UTC · model grok-4.3

classification 💻 cs.CV cs.GRcs.LG
keywords SVG generationtext-to-vectorVAEdiffusion transformervector graphicsgenerative modelslatent space fusionrendering sequence
0
0 comments X

The pith

SVGFusion fuses SVG code and rendered pixels in a VAE then diffuses the result to produce editable text-aligned vector graphics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to generate Scalable Vector Graphics from text prompts in a way that preserves editability and structural coherence. Existing sequence-based generators accumulate errors in object relations while optimization approaches produce fixed outputs that cannot be modified afterward. SVGFusion therefore trains a single VAE to encode both the raw SVG commands and the pixel image they produce, creating a shared latent space. A diffusion transformer then samples from this space while a separate sequence model enforces correct layering order. If the approach holds, text-to-vector generation becomes both faster and more practical for downstream editing tasks.

Core claim

The central claim is that a Vector-Pixel Fusion VAE jointly encoding SVG code and its rendered image learns a latent space rich enough for a Vector Space Diffusion Transformer to perform iterative refinement, and that adding Rendering Sequence Modeling ensures correct object layering and occlusion, yielding high-quality editable SVGs that remain strictly aligned with the input text on a 240k-example dataset.

What carries the argument

The Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that jointly encodes SVG code and its rendered image to produce the latent space operated on by the diffusion transformer.

If this is right

  • The diffusion process produces globally coherent compositions through iterative denoising rather than one-shot token prediction.
  • Rendering Sequence Modeling enforces correct depth ordering so overlapping objects appear in the intended visual sequence.
  • Outputs remain fully editable in standard vector tools because the model emits native SVG commands rather than raster approximations.
  • The method scales to a 240k-example corpus of human-designed SVGs and reports state-of-the-art alignment metrics.
  • The same architecture avoids both the error accumulation of flat token sequences and the slow per-example optimization of earlier approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The joint code-image latent space may transfer editing operations learned on pixels back into editable vector commands more reliably than pure code models.
  • Because the VAE sees both modalities, the same framework could be tested on other hybrid representations such as LaTeX or HTML that also have visual renderings.
  • Layer-order modeling might generalize to tasks requiring consistent depth ordering in 3-D scene descriptions generated from text.
  • If the latent space proves stable, downstream applications could add user-specified constraints directly in the diffusion stage without retraining.

Load-bearing premise

Jointly encoding SVG code together with its rendered pixel image creates a latent space from which diffusion can recover coherent, layered, and editable vector output.

What would settle it

Human or automatic evaluations showing that SVGFusion outputs contain more structural mismatches with the text prompt or lose editability compared with strong LLM-based baselines on the same prompts.

Figures

Figures reproduced from arXiv: 2412.10437 by Buyu Li, Dong Xu, Jing Zhang, Juncheng Hu, Qian Yu, Sheng Wang, Ximing Xing, Ziteng Xue.

Figure 1
Figure 1. Figure 1: Example SVGs generated by our SVGFusion. Our proposed method, SVGFusion can generate SVGs with (a) reason￾able construction, (b) a clear and systematic layering structure, and (c) highly editability. Abstract In this work, we introduce SVGFusion, a Text-to-SVG model capable of scaling to real-world SVG data with￾out relying on text-based discrete language models or prolonged Score Distillation Sampling (SD… view at source ↗
Figure 2
Figure 2. Figure 2: An Overview of SVGFusion. (a) The pipeline begins with the representation of SVGs, where XML-defined SVG code is converted into an SVG embedding (Sec. 3.1). (b) We first train a Vector-Pixel Fusion Variational Autoencoder (VP-VAE, Sec. 3.2) with a transformer-based architecture to learn a continuous latent space for SVGs by incorporating features from both SVG codes and their rendered images. (c) The Vecto… view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of the SVG embedding process. SVG code is initially converted into a matrix representation that in￾cludes geometric attributes, colors, and opacity. This matrix is subsequently mapped into a tensor via SVG embeddings. <rect>), commands (e.g. M, C in <path> element), and attributes (e.g. d, r or fill). Inspired by prior works [4, 58], we transform these instructions into a structured, rule-base… view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of the Vector-Pixel Fusion Encoding. The VP-VAE encoder integrates the SVG embeddings (Q) with pixel embeddings (K, V ) using a cross-attention layer. After processing through L self-attention layers, the encoded features are mapped to a latent space, where the mean and stan￾dard deviation are computed for a probabilistic representation. A latent variable z is sampled using the reparameterizat… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative Comparison of SVGFusion and Existing Text-to-SVG Methods. The target SVGs are in the emoji style. We use prompt modifiers for the optimization-based approach to encourage the appropriate style: “minimal flat 2D vector icon, emoji icon, lineal color, on a white background, trending on ArtStation.” Note that although the visual quality of results generated by optimization-based methods is high, t… view at source ↗
Figure 6
Figure 6. Figure 6: Path Rendering Sequence. SVGFusion is designed to align with human logic in SVG creation. The top diagram illustrates the left-to-right sequence of object placement, while the bottom diagram depicts the drawing order of an object from simple to complex. specific part and gradually add elements to complete the SVG. This coincides with the process of creating SVGs by human designers. Comparison with Large La… view at source ↗
Figure 8
Figure 8. Figure 8: Effects of VP-VAE and Rendering Sequence Modeling. (a) vs. (c) demonstrates that VP-VAE cannot accu￾rately reconstruct or generate shapes without the Vector-Pixel Fusion (incorporating DINOv2 visual prior [31]). (b) vs. (c) indicates that employing Rendering Sequence Modeling re￾sults in more reasonable SVG outcomes, due to the order of primitives being better aligned with human creation logic. and without… view at source ↗
read the original abstract

Generating high-quality Scalable Vector Graphics (SVGs) from text remains a significant challenge. Existing LLM-based models that generate SVG code as a flat token sequence struggle with poor structural understanding and error accumulation, while optimization-based methods are slow and yield uneditable outputs. To address these limitations, we introduce SVGFusion, a unified framework that adapts the VAE-diffusion architecture to bridge the dual code-visual nature of SVGs. Our model features two core components: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that learns a perceptually rich latent space by jointly encoding SVG code and its rendered image, and a Vector Space Diffusion Transformer (VS-DiT) that achieves globally coherent compositions through iterative refinement. Furthermore, this architecture is enhanced by a Rendering Sequence Modeling strategy, which ensures accurate object layering and occlusion. Evaluated on our novel SVGX-Dataset comprising 240k human-designed SVGs, SVGFusion establishes a new state-of-the-art, generating high-quality, editable SVGs that are strictly semantically aligned with the input text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces SVGFusion, a VAE-diffusion framework for text-to-SVG generation. It features a Vector-Pixel Fusion VAE (VP-VAE) that jointly encodes SVG code and rendered images to produce a latent space, a Vector Space Diffusion Transformer (VS-DiT) for iterative refinement of globally coherent compositions, and Rendering Sequence Modeling to handle layering and occlusion. The model is trained and evaluated on the new SVGX-Dataset of 240k human-designed SVGs and claims to achieve state-of-the-art results in generating high-quality, editable SVGs that are strictly semantically aligned with input text.

Significance. If the empirical claims hold, the work would offer a meaningful step forward in text-conditioned vector graphics synthesis by explicitly bridging the code and visual modalities of SVGs within a single latent space and diffusion process. The joint encoding strategy and sequence modeling for occlusion are conceptually well-motivated relative to prior LLM token-sequence or optimization-based baselines.

major comments (2)
  1. [Abstract] Abstract: The central claim that SVGFusion 'establishes a new state-of-the-art' is unsupported by any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes the performance assertion impossible to evaluate and is load-bearing for the paper's primary contribution.
  2. [Abstract] Abstract (VP-VAE description): The assertion that the Vector-Pixel Fusion VAE 'learns a perceptually rich latent space by jointly encoding SVG code and its rendered image' is presented without any supporting reconstruction loss values, latent-space alignment metrics, interpolation results, or ablation (e.g., pixel branch removed) demonstrating that the fusion step improves semantic fidelity or structural coherence for the downstream VS-DiT. This is the least-secured link between architecture and claimed performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on the abstract. We agree that the abstract must be revised to ensure all performance and architectural claims are directly supported by evidence reported in the manuscript, and we will make the necessary changes.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that SVGFusion 'establishes a new state-of-the-art' is unsupported by any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes the performance assertion impossible to evaluate and is load-bearing for the paper's primary contribution.

    Authors: We accept this criticism. The current abstract states the SOTA claim without embedding the supporting numbers or comparisons that appear later in the paper. In the revised version we will either (a) insert concise quantitative results (e.g., key FID, CLIP-score, or editability metrics versus the strongest baselines) or (b) qualify the claim to reflect exactly what the experiments demonstrate. This change will be made. revision: yes

  2. Referee: [Abstract] Abstract (VP-VAE description): The assertion that the Vector-Pixel Fusion VAE 'learns a perceptually rich latent space by jointly encoding SVG code and its rendered image' is presented without any supporting reconstruction loss values, latent-space alignment metrics, interpolation results, or ablation (e.g., pixel branch removed) demonstrating that the fusion step improves semantic fidelity or structural coherence for the downstream VS-DiT. This is the least-secured link between architecture and claimed performance.

    Authors: We agree that the abstract's phrasing for the VP-VAE currently lacks direct evidentiary anchors. The manuscript contains reconstruction losses, alignment metrics, and ablations for the fusion design in Sections 3 and 4; however, these are not referenced in the abstract. We will revise the abstract sentence to either cite the relevant quantitative improvements or adopt more measured language that does not overstate what is shown. This revision will be incorporated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is architectural description without self-referential reductions.

full rationale

The paper introduces SVGFusion via VP-VAE joint encoding and VS-DiT refinement, evaluated on SVGX-Dataset. The abstract and description contain no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim rests on the proposed architecture and external evaluation rather than any definitional loop or renamed known result. This qualifies as self-contained with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or postulated physical entities; the model components themselves are described at high level only.

pith-pipeline@v0.9.0 · 5740 in / 1170 out tokens · 32260 ms · 2026-05-23T07:19:13.504170+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  2. Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...

  3. mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

  4. LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

  5. Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

    cs.LG 2026-04 unverdicted novelty 7.0

    HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

  6. Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching

    cs.CV 2026-02 unverdicted novelty 7.0

    Stroke of Surprise is a framework that generates vector sketches undergoing semantic transformation from one concept to another by adding strokes, using dual-branch SDS and overlay loss for optimization.

  7. Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning

    cs.CV 2025-05 conditional novelty 7.0

    Reason-SVG adds a Drawing-with-Thought reasoning stage and GRPO-based reinforcement learning with a hybrid reward to improve LLM and VLM performance on accurate SVG generation.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 7 Pith papers · 7 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. 7

  2. [2]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet. https : / / www . anthropic.com/news/claude- 3- 5- sonnet ,

  3. [3]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22669–22679, 2023. 3

  4. [4]

    Deepsvg: A hierarchical genera- tive network for vector graphics animation

    Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. Deepsvg: A hierarchical genera- tive network for vector graphics animation. Advances in Neural Information Processing Systems (NeurIPS) , 33: 16351–16361, 2020. 2, 3, 4, 6, 14, 16

  5. [5]

    Pixart-$ \alpha$: Fast training of diffusion transformer for photorealistic text- to-image synthesis

    Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$ \alpha$: Fast training of diffusion transformer for photorealistic text- to-image synthesis. In The Twelfth International Confer- ence on Learning Representations (ICLR), 2024. 6

  6. [6]

    FIGR: Few-shot Image Generation with Reptile

    Louis Clou ˆatre and Marc Demers. Figr: Few- shot image generation with reptile. arXiv preprint arXiv:1901.02199, 2019. 3

  7. [7]

    Imagenet: A large-scale hierarchical im- age database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical im- age database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255,

  8. [8]

    Diffusion mod- els beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion mod- els beat gans on image synthesis. Advances in neural in- formation processing systems (NeurIPS), 34:8780–8794,

  9. [9]

    CLIP- Draw: Exploring text-to-drawing synthesis through language-image encoders

    Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIP- Draw: Exploring text-to-drawing synthesis through language-image encoders. In Advances in Neural Infor- mation Processing Systems (NeurIPS), 2022. 2, 3, 6, 7

  10. [10]

    Noto emoji fonts

    Google. Noto emoji fonts. https://github.com/ googlefonts/noto-emoji, 2014. 3, 6, 12

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 7

  12. [12]

    Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

    Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations (ICLR), 2024. 3

  13. [13]

    A neural representation of sketch drawings

    David Ha and Douglas Eck. A neural representation of sketch drawings. In International Conference on Learn- ing Representations (ICLR), 2018. 2, 3

  14. [14]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems (NeurIPS), 30, 2017. 6

  15. [15]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022. 3, 6

  16. [16]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), pages 6840– 6851, 2020. 3

  17. [17]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Pro- cessing Systems (NeurIPS), 35:8633–8646, 2022. 3

  18. [18]

    Supersvg: Superpixel- based scalable vector graphics synthesis

    Teng Hu, Ran Yi, Baihong Qian, Jiangning Zhang, Paul L Rosin, and Yu-Kun Lai. Supersvg: Superpixel- based scalable vector graphics synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24892–24901, 2024. 3

  19. [19]

    Word-as-image for seman- tic typography

    Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for seman- tic typography. ACM Transactions on Graphics (TOG), 42(4), 2023. 6, 7

  20. [20]

    Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2023. 2, 3, 6, 7

  21. [21]

    Differentiable vector graphics rasterization for editing and learning

    Tzu-Mao Li, Michal Luk ´aˇc, Gharbi Micha ¨el, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020. 2, 3, 7

  22. [22]

    Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024. 3

  23. [23]

    A learned representation for scalable vector graphics

    Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. In Proceedings of the IEEE/CVF Inter- national Conference on Computer Vision (ICCV) , 2019. 3, 6

  24. [24]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Sys- tems (NeurIPS), 35:5775–5787, 2022. 6, 12 9

  25. [25]

    Sit: Exploring flow and diffusion-based genera- tive models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based genera- tive models with scalable interpolant transformers. In Proceedings of the European Conference on Computer Vision (ECCV), 2024. 3

  26. [26]

    Towards layer-wise image vectorization

    Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. Towards layer-wise image vectorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16314–16323, 2022. 3, 7

  27. [27]

    Fluent emoji

    Microsoft. Fluent emoji. https://github.com/ microsoft/fluentui-emoji, 2021. 6, 12

  28. [28]

    Im- proved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Im- proved denoising diffusion probabilistic models. In International conference on machine learning (ICLR) , pages 8162–8171, 2021. 3

  29. [29]

    GLIDE: Towards pho- torealistic image generation and editing with text-guided diffusion models

    Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards pho- torealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th Interna- tional Conference on Machine Learning (ICML) , pages 16784–16804, 2022. 3

  30. [30]

    Introducing chatgpt

    OpenAI. Introducing chatgpt. https://openai. com/index/chatgpt/, 2023. 12

  31. [31]

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fer- nandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat...

  32. [32]

    Scalable diffu- sion models with transformers

    William Peebles and Saining Xie. Scalable diffu- sion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 2, 3, 6, 8

  33. [33]

    SDXL: Improving latent diffu- sion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffu- sion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Represen- tations (ICLR), 2024. 3

  34. [34]

    Barron, and Ben Mildenhall

    Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023. 3

  35. [35]

    Learning transferable visual models from natural lan- guage supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas- try, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. In International Conference on Ma- chine Learning (ICML), pages 8748–8763. PMLR, 2021. 2, 3, 6, 7

  36. [36]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022. 3

  37. [37]

    Im2vec: Synthesizing vector graph- ics without vector supervision

    Pradyumna Reddy, Michael Gharbi, Michal Lukac, and Niloy J Mitra. Im2vec: Synthesizing vector graph- ics without vector supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7342–7351, 2021. 3

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022. 2, 3, 6, 7

  39. [39]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Sali- mans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neu- ral Information Processing Systems (NeurIPS) , pages 36479–36494, 2022. 3

  40. [40]

    Improved aesthetic predictor

    Christoph Schuhmann. Improved aesthetic predictor. https : / / github . com / christophschuhmann / improved - aesthetic-predictor, 2022. 6

  41. [41]

    Make-a-video: Text-to-video generation without text-video data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations (ICLR) , 2023. 3

  42. [42]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Mah- eswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), pages 2256–2265, 2015. 3

  43. [43]

    De- noising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising diffusion implicit models. In International Con- ference on Learning Representations (ICLR), 2021. 12

  44. [44]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. InAdvances in Neural Information Processing Systems (NeurIPS) , 2019

  45. [45]

    Score- based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score- based generative modeling through stochastic differential equations. In International Conference on Learning Rep- resentations (ICLR), 2021. 3

  46. [46]

    Clipvg: Text-guided image manipulation using differentiable vector graphics

    Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Zhongliang Jing, and Minzhe Li. Clipvg: Text-guided image manipulation using differentiable vector graphics. In Proceedings of the Conference on Artificial Intelli- gence (AAAI), 2023. 2, 3

  47. [47]

    If by deepfloyd lab at stabilityai

    StabilityAI. If by deepfloyd lab at stabilityai. https: //github.com/deep-floyd/IF, 2023. 3 10

  48. [48]

    Roformer: Enhanced trans- former with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced trans- former with rotary position embedding. Neurocomput., 568(C), 2024. 12

  49. [49]

    Strokenuwa: Tokeniz- ing strokes for vector graphic synthesis

    Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, et al. Strokenuwa: Tokeniz- ing strokes for vector graphic synthesis. arXiv preprint arXiv:2401.17093, 2024. 2, 3, 6, 16

  50. [50]

    Vecfusion: Vector font generation with diffusion

    Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Micha¨el Gharbi, Oliver Wang, Alec Ja- cobson, and Evangelos Kalogerakis. Vecfusion: Vector font generation with diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7943–7952, 2024. 3

  51. [51]

    Nivel: Neural implicit vector layers for text-to-vector generation

    Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanx- uan Zhao, Evangelos Kalogerakis, and Michal Lukac. Nivel: Neural implicit vector layers for text-to-vector generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 4589–4597, 2024. 3

  52. [52]

    Modern evolution strategies for creativity: Fitting concrete images and abstract con- cepts

    Yingtao Tian and David Ha. Modern evolution strategies for creativity: Fitting concrete images and abstract con- cepts. In Artificial Intelligence in Music, Sound, Art and Design, pages 275–291. Springer, 2022. 2, 3, 6, 7

  53. [53]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Bap- tiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient founda- tion language models. ArXiv, abs/2302.13971, 2023. 12

  54. [54]

    Twitter color emoji svginot font

    Twitter. Twitter color emoji svginot font. https:// github.com/13rac1/twemoji- color- font ,

  55. [55]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2017. 12

  56. [56]

    Clipasso: Semantically-aware object sketching

    Yael Vinker, Ehsan Pajouheshgar, Jessica Y Bo, Ro- man Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022. 2, 3

  57. [57]

    Yeh, and Greg Shakhnarovich

    Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chain- ing: Lifting pretrained 2d diffusion models for 3d gen- eration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12619–12629, 2023. 3

  58. [58]

    Deepvecfont: Synthesiz- ing high-quality vector fonts via dual-modality learning

    Yizhi Wang and Zhouhui Lian. Deepvecfont: Synthesiz- ing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6), 2021. 2, 3, 4, 14, 16

  59. [59]

    Deepvecfont-v2: Exploiting trans- formers to synthesize vector fonts with higher quality

    Yuqing Wang, Yizhi Wang, Longhui Yu, Yuesheng Zhu, and Zhouhui Lian. Deepvecfont-v2: Exploiting trans- formers to synthesize vector fonts with higher quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 18320– 18328, 2023. 2, 16

  60. [60]

    Prolificdreamer: High-fidelity and diverse text-to-3d generation with vari- ational score distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with vari- ational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS) ,

  61. [61]

    Reshot - free icons & illustrations

    ReShot website. Reshot - free icons & illustrations. de- sign freely with instant downloads and commercial li- censes. https://www.reshot.com/, . 6, 12

  62. [62]

    Open-licensed svg vector and icons

    SVGRepo website. Open-licensed svg vector and icons. https://www.svgrepo.com/, . 6, 12

  63. [63]

    Icon- shop: Text-based vector icon synthesis with autoregressive transformers

    Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. Iconshop: Text-based vector icon synthesis with autore- gressive transformers. arXiv preprint arXiv:2304.14400,

  64. [64]

    2, 3, 4, 6, 7, 14, 16

  65. [65]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Pro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2096–2105, 2023. 6

  66. [66]

    Diffsketcher: Text guided vector sketch synthesis through latent diffusion models

    Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 2, 3, 6, 7

  67. [67]

    Empowering llms to understand and generate complex vector graphics

    Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. arXiv preprint arXiv:2412.11102, 2024. 2

  68. [68]

    Svgdreamer++: Advancing editability and diversity in text-guided svg generation

    Ximing Xing, Qian Yu, Chuang Wang, Haitao Zhou, Jing Zhang, and Dong Xu. Svgdreamer++: Advancing editability and diversity in text-guided svg generation. arXiv preprint arXiv:2411.17832, 2024. 3

  69. [69]

    Svgdreamer: Text guided svg generation with diffusion model

    Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. Svgdreamer: Text guided svg generation with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4546–4555, 2024. 2, 3, 6, 7

  70. [70]

    arrow”, “circle

    Peiying Zhang, Nanxuan Zhao, and Jing Liao. Text-to- vector generation with neural path representation. ACM Trans. Graph., 43(4), 2024. 3, 14 11 SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion Supplementary Material Overview This supplementary material provides additional details and analyses related to SVGFusion, organized as fol- l...