SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation

Buyu Li; Dong Xu; Jing Zhang; Juncheng Hu; Qian Yu; Sheng Wang; Ximing Xing; Ziteng Xue

arXiv: 2412.10437 · v3 · submitted 2024-12-11 · cs.CV · cs.GR · cs.LG

Ximing Xing, Juncheng Hu, Ziteng Xue, Jing Zhang, Buyu Li, Sheng Wang, Dong Xu, Qian Yu

Pith reviewed 2026-05-23 07:19 UTC · model grok-4.3

keywords: SVG generation, text-to-vector, VAE, diffusion transformer, vector graphics, generative models, latent space fusion, rendering sequence

The pith

SVGFusion fuses SVG code and rendered pixels in a VAE then diffuses the result to produce editable text-aligned vector graphics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to generate Scalable Vector Graphics from text prompts in a way that preserves editability and structural coherence. Existing sequence-based generators accumulate errors in object relations while optimization approaches produce fixed outputs that cannot be modified afterward. SVGFusion therefore trains a single VAE to encode both the raw SVG commands and the pixel image they produce, creating a shared latent space. A diffusion transformer then samples from this space while a separate sequence model enforces correct layering order. If the approach holds, text-to-vector generation becomes both faster and more practical for downstream editing tasks.

Core claim

The central claim is that a Vector-Pixel Fusion VAE jointly encoding SVG code and its rendered image learns a latent space rich enough for a Vector Space Diffusion Transformer to perform iterative refinement, and that adding Rendering Sequence Modeling ensures correct object layering and occlusion, yielding high-quality editable SVGs that remain strictly aligned with the input text on a 240k-example dataset.

What carries the argument

The Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that jointly encodes SVG code and its rendered image to produce the latent space operated on by the diffusion transformer.
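A minimal sketch of the fusion step, following the Figure 4 caption reproduced on this page: SVG embeddings act as queries, pixel embeddings act as keys and values, and a latent is then sampled with the reparameterization trick. The token values, the dimensions, and the choice to read the mean and log standard deviation off the two halves of each fused vector are illustrative assumptions. The L self-attention layers and all learned projections are omitted, so this is not the authors' implementation.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


def reparameterize(mean, log_std, rng):
    """Sample z = mean + std * eps with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


rng = random.Random(0)
# Hypothetical toy inputs: 3 SVG tokens (Q) and 5 image-patch tokens (K, V), width 4.
svg_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
pixel_tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]

fused = cross_attention(svg_tokens, pixel_tokens, pixel_tokens)

# Assumed latent head: first half of each fused vector is the mean,
# second half is the log standard deviation.
latents = [reparameterize(row[:2], row[2:], rng) for row in fused]
print(len(latents), len(latents[0]))
```

Each fused SVG token is a convex combination of pixel tokens, which is one simple way to picture how visual information could reach the vector branch before the latent is sampled.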

If this is right

The diffusion process produces globally coherent compositions through iterative denoising rather than one-shot token prediction.
Rendering Sequence Modeling enforces correct depth ordering so overlapping objects appear in the intended visual sequence.
Outputs remain fully editable in standard vector tools because the model emits native SVG commands rather than raster approximations.
The method scales to a 240k-example corpus of human-designed SVGs and reports state-of-the-art alignment metrics.
The same architecture avoids both the error accumulation of flat token sequences and the slow per-example optimization of earlier approaches.
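The layering point above rests on a general property of SVG: elements are painted in document order, so later elements cover earlier ones. The toy rasterizer below is a sketch of that property only, using hypothetical axis-aligned rectangles on a character grid. It does not reproduce the paper's Rendering Sequence Modeling.

```python
def render(shapes, width=6, height=4, background="."):
    """Paint rectangles in list order; later shapes overwrite earlier ones."""
    canvas = [[background] * width for _ in range(height)]
    for label, x, y, w, h in shapes:
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                canvas[row][col] = label
    return ["".join(line) for line in canvas]


# Hypothetical primitives: (label, x, y, width, height).
face = ("F", 0, 0, 6, 4)
eye = ("e", 1, 1, 2, 1)

good = render([face, eye])   # background shape first, detail on top
bad = render([eye, face])    # same shapes, wrong order: the eye is occluded

print("\n".join(good))
print()
print("\n".join(bad))
```

The two calls use the same primitives and differ only in order, which is why a generator that emits correct shapes in the wrong sequence can still produce a wrong image.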

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The joint code-image latent space may transfer editing operations learned on pixels back into editable vector commands more reliably than pure code models.
Because the VAE sees both modalities, the same framework could be tested on other hybrid representations such as LaTeX or HTML that also have visual renderings.
Layer-order modeling might generalize to tasks requiring consistent depth ordering in 3-D scene descriptions generated from text.
If the latent space proves stable, downstream applications could add user-specified constraints directly in the diffusion stage without retraining.

Load-bearing premise

Jointly encoding SVG code together with its rendered pixel image creates a latent space from which diffusion can recover coherent, layered, and editable vector output.

What would settle it

Human or automatic evaluations showing that SVGFusion outputs contain more structural mismatches with the text prompt or lose editability compared with strong LLM-based baselines on the same prompts.

Figures

Figures reproduced from arXiv: 2412.10437 by Buyu Li, Dong Xu, Jing Zhang, Juncheng Hu, Qian Yu, Sheng Wang, Ximing Xing, Ziteng Xue.

**Figure 1.** Example SVGs generated by our SVGFusion. Our proposed method, SVGFusion, can generate SVGs with (a) reasonable construction, (b) a clear and systematic layering structure, and (c) high editability.

**Figure 2.** An overview of SVGFusion. (a) The pipeline begins with the representation of SVGs, where XML-defined SVG code is converted into an SVG embedding (Sec. 3.1). (b) We first train a Vector-Pixel Fusion Variational Autoencoder (VP-VAE, Sec. 3.2) with a transformer-based architecture to learn a continuous latent space for SVGs by incorporating features from both SVG codes and their rendered images. (c) The Vecto…

**Figure 3.** Illustration of the SVG embedding process. SVG code is initially converted into a matrix representation that includes geometric attributes, colors, and opacity. This matrix is subsequently mapped into a tensor via SVG embeddings.

**Figure 4.** Illustration of the Vector-Pixel Fusion Encoding. The VP-VAE encoder integrates the SVG embeddings (Q) with pixel embeddings (K, V) using a cross-attention layer. After processing through L self-attention layers, the encoded features are mapped to a latent space, where the mean and standard deviation are computed for a probabilistic representation. A latent variable z is sampled using the reparameterizat…

**Figure 5.** Qualitative comparison of SVGFusion and existing Text-to-SVG methods. The target SVGs are in the emoji style. We use prompt modifiers for the optimization-based approach to encourage the appropriate style: “minimal flat 2D vector icon, emoji icon, lineal color, on a white background, trending on ArtStation.” Note that although the visual quality of results generated by optimization-based methods is high, t…

**Figure 6.** Path Rendering Sequence. SVGFusion is designed to align with human logic in SVG creation. The top diagram illustrates the left-to-right sequence of object placement, while the bottom diagram depicts the drawing order of an object from simple to complex.

**Figure 8.** Effects of VP-VAE and Rendering Sequence Modeling. (a) vs. (c) demonstrates that VP-VAE cannot accurately reconstruct or generate shapes without the Vector-Pixel Fusion (incorporating DINOv2 visual prior [31]). (b) vs. (c) indicates that employing Rendering Sequence Modeling results in more reasonable SVG outcomes, due to the order of primitives being better aligned with human creation logic.
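The Figure 3 caption describes turning SVG code into a matrix of geometric attributes, colors, and opacity. A rough sketch of that idea follows, using Python's standard XML parser. The row layout, the type codes, and the restriction to `rect` and `circle` elements with hex fills are hypothetical choices made for illustration; the caption does not specify the paper's actual layout.

```python
import xml.etree.ElementTree as ET

SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <rect x="10" y="20" width="30" height="40" fill="#ff0000" opacity="0.5"/>
  <circle cx="50" cy="50" r="25" fill="#00ff00"/>
</svg>"""

# Hypothetical encoding: one row per element,
# [type_id, g0, g1, g2, g3, red, green, blue, opacity].
ELEMENT_IDS = {"rect": 0, "circle": 1}
GEOMETRY = {"rect": ("x", "y", "width", "height"), "circle": ("cx", "cy", "r")}
MAX_GEOM = 4


def hex_to_rgb(value):
    """Convert '#rrggbb' to three floats in [0, 1]."""
    value = value.lstrip("#")
    return [int(value[i:i + 2], 16) / 255.0 for i in (0, 2, 4)]


def svg_to_matrix(svg_text):
    """Return one fixed-width row per supported element, in document order."""
    rows = []
    for node in ET.fromstring(svg_text).iter():
        tag = node.tag.split("}")[-1]  # drop the XML namespace prefix
        if tag not in ELEMENT_IDS:
            continue
        geom = [float(node.get(name, 0)) for name in GEOMETRY[tag]]
        geom += [0.0] * (MAX_GEOM - len(geom))  # pad shorter geometries
        rgb = hex_to_rgb(node.get("fill", "#000000"))
        opacity = float(node.get("opacity", 1.0))
        rows.append([float(ELEMENT_IDS[tag])] + geom + rgb + [opacity])
    return rows


for row in svg_to_matrix(SVG):
    print(row)
```

Rows keep document order, so the matrix also preserves the paint order that determines which element ends up on top.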

Original abstract

Generating high-quality Scalable Vector Graphics (SVGs) from text remains a significant challenge. Existing LLM-based models that generate SVG code as a flat token sequence struggle with poor structural understanding and error accumulation, while optimization-based methods are slow and yield uneditable outputs. To address these limitations, we introduce SVGFusion, a unified framework that adapts the VAE-diffusion architecture to bridge the dual code-visual nature of SVGs. Our model features two core components: a Vector-Pixel Fusion Variational Autoencoder (VP-VAE) that learns a perceptually rich latent space by jointly encoding SVG code and its rendered image, and a Vector Space Diffusion Transformer (VS-DiT) that achieves globally coherent compositions through iterative refinement. Furthermore, this architecture is enhanced by a Rendering Sequence Modeling strategy, which ensures accurate object layering and occlusion. Evaluated on our novel SVGX-Dataset comprising 240k human-designed SVGs, SVGFusion establishes a new state-of-the-art, generating high-quality, editable SVGs that are strictly semantically aligned with the input text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SVGFusion puts forward a hybrid VP-VAE plus VS-DiT architecture for text-to-editable-SVG that targets structural and semantic issues, but the abstract supplies no metrics to back the SOTA claim or the value of the joint encoding.

The main point is a new VAE-diffusion setup for turning text into editable SVGs. It uses a VP-VAE that encodes both the vector code and the rendered pixel image to build a shared latent space, then a VS-DiT that does iterative refinement, plus a rendering sequence step to manage layering and occlusion. They also release a 240k SVG dataset. This directly tackles the flat-token problems in LLM-based SVG generators and the slowness of optimization methods, which is a sensible direction for the dual code-visual character of SVGs.

Referee Report

2 major / 0 minor

Summary. The paper introduces SVGFusion, a VAE-diffusion framework for text-to-SVG generation. It features a Vector-Pixel Fusion VAE (VP-VAE) that jointly encodes SVG code and rendered images to produce a latent space, a Vector Space Diffusion Transformer (VS-DiT) for iterative refinement of globally coherent compositions, and Rendering Sequence Modeling to handle layering and occlusion. The model is trained and evaluated on the new SVGX-Dataset of 240k human-designed SVGs and claims to achieve state-of-the-art results in generating high-quality, editable SVGs that are strictly semantically aligned with input text.

Significance. If the empirical claims hold, the work would offer a meaningful step forward in text-conditioned vector graphics synthesis by explicitly bridging the code and visual modalities of SVGs within a single latent space and diffusion process. The joint encoding strategy and sequence modeling for occlusion are conceptually well-motivated relative to prior LLM token-sequence or optimization-based baselines.

major comments (2)

[Abstract] Abstract: The central claim that SVGFusion 'establishes a new state-of-the-art' is unsupported by any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes the performance assertion impossible to evaluate and is load-bearing for the paper's primary contribution.
[Abstract] Abstract (VP-VAE description): The assertion that the Vector-Pixel Fusion VAE 'learns a perceptually rich latent space by jointly encoding SVG code and its rendered image' is presented without any supporting reconstruction loss values, latent-space alignment metrics, interpolation results, or ablation (e.g., pixel branch removed) demonstrating that the fusion step improves semantic fidelity or structural coherence for the downstream VS-DiT. This is the least-secured link between architecture and claimed performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on the abstract. We agree that the abstract must be revised to ensure all performance and architectural claims are directly supported by evidence reported in the manuscript, and we will make the necessary changes.

Referee: [Abstract] Abstract: The central claim that SVGFusion 'establishes a new state-of-the-art' is unsupported by any reported quantitative metrics, baseline comparisons, ablation studies, or error analysis. This absence makes the performance assertion impossible to evaluate and is load-bearing for the paper's primary contribution.

Authors: We accept this criticism. The current abstract states the SOTA claim without embedding the supporting numbers or comparisons that appear later in the paper. In the revised version we will either (a) insert concise quantitative results (e.g., key FID, CLIP-score, or editability metrics versus the strongest baselines) or (b) qualify the claim to reflect exactly what the experiments demonstrate. This change will be made. revision: yes
Referee: [Abstract] Abstract (VP-VAE description): The assertion that the Vector-Pixel Fusion VAE 'learns a perceptually rich latent space by jointly encoding SVG code and its rendered image' is presented without any supporting reconstruction loss values, latent-space alignment metrics, interpolation results, or ablation (e.g., pixel branch removed) demonstrating that the fusion step improves semantic fidelity or structural coherence for the downstream VS-DiT. This is the least-secured link between architecture and claimed performance.

Authors: We agree that the abstract's phrasing for the VP-VAE currently lacks direct evidentiary anchors. The manuscript contains reconstruction losses, alignment metrics, and ablations for the fusion design in Sections 3 and 4; however, these are not referenced in the abstract. We will revise the abstract sentence to either cite the relevant quantitative improvements or adopt more measured language that does not overstate what is shown. This revision will be incorporated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is architectural description without self-referential reductions.

The paper introduces SVGFusion via VP-VAE joint encoding and VS-DiT refinement, evaluated on SVGX-Dataset. The abstract and description contain no equations, fitted parameters, or predictions that reduce to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The central claim rests on the proposed architecture and external evaluation rather than any definitional loop or renamed known result. This qualifies as self-contained with score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, mathematical axioms, or postulated physical entities; the model components themselves are described at high level only.

pith-pipeline@v0.9.0 · 5740 in / 1170 out tokens · 32260 ms · 2026-05-23T07:19:13.504170+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniqueness) unclear

VP-VAE ... jointly encoding SVG code and its rendered image ... Rendering Sequence Modeling strategy ... VS-DiT ... diffusion process in the latent space

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat (8-tick / orbit structure) unclear

N = 1024 ... incremental accumulation of SVG codes ... progressive sequence of drawing steps

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

no reconstruction loss, latent interpolation results, or ablation showing that removing the pixel branch degrades semantic fidelity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback
cs.CV 2026-04 unverdicted novelty 7.0

Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation
cs.CV 2026-04 unverdicted novelty 7.0

LottieGPT tokenizes Lottie animations into compact sequences and fine-tunes Qwen-VL to autoregressively generate coherent vector animations from natural language or visual prompts, outperforming prior SVG models.

Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
cs.LG 2026-04 unverdicted novelty 7.0

HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

Stroke of Surprise: Progressive Semantic Illusions in Vector Sketching
cs.CV 2026-02 unverdicted novelty 7.0

Stroke of Surprise is a framework that generates vector sketches undergoing semantic transformation from one concept to another by adding strokes, using dual-branch SDS and overlay loss for optimization.

Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning
cs.CV 2025-05 conditional novelty 7.0

Reason-SVG adds a Drawing-with-Thought reasoning stage and GRPO-based reinforcement learning with a hybrid reward to improve LLM and VLM performance on accurate SVG generation.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 7 Pith papers · 7 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2]

Claude 3.5 sonnet

Anthropic. Claude 3.5 sonnet. https://www.anthropic.com/news/claude-3-5-sonnet.
[3]

All are worth words: A vit backbone for diffusion models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22669–22679, 2023.
[4]

Deepsvg: A hierarchical generative network for vector graphics animation

Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. Deepsvg: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems (NeurIPS), 33:16351–16361, 2020.
[5]

Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[6]

FIGR: Few-shot Image Generation with Reptile

Louis Clouâtre and Marc Demers. Figr: Few-shot image generation with reptile. arXiv preprint arXiv:1901.02199, 2019.
[7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[8]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems (NeurIPS), 34:8780–8794.
[9]

CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders

Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[10]

Noto emoji fonts

Google. Noto emoji fonts. https://github.com/googlefonts/noto-emoji, 2014.
[11]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[12]

Animatediff: Animate your personalized text-to-image diffusion models without specific tuning

Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[13]

A neural representation of sketch drawings

David Ha and Douglas Eck. A neural representation of sketch drawings. In International Conference on Learning Representations (ICLR), 2018.
[14]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems (NeurIPS), 30, 2017.
[15]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
[16]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), pages 6840–6851, 2020.
[17]

Video diffusion models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems (NeurIPS), 35:8633–8646, 2022.
[18]

Supersvg: Superpixel-based scalable vector graphics synthesis

Teng Hu, Ran Yi, Baihong Qian, Jiangning Zhang, Paul L Rosin, and Yu-Kun Lai. Supersvg: Superpixel-based scalable vector graphics synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24892–24901, 2024.
[19]

Word-as-image for semantic typography

Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM Transactions on Graphics (TOG), 42(4), 2023.
[20]

Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models

Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[21]

Differentiable vector graphics rasterization for editing and learning

Tzu-Mao Li, Michal Lukáč, Michaël Gharbi, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020.
[22]

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
[23]

A learned representation for scalable vector graphics

Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vector graphics. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
[24]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems (NeurIPS), 35:5775–5787, 2022.
[25]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In Proceedings of the European Conference on Computer Vision (ECCV), 2024.
[26]

Towards layer-wise image vectorization

Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. Towards layer-wise image vectorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16314–16323, 2022.
[27]

Fluent emoji

Microsoft. Fluent emoji. https://github.com/microsoft/fluentui-emoji, 2021.
[28]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning (ICML), pages 8162–8171, 2021.
[29]

GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models

Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 16784–16804, 2022.
[30]

Introducing chatgpt

OpenAI. Introducing chatgpt. https://openai.com/index/chatgpt/, 2023.
[31]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Pat...

work page 2024
[32]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023.
[33]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
[34]

Dreamfusion: Text-to-3d using 2d diffusion

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
[35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021.
[36]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[37]

Im2vec: Synthesizing vector graphics without vector supervision

Pradyumna Reddy, Michael Gharbi, Michal Lukac, and Niloy J Mitra. Im2vec: Synthesizing vector graphics without vector supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7342–7351, 2021.
[38]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
[39]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems (NeurIPS), pages 36479–36494, 2022.
[40]

Improved aesthetic predictor

Christoph Schuhmann. Improved aesthetic predictor. https://github.com/christophschuhmann/improved-aesthetic-predictor, 2022.
[41]

Make-a-video: Text-to-video generation without text-video data

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
[42]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning (ICML), pages 2256–2265, 2015.
[43]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021.
[44]

Generative modeling by estimating gradients of the data distribution

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (NeurIPS), 2019.

work page 2019
[45]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations (ICLR), 2021. 3

work page 2021
[46]

Clipvg: Text-guided image manipulation using differentiable vector graphics

Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Zhongliang Jing, and Minzhe Li. Clipvg: Text-guided image manipulation using differentiable vector graphics. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2023. 2, 3

work page 2023
[47]

If by deepfloyd lab at stabilityai

StabilityAI. If by deepfloyd lab at stabilityai. https://github.com/deep-floyd/IF, 2023. 3

work page 2023
[48]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomput., 568(C), 2024. 12

work page 2024
[49]

Strokenuwa: Tokenizing strokes for vector graphic synthesis

Zecheng Tang, Chenfei Wu, Zekai Zhang, Mingheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, et al. Strokenuwa: Tokenizing strokes for vector graphic synthesis. arXiv preprint arXiv:2401.17093, 2024. 2, 3, 6, 16

work page arXiv 2024
[50]

Vecfusion: Vector font generation with diffusion

Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michaël Gharbi, Oliver Wang, Alec Jacobson, and Evangelos Kalogerakis. Vecfusion: Vector font generation with diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7943–7952, 2024. 3

work page 2024
[51]

Nivel: Neural implicit vector layers for text-to-vector generation

Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, and Michal Lukac. Nivel: Neural implicit vector layers for text-to-vector generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4589–4597, 2024. 3

work page 2024
[52]

Modern evolution strategies for creativity: Fitting concrete images and abstract concepts

Yingtao Tian and David Ha. Modern evolution strategies for creativity: Fitting concrete images and abstract concepts. In Artificial Intelligence in Music, Sound, Art and Design, pages 275–291. Springer, 2022. 2, 3, 6, 7

work page 2022
[53]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023. 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Twitter color emoji svginot font

Twitter. Twitter color emoji svginot font. https://github.com/13rac1/twemoji-color-font,

work page
[55]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS). Curran Associates, Inc., 2017. 12

work page 2017
[56]

Clipasso: Semantically-aware object sketching

Yael Vinker, Ehsan Pajouheshgar, Jessica Y Bo, Roman Christian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG), 41(4):1–11, 2022. 2, 3

work page 2022
[57]

Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation

Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12619–12629, 2023. 3

work page 2023
[58]

Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning

Yizhi Wang and Zhouhui Lian. Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG), 40(6), 2021. 2, 3, 4, 14, 16

work page 2021
[59]

Deepvecfont-v2: Exploiting transformers to synthesize vector fonts with higher quality

Yuqing Wang, Yizhi Wang, Longhui Yu, Yuesheng Zhu, and Zhouhui Lian. Deepvecfont-v2: Exploiting transformers to synthesize vector fonts with higher quality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18320–18328, 2023. 2, 16

work page 2023
[60]

Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation

Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS),

work page
[61]

Reshot - free icons & illustrations

ReShot website. Reshot - free icons & illustrations. design freely with instant downloads and commercial licenses. https://www.reshot.com/, . 6, 12

work page
[62]

Open-licensed svg vector and icons

SVGRepo website. Open-licensed svg vector and icons. https://www.svgrepo.com/, . 6, 12

work page
[63]

Iconshop: Text-based vector icon synthesis with autoregressive transformers

Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. Iconshop: Text-based vector icon synthesis with autoregressive transformers. arXiv preprint arXiv:2304.14400, 2023. 2, 3, 4, 6, 7, 14, 16

work page arXiv
[65]

Human preference score: Better aligning text-to-image models with human preference

Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2096–2105, 2023. 6

work page 2023
[66]

Diffsketcher: Text guided vector sketch synthesis through latent diffusion models

Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 2, 3, 6, 7

work page 2023
[67]

Empowering llms to understand and generate complex vector graphics

Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. arXiv preprint arXiv:2412.11102, 2024. 2

work page arXiv 2024
[68]

Svgdreamer++: Advancing editability and diversity in text-guided svg generation

Ximing Xing, Qian Yu, Chuang Wang, Haitao Zhou, Jing Zhang, and Dong Xu. Svgdreamer++: Advancing editability and diversity in text-guided svg generation. arXiv preprint arXiv:2411.17832, 2024. 3

work page arXiv 2024
[69]

Svgdreamer: Text guided svg generation with diffusion model

Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. Svgdreamer: Text guided svg generation with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4546–4555, 2024. 2, 3, 6, 7

work page 2024
[70]

Text-to-vector generation with neural path representation

Peiying Zhang, Nanxuan Zhao, and Jing Liao. Text-to-vector generation with neural path representation. ACM Trans. Graph., 43(4), 2024. 3, 14

SVGFusion: Scalable Text-to-SVG Generation via Vector Space Diffusion. Supplementary Material. Overview: This supplementary material provides additional details and analyses related to SVGFusion, organized as fol- l...

work page 2024
