arxiv: 2505.24499 · v2 · submitted 2025-05-30 · 💻 cs.CV

Reason-SVG: Enhancing Structured Reasoning for Vector Graphics Generation with Reinforcement Learning

Pith reviewed 2026-05-19 12:56 UTC · model grok-4.3

classification 💻 cs.CV
keywords SVG generation · structured reasoning · reinforcement learning · Drawing-with-Thought · large language models · vector graphics · hybrid reward · supervised fine-tuning

The pith

Explicit design reasoning during training lets language models create more accurate vector graphics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that large language models can generate better Scalable Vector Graphics when they are trained to output explicit design thoughts alongside the code itself. The approach uses supervised fine-tuning on reasoned examples followed by reinforcement learning driven by a hybrid reward that scores both the reasoning quality and the final graphic's structure, semantics, and appearance. A reader would care because current models frequently produce invalid paths, wrong shapes, or visuals that do not match the prompt, limiting practical use in illustration and design. The work also supplies a 10,000-pair dataset of SVG code paired with its design rationale to make the method reproducible.

Core claim

Reason-SVG introduces the Drawing-with-Thought paradigm in which the model must generate both SVG code and explicit design rationales. A first supervised stage on the SVGX-DwT-10k dataset builds basic reasoning ability, after which reinforcement learning with Group Relative Policy Optimization and a hybrid reward refines the outputs for structural validity, semantic alignment, and visual coherence, yielding measurable gains for both language models and vision-language models.
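
The group-relative step of GRPO can be illustrated with a short sketch. In GRPO as commonly described, several responses are sampled for the same prompt, each is scored, and each score is normalized against the mean and standard deviation of its own group instead of a learned value baseline. The function name, the `eps` constant, the choice of population standard deviation, and the example rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` (an assumed small constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical hybrid-reward scores for four SVG candidates of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Responses that score above their group's mean receive a positive advantage and are reinforced; those below receive a negative one. Because only the ranking within a group matters, any systematic bias in the reward carries straight through to the policy update.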

What carries the argument

The Drawing-with-Thought (DwT) paradigm, in which the model produces both SVG code and explicit design rationales that guide generation.

If this is right

  • Generated SVGs exhibit higher rates of structural validity with fewer broken paths or overlapping shapes.
  • Semantic alignment improves so that the graphics more closely reflect the content of the input description.
  • Visual coherence rises, producing results that appear more polished without direct pixel supervision.
  • The same pipeline lifts performance on both pure language models and those that also process images.
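
As a rough illustration of what an automated structural-validity check might involve, the sketch below treats an SVG as valid when it parses as well-formed XML, has an `svg` root, and contains at least one drawable element. This is a coarse proxy chosen for illustration; the helper names and the set of drawable tags are assumptions, and the paper's own validity criterion is not described in this review.

```python
import xml.etree.ElementTree as ET

# Basic shape and text elements treated as "drawable" in this sketch (assumed set).
DRAWABLE = {"path", "rect", "circle", "ellipse", "line",
            "polyline", "polygon", "text"}


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://www.w3.org/2000/svg}svg'."""
    return tag.rsplit("}", 1)[-1]


def is_structurally_valid(svg_code):
    """Return True for well-formed XML with an <svg> root and a drawable child."""
    try:
        root = ET.fromstring(svg_code)
    except ET.ParseError:
        return False
    if local_name(root.tag) != "svg":
        return False
    return any(local_name(el.tag) in DRAWABLE for el in root.iter())


good = ('<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">'
        '<circle cx="5" cy="5" r="4"/></svg>')
broken = '<svg><path d="M0 0 L1 1"></svg>'  # <path> is never closed

good_ok = is_structurally_valid(good)
broken_ok = is_structurally_valid(broken)
```

A check of this kind catches unclosed tags and empty documents but says nothing about overlapping shapes or malformed path data, which would need a renderer or a path parser.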

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same explicit-reasoning training pattern could be applied to other structured code outputs such as HTML layouts or diagram specifications.
  • Well-designed hybrid rewards might reduce the volume of human-labeled data needed for creative generation tasks.
  • The stored rationales could later support interactive editing, where a user modifies the reasoning steps rather than the code directly.

Load-bearing premise

The hybrid reward function is assumed to correctly measure the presence of useful design reasoning together with structural, semantic, and visual quality without the model learning to exploit scoring loopholes.
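
To make the premise concrete, here is a minimal sketch of a composite reward over the four components named in the abstract, together with one possible guard against a simple loophole. The weights, the component ranges, and the structural gate are all hypothetical; the paper's actual formulation and coefficients are not given in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParts:
    reasoning: float  # presence/effectiveness of the design rationale, in [0, 1]
    structure: float  # structural validity, in [0, 1]
    semantic: float   # prompt-to-graphic alignment, in [0, 1]
    visual: float     # visual quality, in [0, 1]


# Hypothetical weights, chosen only so that they sum to 1.
WEIGHTS = {"reasoning": 0.2, "structure": 0.2, "semantic": 0.3, "visual": 0.3}


def hybrid_reward(parts, weights=WEIGHTS, gate_on_structure=True):
    """Weighted sum of the four components.

    With `gate_on_structure` (an assumed safeguard), an output whose SVG
    fails validation earns nothing, so a polished rationale alone cannot
    accumulate reward.
    """
    if gate_on_structure and parts.structure == 0.0:
        return 0.0
    return sum(weights[name] * getattr(parts, name) for name in weights)


# A fluent rationale attached to an SVG that does not parse.
hacked = RewardParts(reasoning=1.0, structure=0.0, semantic=0.0, visual=0.0)
ungated = hybrid_reward(hacked, gate_on_structure=False)
gated = hybrid_reward(hacked)
```

Without the gate, the broken output still collects the full reasoning share of the reward, which is the kind of loophole the premise assumes away. Whether the paper uses gating, different weights, or some other mechanism is exactly what the referee report asks the authors to document.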

What would settle it

If a model trained with the full DwT-plus-RL pipeline produces SVGs whose rendered images match input prompts no better than a standard supervised baseline, as measured by human ratings or automated structural checks on a held-out prompt set, the benefit of the added reasoning stage would be refuted.
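
The comparison described above amounts to a paired test on per-prompt scores. The sketch below uses a one-sided sign-flip permutation test, which is one reasonable choice among several and is not taken from the paper. The function name, the resample count, and the ratings are made up for illustration.

```python
import random


def paired_permutation_pvalue(pipeline, baseline, n_resamples=10_000, seed=0):
    """One-sided paired sign-flip test.

    Estimates how often a mean per-prompt difference at least as large as
    the observed one would arise if the two systems were interchangeable.
    """
    if len(pipeline) != len(baseline) or not pipeline:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [p - b for p, b in zip(pipeline, baseline)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Randomly swap which system each prompt's score is credited to.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped / len(diffs) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up ratings for eight held-out prompts, for illustration only.
pipeline_scores = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 4.4, 3.7]
baseline_scores = [3.6, 3.9, 4.0, 3.5, 3.8, 3.9, 4.1, 3.2]
p_value = paired_permutation_pvalue(pipeline_scores, baseline_scores)
```

A large p-value on a sufficiently sized held-out set would be evidence for the refuting outcome described above; a small one would only show that the full pipeline beats the baseline, not which stage is responsible, so ablations would still be needed.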

Figures

Figures reproduced from arXiv: 2505.24499 by Dong Xu, Jing Zhang, Qian Yu, Ximing Xing, Yandong Guan, Ziteng Xue.

Figure 1: Overview of Reason-SVG. Reason-SVG incorporates structured reasoning through the Drawing-with-Thought (DwT) paradigm, enabling LLMs to synthesize SVGs guided by explicit visual planning and compositional logic. (a) DwT Reasoning Process: An example of the Drawing-with-Thought reasoning process, illustrating structured design decisions across stages such as conceptual design, preliminary design, and detaile…

Figure 2: Framework of Reason-SVG. The “Drawing-with-Thought” (DwT, Sec. 4.1) module guides the LLM through a step-by-step visual reasoning process to generate both the SVG code (O) and its corresponding design rationale (C). This process comprises the following stages: a) concept sketching, b) canvas planning, c) shape decomposition, d) coordinate calculation, e) styling and coloring, and f) final assembly. These r…

Figure 3: Qualitative results of Reason-SVG. For science diagrams, the model follows the instruction “drawing an SVG-format diagram following prompt” to generate structured plots and analytic charts. Across diverse SVG categories, including Science Diagram, UI/UX, and Complex Scene, Reason-SVG exhibits strong visual reasoning and structural understanding. The proposed DwT reasoning further enables more coherent layout…
Original abstract

Generating high-quality Scalable Vector Graphics (SVGs) is challenging for Large Language Models (LLMs), as it requires advanced reasoning for structural validity, semantic accuracy, and visual coherence -- areas where current LLMs often struggle. In this work, we introduce Reason-SVG, a novel framework equipped with enhanced structured reasoning for SVG generation. Reason-SVG pioneers the "Drawing-with-Thought" (DwT) paradigm, in which models generate both SVG code and explicit design rationales. Reason-SVG follows a two-stage training strategy: First, Supervised Fine-Tuning (SFT) trains the LLM on the DwT paradigm to develop foundational reasoning abilities. Second, Reinforcement Learning (RL), utilizing Group Relative Policy Optimization (GRPO), empowers the model to generate both DwT and SVG rationales through refined, reward-driven reasoning. To enable reasoning-driven SVG generation, we design a Hybrid Reward function that evaluates the presence and effectiveness of DwT reasoning, along with structural validity, semantic alignment, and visual quality. We also introduce the SVGX-DwT-10k dataset, a high-quality corpus of 10k SVG-DwT pairs, where each SVG code is generated based on explicit DwT reasoning. By integrating DwT, SFT, and Hybrid Reward-guided RL, Reason-SVG significantly improves the performance of LLMs and VLMs in generating accurate and visually coherent SVGs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Reason-SVG, a framework for SVG generation that augments LLMs and VLMs with structured reasoning via the new 'Drawing-with-Thought' (DwT) paradigm. Models are trained to output both SVG code and explicit design rationales. Training proceeds in two stages: supervised fine-tuning (SFT) on the introduced SVGX-DwT-10k dataset of 10k SVG-DwT pairs, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) driven by a hybrid reward that scores DwT reasoning presence/effectiveness together with structural validity, semantic alignment, and visual quality. The central claim is that this pipeline yields significant gains in accurate and visually coherent SVG outputs.

Significance. If the empirical results hold with proper controls, the DwT paradigm and the accompanying dataset constitute a useful contribution toward interpretable, reasoning-augmented generation of structured graphics. The two-stage SFT-then-GRPO recipe is a standard template but is applied here to a new domain with a composite reward; credit is due for releasing the SVGX-DwT-10k corpus. The work sits at the intersection of vision-language models and controllable vector graphics, an area of growing practical interest.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Hybrid Reward): the central performance claim rests on the hybrid reward correctly measuring and incentivizing genuine DwT reasoning rather than superficial rationales. No equations, weighting coefficients, or validation protocol for the reward components are supplied in the abstract or high-level description, leaving open the possibility of reward hacking that inflates metrics without improving SVG coherence.
  2. [§4] §4 (Experiments): the abstract asserts 'significant improvements' for both LLMs and VLMs yet supplies no quantitative results, baseline comparisons, or details on how visual quality is scored. Without these numbers and controls the load-bearing claim that DwT + SFT + GRPO is responsible for the gains cannot be evaluated.
minor comments (2)
  1. Define GRPO and all other acronyms on first use; clarify whether the visual-quality term in the hybrid reward is computed by an automated metric, an LLM judge, or human raters.
  2. Figure captions and dataset statistics should explicitly state the split sizes, diversity of SVG categories, and any filtering criteria applied to the 10k SVG-DwT pairs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to improve clarity and transparency as indicated.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Hybrid Reward): the central performance claim rests on the hybrid reward correctly measuring and incentivizing genuine DwT reasoning rather than superficial rationales. No equations, weighting coefficients, or validation protocol for the reward components are supplied in the abstract or high-level description, leaving open the possibility of reward hacking that inflates metrics without improving SVG coherence.

    Authors: We appreciate the referee's emphasis on reward transparency. While Section 3 describes the four components of the Hybrid Reward (DwT reasoning effectiveness, structural validity, semantic alignment, and visual quality), we acknowledge that explicit equations, specific weighting coefficients, and a validation protocol are not presented at a high level. To address concerns about potential reward hacking, we will revise Section 3 to include the mathematical formulations for each component, the weighting scheme used, and a description of how the reward was validated to prioritize substantive reasoning over superficial outputs. A concise summary of the reward formulation will also be added to the abstract. revision: yes

  2. Referee: §4 (Experiments): the abstract asserts 'significant improvements' for both LLMs and VLMs yet supplies no quantitative results, baseline comparisons, or details on how visual quality is scored. Without these numbers and controls the load-bearing claim that DwT + SFT + GRPO is responsible for the gains cannot be evaluated.

    Authors: We agree that the abstract would be strengthened by including key quantitative evidence. The experiments in §4 report results across LLMs and VLMs with baseline comparisons, using metrics for structural validity, semantic alignment, and visual quality (the latter via automated metrics combined with human evaluation protocols detailed in the section). To make the central claims more evaluable from the abstract alone, we will revise the abstract to incorporate specific performance deltas and baseline references while maintaining conciseness. This revision will better substantiate the contributions of the DwT paradigm, SFT, and GRPO stages. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper presents an empirical training pipeline (DwT paradigm + SFT + GRPO RL with a newly designed hybrid reward) applied to a newly introduced dataset (SVGX-DwT-10k). No mathematical derivations, fitted parameters renamed as predictions, or self-citations are used to justify the central performance claims. All load-bearing elements are introduced as novel components whose effectiveness is asserted via training outcomes rather than reducing to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on standard LLM fine-tuning assumptions plus the new DwT format and hybrid reward design. No free parameters are explicitly fitted in the abstract description. The main invented element is the DwT reasoning format itself.

axioms (1)
  • domain assumption: LLMs can be effectively aligned to produce both reasoning text and code via SFT followed by RL with a composite reward.
    Invoked in the two-stage training strategy section of the abstract.
invented entities (2)
  • Drawing-with-Thought (DwT) paradigm (no independent evidence)
    purpose: To force explicit design rationales before SVG code generation
    New format introduced to improve structural validity and semantic accuracy.
  • Hybrid Reward function (no independent evidence)
    purpose: To evaluate DwT presence, structural validity, semantic alignment, and visual quality
    Composite reward designed specifically for this task.

pith-pipeline@v0.9.0 · 5795 in / 1353 out tokens · 22534 ms · 2026-05-19T12:56:33.516913+00:00 · methodology



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  2. Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Render-in-the-Loop reformulates SVG generation as a step-wise visual-context-aware process using self-feedback from rendered intermediate states, VSF training, and RaV inference to outperform baselines on MMSVGBench f...

  3. AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

    cs.CV 2026-04 unverdicted novelty 7.0

    AmodalSVG produces semantically separate and geometrically complete SVG layers from natural images by using VLM-guided semantic layer peeling for amodal completion followed by adaptive vectorization.

  4. Structural Evaluation Metrics for SVG Generation via Leave-One-Out Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    Element-level leave-one-out analysis yields per-element quality scores and four structural metrics (purity, coverage, compactness, locality) that quantify SVG modularity and enable artifact detection.

  5. Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling

    cs.LG 2026-04 unverdicted novelty 7.0

    HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 5 Pith papers · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 3, 6, 7, 9

  2. [2]

    Claude 3.5 sonnet

    Anthropic. Claude 3.5 sonnet. https : / / www . anthropic . com / news / claude - 3 - 5 - sonnet, 2024

  3. [3]

    Claude 3.7 sonnet and claude code

    Anthropic. Claude 3.7 sonnet and claude code. https: / / www . anthropic . com / news / claude - 3 - 7 - sonnet, 2025. 3, 6, 7, 9

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 5, 6, 7, 9, 12

  5. [5]

    Deepsvg: A hierarchical generative network for vector graphics animation.Advances in Neural Informa- tion Processing Systems (NeurIPS), 33:16351–16361, 2020

    Alexandre Carlier, Martin Danelljan, Alexandre Alahi, and Radu Timofte. Deepsvg: A hierarchical generative network for vector graphics animation.Advances in Neural Informa- tion Processing Systems (NeurIPS), 33:16351–16361, 2020. 3

  6. [6]

    CairoSVG: A Simple SVG Con- verter based on Cairo

    CourtBouillon. CairoSVG: A Simple SVG Con- verter based on Cairo. https : / / cairosvg . org / documentation/, 2024. Version 2.7.1 or later. Accessed: 2025-05-14. 5, 9, 10, 11

  7. [7]

    Draw with thought: Unleashing multimodal reasoning for scientific diagram generation.arXiv preprint arXiv:2504.09479, 2025

    Zhiqing Cui, Jiahao Yuan, Hanqing Wang, Yanshu Li, Chenxu Du, and Zhenglong Ding. Draw with thought: Unleashing multimodal reasoning for scientific diagram generation.arXiv preprint arXiv:2504.09479, 2025. 3

  8. [8]

    Gemini 2.5 pro - best for coding and complex prompts

    DeepMind. Gemini 2.5 pro - best for coding and complex prompts. https : / / deepmind . google / technologies/gemini/pro/, 2024. 6, 9, 11

  9. [9]

    Shuguang Dou, Xinyang Jiang, Lu Liu, Lu Ying, Caihua Shan, Yifei Shen, Xuanyi Dong, Yun Wang, Dongsheng Li, and Cairong Zhao. Hierarchically recognizing vector graphics and a new chart-based vector graphics dataset.IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 46 (12):7556–7573, 2024. 3

  10. [10]

    CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders

    Kevin Frans, Lisa Soros, and Olaf Witkowski. CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. 1, 3

  11. [11]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2, 3, 4, 6, 7, 9

  12. [12]

    A neural representation of sketch drawings

    David Ha and Douglas Eck. A neural representation of sketch drawings. InInternational Conference on Learning Represen- tations (ICLR), 2018. 3

  13. [13]

    Vec- torpainter: Advanced stylized vector graphics synthesis using stroke-style priors

    Juncheng Hu, Ximing Xing, Jing Zhang, and Qian Yu. Vec- torpainter: Advanced stylized vector graphics synthesis using stroke-style priors. InIEEE International Conference on Multimedia and Expo (ICME). IEEE, 2025. 1, 3

  14. [14]

    Supersvg: Superpixel-based scalable vector graphics synthesis

    Teng Hu, Ran Yi, Baihong Qian, Jiangning Zhang, Paul L Rosin, and Yu-Kun Lai. Supersvg: Superpixel-based scalable vector graphics synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24892–24901, 2024. 3

  15. [15]

    Word-as-image for semantic typography.ACM Transactions on Graphics (TOG), 42(4), 2023

    Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography.ACM Transactions on Graphics (TOG), 42(4), 2023

  16. [16]

    Vectorfusion: Text- to-svg by abstracting pixel-based diffusion models

    Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text- to-svg by abstracting pixel-based diffusion models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 1, 3, 6, 7, 9

  17. [17]

    Recognizing vector graphics without rasterization

    Xinyang Jiang, Lu Liu, Caihua Shan, Yifei Shen, Xuanyi Dong, and Dongsheng Li. Recognizing vector graphics without rasterization. InProceedings of the 35th Interna- tional Conference on Neural Information Processing Systems (NeurIPS), Red Hook, NY , USA, 2021. Curran Associates Inc. 3

  18. [18]

    Unisvg: A unified dataset for vector graphic understanding and genera- tion with multimodal large language models

    Jinke Li, Jiarui Yu, Chenxing Wei, Hande Dong, Qiang Lin, Liangjing Yang, Zhicai Wang, and Yanbin Hao. Unisvg: A unified dataset for vector graphic understanding and genera- tion with multimodal large language models. InProceedings 13 of the 33rd ACM International Conference on Multimedia, pages 13156–13163, 2025. 1, 3

  19. [19]

    Starcoder: may the source be with you! Transactions on Machine Learning Research (TMLR), 2023

    Raymond Li, Loubna Ben allal, Yangtian Zi, Niklas Muen- nighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia LI, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier De- haene, Joel Lamy-Poirier, Joao Monteiro, Nicolas Gontier, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Ben Lip- kin, Muhtasham Oblokul...

  20. [20]

    Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020

    Tzu-Mao Li, Michal Lukáˇc, Gharbi Michaël, and Jonathan Ragan-Kelley. Differentiable vector graphics rasterization for editing and learning.ACM Transactions on Graphics (TOG), 39(6):193:1–193:15, 2020. 1, 3

  21. [21]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 3

  22. [22]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InThirty-seventh Conference on Neural Information Processing Systems (NeurIP), 2023. 3

  23. [23]

    A learned representation for scalable vec- tor graphics

    Raphael Gontijo Lopes, David Ha, Douglas Eck, and Jonathon Shlens. A learned representation for scalable vec- tor graphics. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019. 1, 3

  24. [24]

    Towards layer-wise image vectorization

    Xu Ma, Yuqian Zhou, Xingqian Xu, Bin Sun, Valerii Filev, Nikita Orlov, Yun Fu, and Humphrey Shi. Towards layer-wise image vectorization. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 16314–16323, 2022. 3

  25. [25]

    Chart4blind: An intelligent interface for chart accessibility conversion

    Omar Moured, Morris Baumgarten-Egemole, Karin Müller, Alina Roitberg, Thorsten Schwarz, and Rainer Stiefelhagen. Chart4blind: An intelligent interface for chart accessibility conversion. InProceedings of the 29th International Con- ference on Intelligent User Interfaces, pages 504–514, 2024. 1

  26. [26]

    Svgeditbench: A bench- mark dataset for quantitative assessment of llm’s svg editing capabilities

    Kunato Nishina and Yusuke Matsui. Svgeditbench: A bench- mark dataset for quantitative assessment of llm’s svg editing capabilities. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8142–8147,

  27. [27]

    Introducing openai o3 and o4-mini

    OpenAI. Introducing openai o3 and o4-mini. https:// openai.com/index/introducing- o3- and- o4- mini/, 2025. 3, 6

  28. [28]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patrick...

  29. [29]

    Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems (NeurIP), 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Car- roll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Ad- vances in neural information processing systems (NeurIP), 35:27730–27744, 2022. 3

  30. [30]

    Neuralsvg: An implicit repre- sentation for text-to-vector generation.arXiv preprint arXiv:2501.03992, 2025

    Sagi Polaczek, Yuval Alaluf, Elad Richardson, Yael Vinker, and Daniel Cohen-Or. Neuralsvg: An implicit repre- sentation for text-to-vector generation.arXiv preprint arXiv:2501.03992, 2025. 3

  31. [31]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021. 1, 3, 5, 9, 10

  32. [32]

    Im2vec: Synthesizing vector graphics without vector supervision

    Pradyumna Reddy, Michael Gharbi, Michal Lukac, and Niloy J Mitra. Im2vec: Synthesizing vector graphics without vector supervision. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), pages 7342–7351, 2021. 3

  33. [33]

    Starvector: Generating scalable vector graphics code from images

    Juan A Rodriguez, Shubham Agarwal, Issam H Laradji, Pau Rodriguez, David Vazquez, Christopher Pal, and Marco Ped- ersoli. Starvector: Generating scalable vector graphics code from images. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 3, 6, 7, 9

  34. [34]

    Juan A. Rodriguez, Haotian Zhang, Abhay Puri, Rishav Pramanik, Aarash Feizi, Pascal Wichmann, Arnab Kumar Mondal, Mohammad Reza Samsami, Rabiul Awal, Perouz Taslakian, Spandana Gella, Sai Rajeswar, David Vazquez, Christopher Pal, and Marco Pedersoli. Rendering-aware re- inforcement learning for vector graphics generation. InThe Thirty-ninth Annual Confere...

  35. [35]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Rad- ford, and Oleg Klimov. Proximal policy optimization algo- rithms.arXiv preprint arXiv:1707.06347, 2017. 3, 4

  36. [36]

    Clipgen: A deep generative model for clipart vectorization and synthesis.IEEE Trans- actions on Visualization and Computer Graphics (TOG), 28 (12):4211–4224, 2022

    I-Chao Shen and Bing-Yu Chen. Clipgen: A deep generative model for clipart vectorization and synthesis.IEEE Trans- actions on Visualization and Computer Graphics (TOG), 28 (12):4211–4224, 2022. 3

  37. [37]

    Clipvg: Text-guided image manipulation using differentiable vector graphics

    Yiren Song, Xuning Shao, Kang Chen, Weidong Zhang, Zhongliang Jing, and Minzhe Li. Clipvg: Text-guided image manipulation using differentiable vector graphics. InProceed- ings of the Conference on Artificial Intelligence (AAAI), 2023. 3

  38. [38]

    Reason-rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason- 14 rft: Reinforcement fine-tuning for visual reasoning.arXiv preprint arXiv:2503.20752, 2025. 3

  39. [39]

    Strokenuwa: tokeniz- ing strokes for vector graphic synthesis

    Zecheng Tang, Chenfei Wu, Zekai Zhang, Minheng Ni, Shengming Yin, Yu Liu, Zhengyuan Yang, Lijuan Wang, Zicheng Liu, Juntao Li, and Nan Duan. Strokenuwa: tokeniz- ing strokes for vector graphic synthesis. InProceedings of the 41st International Conference on Machine Learning (ICML). JMLR.org, 2024. 1, 3

  40. [40]

    Vecfusion: Vector font gen- eration with diffusion

    Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michaël Gharbi, Oliver Wang, Alec Jacob- son, and Evangelos Kalogerakis. Vecfusion: Vector font gen- eration with diffusion. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), pages 7943–7952, 2024. 1

  41. [41]

    Nivel: Neural implicit vector layers for text-to-vector generation

    Vikas Thamizharasan, Difan Liu, Matthew Fisher, Nanxuan Zhao, Evangelos Kalogerakis, and Michal Lukac. Nivel: Neural implicit vector layers for text-to-vector generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4589–4597,

  42. [42]

    Clipasso: Semantically-aware ob- ject sketching.ACM Transactions on Graphics (TOG), 41(4): 1–11, 2022

    Yael Vinker, Ehsan Pajouheshgar, Jessica Y Bo, Roman Chris- tian Bachmann, Amit Haim Bermano, Daniel Cohen-Or, Amir Zamir, and Ariel Shamir. Clipasso: Semantically-aware ob- ject sketching.ACM Transactions on Graphics (TOG), 41(4): 1–11, 2022. 3

  43. [43]

    Clipascene: Scene sketching with different types and levels of abstraction

    Yael Vinker, Yuval Alaluf, Daniel Cohen-Or, and Ariel Shamir. Clipascene: Scene sketching with different types and levels of abstraction. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4146–4156, 2023. 3

  44. [44]

    Svgen: Interpretable vector graphics generation with large language models

    Feiyu Wang, Zhiyuan Zhao, Yuandong Liu, Da Zhang, Junyu Gao, Hao Sun, and Xuelong Li. Svgen: Interpretable vector graphics generation with large language models. InProceed- ings of the 33rd ACM International Conference on Multime- dia, page 9608–9617, 2025. 1, 3

  45. [45]

    Internsvg: Towards unified svg tasks with multimodal large language models.arXiv preprint arXiv:2510.11341, 2025

    Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, et al. Internsvg: Towards unified svg tasks with multimodal large language models.arXiv preprint arXiv:2510.11341, 2025. 1

  46. [46]

    Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning.ACM Transactions on Graphics (TOG), 40(6), 2021

    Yizhi Wang and Zhouhui Lian. Deepvecfont: Synthesizing high-quality vector fonts via dual-modality learning.ACM Transactions on Graphics (TOG), 40(6), 2021. 1, 3

  47. [47]

    Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions.arXiv preprint arXiv:2212.10560, 2022. 3

  48. [48]

    Unified multimodal chain-of-thought reward model through reinforcement fine-tuning.arXiv:2505.03318, 2025

    Yibin Wang, Zhimin Li, Yuhang Zang, Chunyu Wang, Qinglin Lu, Cheng Jin, and Jiaqi Wang. Unified multimodal chain- of-thought reward model through reinforcement fine-tuning. arXiv preprint arXiv:2505.03318, 2025. 3

  49. [49]

    Visually descriptive language model for vector graphics reasoning. Transactions on Machine Learning Research

    Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, and Heng Ji. Visually descriptive language model for vector graphics reasoning. Transactions on Machine Learning Research. 3

  50. [50]

    Iconshop: Text-guided vector icon synthesis with autoregressive transformers. ACM Trans. Graph., 42(6), 2023

    Ronghuan Wu, Wanchao Su, Kede Ma, and Jing Liao. Iconshop: Text-guided vector icon synthesis with autoregressive transformers. ACM Trans. Graph., 42(6), 2023. 1, 3

  51. [51]

    Chat2svg: Vector graphics generation with large language models and image diffusion models

    Ronghuan Wu, Wanchao Su, and Jing Liao. Chat2svg: Vector graphics generation with large language models and image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),

  52. [52]

    Human preference score: Better aligning text-to-image models with human preference

    Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2096–2105, 2023. 5, 9, 10

  53. [53]

    DiffSketcher: Text guided vector sketch synthesis through latent diffusion models

    Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. DiffSketcher: Text guided vector sketch synthesis through latent diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), 2023. 1, 3, 6, 7, 9

  54. [54]

    SVGFusion: A VAE-Diffusion Transformer for Vector Graphic Generation

    Ximing Xing, Juncheng Hu, Jing Zhang, Dong Xu, and Qian Yu. Svgfusion: Scalable text-to-svg generation via vector space diffusion. arXiv preprint arXiv:2412.10437, 2024. 3

  55. [55]

    SVGDreamer: Text guided svg generation with diffusion model

    Ximing Xing, Haitao Zhou, Chuang Wang, Jing Zhang, Dong Xu, and Qian Yu. SVGDreamer: Text guided svg generation with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4546–4555, 2024. 1, 3, 6, 7, 9

  56. [56]

    Empowering llms to understand and generate complex vector graphics

    Ximing Xing, Juncheng Hu, Guotao Liang, Jing Zhang, Dong Xu, and Qian Yu. Empowering llms to understand and generate complex vector graphics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 1, 3, 5, 6, 7, 9

  57. [57]

    SVGDreamer++: Advancing editability and diversity in text-guided svg generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), pages 1–18, 2025

    Ximing Xing, Qian Yu, Chuang Wang, Haitao Zhou, Jing Zhang, and Dong Xu. SVGDreamer++: Advancing editability and diversity in text-guided svg generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), pages 1–18, 2025. 3

  58. [58]

    Exploring the capability of llms in performing low-level visual analytic tasks on svg data visualizations

    Zhongzheng Xu and Emily Wall. Exploring the capability of llms in performing low-level visual analytic tasks on svg data visualizations. In 2024 IEEE Visualization and Visual Analytics (VIS), pages 126–130. IEEE, 2024. 1

  59. [59]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. 3

  60. [60]

    Omnisvg: A unified scalable vector graphics generation model

    Yiying Yang, Wei Cheng, Sijin Chen, Xianfang Zeng, Fukun Yin, Jiaxu Zhang, Liao Wang, Gang Yu, Xingjun Ma, and Yu-Gang Jiang. Omnisvg: A unified scalable vector graphics generation model. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. 1, 3

  61. [61]

    Text-guided vector graphics customization

    Peiying Zhang, Nanxuan Zhao, and Jing Liao. Text-guided vector graphics customization. In SIGGRAPH Asia 2023 Conference Papers, New York, NY, USA, 2023. Association for Computing Machinery. 3

  62. [62]

    Text-to-vector generation with neural path representation. ACM Transactions on Graphics (TOG), 43(4):1–13, 2024

    Peiying Zhang, Nanxuan Zhao, and Jing Liao. Text-to-vector generation with neural path representation. ACM Transactions on Graphics (TOG), 43(4):1–13, 2024. 1, 3

  63. [63]

    Beyond pixels: Exploring human-readable svg generation for simple images with vision language models

    Tong Zhang, Haoyang Liu, Peiyan Zhang, Yuxuan Cheng, and Haohan Wang. Beyond pixels: Exploring human-readable svg generation for simple images with vision language models. arXiv preprint arXiv:2311.15543, 2023. 3

  64. [64]

    R1-reward: Training multimodal reward model through stable reinforcement learning. arXiv preprint arXiv:2505.02835, 2025

    Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, et al. R1-reward: Training multimodal reward model through stable reinforcement learning. arXiv preprint arXiv:2505.02835, 2025. 3

  65. [65]

    VGBench: Evaluating large language models on vector graphics understanding and generation

    Bocheng Zou, Mu Cai, Jianrui Zhang, and Yong Jae Lee. VGBench: Evaluating large language models on vector graphics understanding and generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3647–3659, Miami, Florida, USA, Association for Computational Linguistics. 3