pith. machine review for the scientific record.

arxiv: 2604.20730 · v2 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

Dong Xu, Guotao Liang, Haitao Zhou, Jing Zhang, Juncheng Hu, Qian Yu, Zhangcheng Wang, Ziteng Xue

Pith reviewed 2026-05-10 01:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords SVG generation · vector graphics · multimodal LLMs · visual feedback · render-in-the-loop · text-to-SVG · image-to-SVG

The pith

Multimodal models generate better SVGs by observing their own intermediate renderings during code synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that treating SVG generation as blind text sequence modeling wastes the visual capabilities of multimodal models. Instead, rendering partial drawings at each step lets the model see and reason about evolving canvas states and occlusions that are hard to track in code alone. They show this by breaking paths into fine steps, training the model to condition new primitives on the rendered visuals, and adding a verify step at inference. The result is higher-quality outputs on standard benchmarks for both text and image inputs. A reader would care if this means vision priors can be directly harnessed for more accurate and efficient code-based image creation.
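Read as pseudocode, the loop described above is short. A minimal sketch, with hypothetical stand-ins throughout: propose_next_primitive plays the fine-tuned multimodal model, render any SVG rasterizer (cairosvg would do), and accept the paper's Render-and-Verify check; none of these names come from the paper.

```python
# Hedged sketch of render-in-the-loop generation; every callable here
# is a hypothetical stand-in, not the paper's actual interface.

def generate_svg(prompt, propose_next_primitive, render, accept, max_steps=64):
    """Grow an SVG one primitive at a time, letting the model see its
    own cumulative canvas before each step."""
    primitives = []              # accepted SVG primitive strings so far
    canvas = render(primitives)  # blank canvas at step 0
    for _ in range(max_steps):
        # Condition on the prompt, the code history, AND the rendered state.
        candidate = propose_next_primitive(prompt, primitives, canvas)
        if candidate is None:    # model signals end of drawing
            break
        new_canvas = render(primitives + [candidate])
        # Render-and-Verify: keep the primitive only if it usefully
        # changes the canvas (filters degenerate or redundant paths).
        if accept(canvas, new_canvas):
            primitives.append(candidate)
            canvas = new_canvas
    return primitives
```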

Core claim

SVG synthesis is reformulated as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step and leverages on-the-fly feedback to guide subsequent generation. Fine-grained path decomposition constructs dense multi-step visual trajectories, Visual Self-Feedback training conditions next-primitive generation on intermediate visual states, and Render-and-Verify inference filters degenerate primitives. This approach outperforms strong open-weight baselines on MMSVGBench for both Text-to-SVG and Image-to-SVG tasks.

What carries the argument

The Render-in-the-Loop process that renders cumulative canvases from partial SVG code and feeds them back as visual context to condition the next generation step, enabled by Visual Self-Feedback training.
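How such dense trajectories could be assembled is easy to sketch, under the assumption that a ground-truth SVG has already been decomposed into an ordered list of primitive strings; the example schema and the render_prefix rasterizer are illustrative, not the paper's.

```python
# Hedged sketch: build Visual Self-Feedback training pairs via
# fine-grained path decomposition. `paths` is the ordered primitive
# list for one SVG; `render_prefix` rasterizes a code prefix into a
# cumulative canvas.

def build_vsf_trajectory(prompt, paths, render_prefix):
    examples = []
    for k in range(len(paths)):
        examples.append({
            "prompt": prompt,
            "code_so_far": paths[:k],            # textual history
            "canvas": render_prefix(paths[:k]),  # intermediate visual state
            "target": paths[k],                  # next primitive to predict
        })
    return examples
```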

Load-bearing premise

That off-the-shelf multimodal models cannot use incremental visual-code mappings without the specific fine-grained path decomposition and Visual Self-Feedback training strategy.

What would settle it

Measure SVG generation quality on occlusion-heavy test cases when the same model is run once with the visual self-feedback loop enabled and once without it.
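Sketched concretely, assuming a hypothetical use_visual_feedback switch on the generator (no released interface is described here) and any image-similarity metric as score:

```python
# Hedged sketch of the settling experiment: identical model and prompts,
# visual loop toggled on and off. All names are hypothetical stand-ins.

def feedback_gain(occlusion_cases, generate, render, score):
    """occlusion_cases: list of (prompt, target_image) pairs."""
    deltas = []
    for prompt, target in occlusion_cases:
        closed_loop = render(generate(prompt, use_visual_feedback=True))
        open_loop = render(generate(prompt, use_visual_feedback=False))
        deltas.append(score(closed_loop, target) - score(open_loop, target))
    # A consistently positive mean delta would credit the visual loop
    # itself rather than, say, denser training trajectories.
    return sum(deltas) / len(deltas)
```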

Figures

Figures reproduced from arXiv: 2604.20730 by Dong Xu, Guotao Liang, Haitao Zhou, Jing Zhang, Juncheng Hu, Qian Yu, Zhangcheng Wang, Ziteng Xue.

Figure 1
Figure 1. Render-in-the-Loop Generation. (a) Our Visual Self-Feedback explicitly renders intermediate code into a canvas, feeding it back to provide continuous visual guidance. (b) Traditional open-loop approaches draw “blindly” using only textual history, often struggling with geometric accuracy and visual quality. By closing the loop, our method ensures structurally coherent and high-quality SVG synthesis.
Figure 2
Figure 2. Illustration of Fine-grained Path Decomposition.
Figure 3
Figure 3. Qualitative comparison on MMSVG benchmarks. (a) In Text-to-SVG, our method generates clear semantic details (e.g., facial expressions, uniform details) where baselines like OmniSVG often fail or produce noise. (b) In Image-to-SVG, our model effectively reconstructs internal structures (e.g., flashlight lens, briefcase patterns), demonstrating superior fidelity to the input raster image.
Figure 4
Figure 4. Instruction-following examples. Our model accurately applies specific attributes (e.g., colors and numbers) to their corresponding objects based on complex prompts.
Figure 5
Figure 5. A gallery of diverse vector graphics generated by our Render-in-the-Loop model. Our approach consistently produces aesthetically pleasing and semantically coherent SVGs across a wide variety of subjects, maintaining topologically clean path structures.
Figure 6
Figure 6. Qualitative ablation study. w/o VSF fails to capture semantic details (e.g., missing bar chart, hallucinating an eye for an egg). w/o RaV suffers from degenerate repetition (e.g., re-drawing the same leaf), leading to incomplete shapes. Our full model generates correct and complete graphics.
read the original abstract

Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Render-in-the-Loop, a paradigm that reformulates SVG generation with MLLMs as a step-wise visual-context-aware process. Intermediate code is rendered to a cumulative canvas to provide on-the-fly visual feedback, addressing limitations of open-loop 'blind drawing' in reasoning about partial states and occlusions. The approach uses fine-grained path decomposition to create dense multi-step trajectories, a Visual Self-Feedback (VSF) training strategy to condition next-primitive generation on rendered states, and a Render-and-Verify (RaV) inference filter for degenerate primitives. Instantiated on a multimodal foundation model, the framework is claimed to outperform strong open-weight baselines on MMSVGBench for both Text-to-SVG and Image-to-SVG, with emphasis on data efficiency and generalization.

Significance. If the empirical claims hold after proper isolation of components, the work could meaningfully advance vector-graphics synthesis by better exploiting visual priors already present in MLLM encoders rather than treating generation as pure sequence modeling. The data-efficiency angle is particularly relevant for domains where SVG annotations are scarce. However, the absence of reported metrics, baselines, or ablations in the abstract, combined with the untested assumption that gains arise from visual reasoning rather than trajectory density, limits the immediate assessable impact.

major comments (3)
  1. Abstract: the claim that the framework 'outperforms strong open-weight baselines on the standard MMSVGBench' and highlights 'remarkable data efficiency' supplies no quantitative metrics, baseline names, error bars, or statistical tests, preventing verification of the central empirical result.
  2. Method section (Visual Self-Feedback training and path decomposition): the paper states that naive visual-loop application is suboptimal due to off-the-shelf MLLMs' inability to leverage incremental visual-code mappings, yet no ablation is described that retains fine-grained decomposition and the training objective while masking or ablating the visual feedback channel; without this isolation it remains possible that reported gains derive primarily from denser trajectories rather than learned visual-state conditioning (a concrete sketch of this control follows the list).
  3. Experiments (MMSVGBench results): the abstract asserts generalization across Text-to-SVG and Image-to-SVG tasks, but without reported per-task breakdowns, ablation tables, or comparisons against decomposition-only variants, the load-bearing claim that VSF specifically enables occlusion/partial-state reasoning cannot be evaluated.
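The control requested in major comment 2 is easy to state in code. A minimal sketch, assuming (hypothetically) that each training example carries its rendered canvas as a NumPy array field: the masked condition substitutes a constant blank image, holding trajectory density and the training objective fixed while removing only the visual channel.

```python
# Hedged sketch of the masking ablation from major comment 2: same
# data and loss, visual channel optionally blanked. Assumes canvases
# are NumPy arrays; the example schema is illustrative.
import numpy as np

def mask_visual_channel(example, blank_value=255):
    """Return a copy of a training example whose rendered canvas is
    replaced by a constant blank image of the same shape."""
    masked = dict(example)
    masked["canvas"] = np.full_like(example["canvas"], blank_value)
    return masked

# Training once on the raw examples and once on
# [mask_visual_channel(e) for e in examples] isolates visual-state
# conditioning from trajectory density.
```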
minor comments (2)
  1. Abstract: the phrase 'Render-in-the-Loop paradigm' is introduced without a concise one-sentence definition before the detailed description, which would aid readability.
  2. Notation: 'VSF' and 'RaV' are used without an initial expansion in the abstract, although they are later defined in the body.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have reviewed the major comments carefully and will revise the manuscript to address the concerns about the abstract, experimental isolation, and result breakdowns. Our point-by-point responses follow.

read point-by-point responses
  1. Referee: Abstract: the claim that the framework 'outperforms strong open-weight baselines on the standard MMSVGBench' and highlights 'remarkable data efficiency' supplies no quantitative metrics, baseline names, error bars, or statistical tests, preventing verification of the central empirical result.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will expand the abstract to report the key performance numbers on MMSVGBench (for both Text-to-SVG and Image-to-SVG), name the strong open-weight baselines, and include any available error bars or statistical details. revision: yes

  2. Referee: Method section (Visual Self-Feedback training and path decomposition): the paper states that naive visual-loop application is suboptimal due to off-the-shelf MLLMs' inability to leverage incremental visual-code mappings, yet no ablation is described that retains fine-grained decomposition and the training objective while masking or ablating the visual feedback channel; without this isolation it remains possible that reported gains derive primarily from denser trajectories rather than learned visual-state conditioning.

    Authors: This is a fair criticism. While the manuscript contrasts the proposed approach against open-loop baselines and notes the sub-optimality of naive visual-loop use, it does not contain a controlled ablation that keeps the fine-grained path decomposition and training objective but removes the visual-feedback channel. We will add this ablation to the revised version to isolate the contribution of VSF from trajectory density. revision: yes

  3. Referee: Experiments (MMSVGBench results): the abstract asserts generalization across Text-to-SVG and Image-to-SVG tasks, but without reported per-task breakdowns, ablation tables, or comparisons against decomposition-only variants, the load-bearing claim that VSF specifically enables occlusion/partial-state reasoning cannot be evaluated.

    Authors: We acknowledge the need for greater transparency. The current manuscript presents aggregate MMSVGBench results, but we will add per-task breakdowns, explicit ablation tables that include decomposition-only variants, and additional analysis demonstrating how VSF improves reasoning about occlusions and partial canvas states. revision: yes

Circularity Check

0 steps flagged

No circularity; new training/inference strategy on external MLLMs with benchmark evaluation

full rationale

The paper introduces Render-in-the-Loop as a step-wise visual feedback paradigm, fine-grained path decomposition for trajectories, Visual Self-Feedback (VSF) training, and Render-and-Verify inference. These are presented as additions to off-the-shelf multimodal models rather than derivations from fitted parameters or self-referential definitions. No equations reduce outputs to inputs by construction, and no self-citations form load-bearing uniqueness claims. Performance claims rest on MMSVGBench comparisons, which are external to the method definition. The assertion that naive visual-loop use is suboptimal is stated directly without circular justification or reduction to the proposed components.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MLLM vision encoders contain usable visual priors for spatial reasoning and that rendering intermediates supplies actionable feedback the model can learn to exploit.

axioms (1)
  • domain assumption Multimodal LLMs possess visual priors in their vision encoders that can be leveraged for spatial reasoning during generation.
    Invoked to explain why visual feedback from rendered states improves over blind code synthesis.

pith-pipeline@v0.9.0 · 5626 in / 1265 out tokens · 77149 ms · 2026-05-10T01:09:00.934684+00:00 · methodology

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  2. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  3. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.

Reference graph

Works this paper leans on

63 extracted references · 13 canonical work pages · cited by 2 Pith papers · 9 internal anchors

  1. [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
  2. [2] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., et al.: Qwen3-VL technical report
  3. [3] Carlier, A., Danelljan, M., Alahi, A., Timofte, R.: DeepSVG: A hierarchical generative network for vector graphics animation. Advances in Neural Information Processing Systems (NeurIPS) 33, 16351–16361 (2020)
  4. [4] ChatGPT: GPT-5 (2024), https://chatgpt.com
  5. [5] Chen, H., Zhao, Z., Chen, Y., Liang, Z., Ni, B.: SVGThinker: Instruction-aligned and reasoning-driven text-to-SVG generation. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 11004–11012 (2025)
  6. [6] World Wide Web Consortium: Scalable Vector Graphics (SVG) specification (1999), https://www.w3.org/TR/1999/WD-SVG-19990211/
  7. [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  8. [8] Frans, K., Soros, L., Witkowski, O.: CLIPDraw: Exploring text-to-drawing synthesis through language-image encoders. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
  9. [9] Gal, R., Vinker, Y., Alaluf, Y., Bermano, A., Cohen-Or, D., Shamir, A., Chechik, G.: Breathing life into sketches using text-to-video priors. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4325–4336 (2024)
  10. [10] Geroimenko, V.: Visualizing Information Using SVG and X3D: XML-based Technologies for the XML-based Web. Springer Science & Business Media (2005)
  11. [11] Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
  12. [12] Ha, D., Eck, D.: A neural representation of sketch drawings. In: International Conference on Learning Representations (ICLR) (2018), https://openreview.net/forum?id=Hy6GHpkCW
  13. [13] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
  14. [14] Hirschorn, O., Jevnisek, A., Avidan, S.: Optimize & reduce: A top-down approach for image vectorization. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 2148–2156 (2024)
  15. [15] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 6840–6851 (2020)
  16. [16] Hu, J., Xing, X., Zhang, J., Yu, Q.: VectorPainter: Advanced stylized vector graphics synthesis using stroke-style priors. In: 2025 IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (2025)
  17. [17] Hu, T., Yi, R., Qian, B., Zhang, J., Rosin, P.L., Lai, Y.K.: SuperSVG: Superpixel-based scalable vector graphics synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24892–24901 (2024)
  18. [18] Iluz, S., Vinker, Y., Hertz, A., Berio, D., Cohen-Or, D., Shamir, A.: Word-as-image for semantic typography. ACM Transactions on Graphics (TOG) 42(4), 1–11 (2023)
  19. [19] Jain, A., Xie, A., Abbeel, P.: VectorFusion: Text-to-SVG by abstracting pixel-based diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1911–1920 (2023)
  20. [20] Li, T.M., Lukáč, M., Gharbi, M., Ragan-Kelley, J.: Differentiable vector graphics rasterization for editing and learning. ACM Transactions on Graphics (TOG) 39(6), 1–15 (2020)
  21. [21] Liang, G., Hu, J., Xing, X., Zhang, J., Yu, Q.: Multi-object sketch animation with grouping and motion trajectory priors. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 9237–9246 (2025)
  22. [22] Liu, J., Xin, Z., Fu, Y., Zhao, R., Lan, B., Li, X.: Multi-object sketch animation by scene decomposition and motion planning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11537–11546 (2025)
  23. [23] Lopes, R.G., Ha, D., Eck, D., Shlens, J.: A learned representation for scalable vector graphics. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7930–7939 (2019)
  24. [24] Ma, X., Zhou, Y., Xu, X., Sun, B., Filev, V., Orlov, N., Fu, Y., Shi, H.: Towards layer-wise image vectorization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16314–16323 (2022)
  25. [25] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)
  26. [26] Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion. In: The Eleventh International Conference on Learning Representations (ICLR) (2023)
  27. [27] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML), pp. 8748–8763. PMLR (2021)
  28. [28] Reddy, P., Gharbi, M., Lukac, M., Mitra, N.J.: Im2Vec: Synthesizing vector graphics without vector supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7342–7351 (2021)
  29. [29] Rodriguez, J.A., Puri, A., Agarwal, S., Laradji, I.H., Rodriguez, P., Rajeswar, S., Vazquez, D., Pal, C., Pedersoli, M.: StarVector: Generating scalable vector graphics code from images and text. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 16175–16186 (2025)
  30. [30] Rodriguez, J.A., Zhang, H., Puri, A., Feizi, A., Pramanik, R., Wichmann, P., Mondal, A., Samsami, M.R., Awal, R., Taslakian, P., et al.: Rendering-aware reinforcement learning for vector graphics generation. arXiv preprint arXiv:2505.20793 (2025)
  31. [31] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2022)
  32. [32] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)
  33. [33] Tang, Z., Wu, C., Zhang, Z., Ni, M., Yin, S., Liu, Y., Yang, Z., Wang, L., Liu, Z., Li, J., et al.: StrokeNUWA: Tokenizing strokes for vector graphic synthesis. In: International Conference on Machine Learning, pp. 47830–47845. PMLR (2024)
  34. [34] Thamizharasan, V., Liu, D., Fisher, M., Zhao, N., Kalogerakis, E., Lukac, M.: NIVeL: Neural implicit vector layers for text-to-vector generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4589–4597 (2024)
  35. [35] Vinker, Y., Alaluf, Y., Cohen-Or, D., Shamir, A.: CLIPascene: Scene sketching with different types and levels of abstraction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4146–4156 (2023)
  36. [36] Vinker, Y., Pajouheshgar, E., Bo, J.Y., Bachmann, R.C., Bermano, A.H., Cohen-Or, D., Zamir, A., Shamir, A.: CLIPasso: Semantically-aware object sketching. ACM Transactions on Graphics (TOG) 41(4), 1–11 (2022)
  37. [37] Vinker, Y., Shaham, T.R., Zheng, K., Zhao, A., E Fan, J., Torralba, A.: SketchAgent: Language-driven sequential sketch generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23355–23368 (2025)
  38. [38] Wang, C., Zhou, H., Luo, L., Yu, Q.: ViewCraft3D: High-fidelity and view-consistent 3D vector graphics synthesis. arXiv preprint arXiv:2505.19492 (2025)
  39. [39] Wang, F., Zhao, Z., Liu, Y., Zhang, D., Gao, J., Sun, H., Li, X.: SVGen: Interpretable vector graphics generation with large language models. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 9608–9617 (2025)
  40. [40] Wang, H., Yin, J., Wei, Q., Zeng, W., Gu, L., Ye, S., Gao, Z., Wang, Y., Zhang, Y., Li, Y., et al.: InternSVG: Towards unified SVG tasks with multimodal large language models. arXiv preprint arXiv:2510.11341 (2025)
  41. [41] Wang, Y., Lian, Z.: DeepVecFont: Synthesizing high-quality vector fonts via dual-modality learning. ACM Transactions on Graphics (TOG) 40(6) (2021)
  42. [42] Wang, Z., Huang, J., Sun, Z., Gong, Y., Cohen-Or, D., Lu, M.: Layered image vectorization via semantic simplification. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 7728–7738 (2025)
  43. [43] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4), 600–612 (2004)
  44. [44] Wu, R., Su, W., Liao, J.: Chat2SVG: Vector graphics generation with large language models and image diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 23690–23700 (2025)
  45. [45] Wu, R., Su, W., Liao, J.: LayerPeeler: Autoregressive peeling for layer-wise image vectorization. In: Proceedings of the SIGGRAPH Asia 2025 Conference Papers (2025)
  46. [46] Wu, R., Su, W., Ma, K., Liao, J.: IconShop: Text-guided vector icon synthesis with autoregressive transformers. ACM Transactions on Graphics (TOG) 42(6), 1–14 (2023)
  47. [47] Wu, X., Sun, K., Zhu, F., Zhao, R., Li, H.: Human preference score: Better aligning text-to-image models with human preference. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2096–2105 (2023)
  48. [48] Xing, X., Guan, Y., Zhang, J., Xu, D., Yu, Q.: Reason-SVG: Hybrid reward RL for aha-moments in vector graphics generation. arXiv preprint arXiv:2505.24499 (2025)
  49. [49] Xing, X., Hu, J., Liang, G., Zhang, J., Xu, D., Yu, Q.: Empowering LLMs to understand and generate complex vector graphics. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19487–19497 (2025)
  50. [50] Xing, X., Hu, J., Zhang, J., Xu, D., Yu, Q.: SVGFusion: Scalable text-to-SVG generation via vector space diffusion. arXiv preprint arXiv:2412.10437 (2024)
  51. [51] Xing, X., Wang, C., Zhou, H., Zhang, J., Yu, Q., Xu, D.: DiffSketcher: Text guided vector sketch synthesis through latent diffusion models. Advances in Neural Information Processing Systems 36, 15869–15889 (2023)
  52. [52] Xing, X., Yu, Q., Wang, C., Zhou, H., Zhang, J., Xu, D.: SVGDreamer++: Advancing editability and diversity in text-guided SVG generation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  53. [53] Xing, X., Zhou, H., Wang, C., Zhang, J., Xu, D., Yu, Q.: SVGDreamer: Text guided SVG generation with diffusion model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4546–4555 (2024)
  54. [54] Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36, 15903–15935 (2023)
  55. [55] Yan, W.: Integrating web 2D and 3D technologies for architectural visualization: Applications of SVG and X3D/VRML in environmental behavior simulation. In: Proceedings of the Eleventh International Conference on 3D Web Technology, pp. 37–45 (2006)
  56. [56] Yang, Y., Cheng, W., Chen, S., Zeng, X., Yin, F., Zhang, J., Wang, L., Yu, G., Ma, X., Jiang, Y.G.: OmniSVG: A unified scalable vector graphics generation model. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
  57. [57] Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Song, S., Huang, G.: Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? arXiv preprint arXiv:2504.13837 (2025)
  58. [58] Zhang, P., Zhao, N., Fisher, M., Xu, Y., Liao, J., Liu, D.: DuetSVG: Unified multimodal SVG generation with internal visual guidance. arXiv preprint arXiv:2512.10894 (2025)
  59. [59] Zhang, P., Zhao, N., Liao, J.: Text-to-vector generation with neural path representation. ACM Transactions on Graphics (TOG) 43(4), 1–13 (2024)
  60. [60] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)
  61. [61] Zhao, K., Bao, L., Li, Y., Su, X., Zhang, K., Qiao, X.: Less is more: Efficient image vectorization with adaptive parameterization. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18166–18175 (2025)
  62. [62] Zhou, H., Zhang, H., Wang, B.: Segmentation-guided layer-wise image vectorization with gradient fills. In: European Conference on Computer Vision, pp. 165–180. Springer (2024)
  63. [63] Zhu, H., Chong, J.I., Hu, T., Yi, R., Lai, Y.K., Rosin, P.L.: SAMVG: A multi-stage image vectorization model with the Segment-Anything Model. In: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4350–4354. IEEE (2024)