pith. machine review for the scientific record.

arxiv: 2604.10940 · v1 · submitted 2026-04-13 · 💻 cs.CV

Recognition: unknown

AmodalSVG: Amodal Image Vectorization via Semantic Layer Peeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords amodal vectorization · semantic layer peeling · SVG · image vectorization · occlusion inpainting · vector graphics editing · VLM-guided decomposition

The pith

AmodalSVG reconstructs natural images into separate editable vector layers that include the full geometry of each object, even parts hidden by occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AmodalSVG as a way to convert photographs into vector graphics where every object appears as its own complete shape rather than stopping at visible edges. It does this by first breaking the image into semantic layers with the help of a vision-language model that guides progressive separation and uses inpainting to fill in occluded areas, then vectorizing each layer on its own. The result is a set of independent SVGs that preserve object identities and allow direct edits to individual elements without affecting the rest of the image. This addresses the limitation of standard vectorization tools that produce entangled and incomplete representations when objects overlap.
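The two-stage structure described above can be sketched with a deliberately tiny stand-in: flat-colored axis-aligned rectangles whose depth order is known from their color id, with a rectangle shape prior playing the role of the paper's VLM guidance and hybrid inpainting. Every name and rule here is illustrative, not the paper's actual method; it only shows how peeling plus completion yields amodal layers that each vectorize independently.

```python
import numpy as np

def peel_amodal_layers(image, bg=0):
    """Peel flat-colored layers front-to-back; complete each with a bbox prior.

    Toy depth rule: larger id = nearer. Toy amodal completion: a solid
    axis-aligned rectangle's full extent is the bounding box of its visible
    pixels (valid whenever each extreme row/column keeps one visible pixel),
    standing in for the paper's inpainting step.
    """
    layers = []
    img = image.copy()
    while True:
        ids = [c for c in np.unique(img) if c != bg]
        if not ids:
            return layers
        front = max(ids)
        ys, xs = np.where(img == front)
        amodal = np.zeros(img.shape, dtype=bool)
        amodal[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
        layers.append((front, amodal))
        img[img == front] = bg  # peel the layer off, exposing what it hid

def layer_to_svg_rect(layer_id, amodal):
    """Stage-2 stand-in: emit one SVG primitive per layer (real ALV uses many)."""
    ys, xs = np.where(amodal)
    x0, y0 = xs.min(), ys.min()
    return (f'<rect id="layer{layer_id}" x="{x0}" y="{y0}" '
            f'width="{xs.max() - x0 + 1}" height="{ys.max() - y0 + 1}"/>')
```

Even in this toy, the back rectangle's layer recovers its occluded pixels, so editing or repositioning the front layer leaves a complete object behind it, which is the editing property the paper targets.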

Core claim

AmodalSVG reformulates image vectorization as a two-stage process that first applies Semantic Layer Peeling to decompose an input image into amodally complete semantic layers through VLM-guided progressive decomposition and hybrid inpainting to recover occluded object appearances, then converts those layers into independent SVGs using Adaptive Layered Vectorization that allocates primitives according to an error budget.
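The error-budget mechanism in the second stage can be illustrated in one dimension, with piecewise-constant segments standing in for vector primitives. Only the control loop is meant to mirror ALV as summarized here (add primitives while reconstruction error exceeds the budget, then prune any the budget can spare); the fitting itself, the split rule, and the names are all toy choices, not the paper's algorithm.

```python
import numpy as np

def fit_with_budget(signal, err_budget, max_prims=64):
    """Grow and shrink a set of piecewise-constant 'primitives' to meet a budget.

    Returns cut points; segment [a, b) is one primitive fit by its mean.
    """
    cuts = [0, len(signal)]  # start with a single primitive

    def seg_sse(a, b):
        return ((signal[a:b] - signal[a:b].mean()) ** 2).sum()

    def sse():
        return sum(seg_sse(a, b) for a, b in zip(cuts, cuts[1:]))

    # Primitive addition: split the worst segment until error <= budget.
    while sse() > err_budget and len(cuts) - 1 < max_prims:
        a, b = max(zip(cuts, cuts[1:]), key=lambda s: seg_sse(*s))
        if b - a < 2:
            break
        cuts = sorted(set(cuts) | {(a + b) // 2})

    # Primitive pruning: drop any cut whose removal keeps error <= budget.
    for c in cuts[1:-1]:
        trial = [x for x in cuts if x != c]
        saved, cuts = cuts, trial
        if sse() > err_budget:
            cuts = saved  # pruning would bust the budget; keep the cut
    return cuts
```

The budget threshold is the one free parameter: tighten it and primitives are added; loosen it and pruning removes them, which is the fidelity-versus-complexity trade the core claim describes.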

What carries the argument

Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers while using hybrid inpainting to recover complete object appearances under occlusions.

If this is right

  • The resulting amodal layers support object-level editing directly in the vector domain.
  • SVGs become semantically organized rather than entangled across overlapping objects.
  • Geometric completeness is achieved for each object rather than only tracing visible pixels.
  • Independent vectorization of layers enables capabilities absent from prior modal vectorization methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Design software could incorporate amodal layers as a native editing mode so users adjust hidden parts of objects without redrawing.
  • The approach may extend naturally to video by propagating layer decompositions across frames to maintain consistent editable vectors over time.
  • Automated illustration pipelines could use the complete layers to generate variants where objects are rearranged or restyled while preserving vector editability.

Load-bearing premise

The vision-language model guided peeling step must produce accurate semantic separation and artifact-free inpainted appearances for occluded regions so that the subsequent vectorization remains high quality.

What would settle it

A controlled test set of images containing objects with known full geometries where the output amodal SVGs are checked for geometric fidelity in hidden regions and for whether manual edits to individual layers produce visually consistent results without introducing new artifacts.
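One concrete way to run such a check, assuming ground-truth full object masks are available: score each predicted amodal mask only on the pixels the input image never showed, where a modal method cannot succeed by construction. The function below is an evaluation-harness suggestion, not a metric the paper reports.

```python
import numpy as np

def hidden_region_iou(pred_amodal, gt_full, visible):
    """IoU between prediction and ground truth over occluded pixels only.

    pred_amodal: predicted full-object mask (bool array)
    gt_full:     ground-truth full-object mask, including hidden parts
    visible:     mask of object pixels actually visible in the input
    """
    hidden = gt_full & ~visible                # pixels the input never showed
    pred_hidden = pred_amodal & ~visible       # prediction restricted likewise
    inter = (pred_hidden & hidden).sum()
    union = pred_hidden.sum() + hidden.sum() - inter
    return 1.0 if union == 0 else inter / union
```

A score near 1.0 on objects with known geometry would support the amodal-completeness claim; the complementary editing test (mutate one layer, re-render, diff against the untouched regions) checks that per-layer edits stay artifact-free.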

Figures

Figures reproduced from arXiv: 2604.10940 by Anran Qi, Buyu Li, Dong Xu, Guotao Liang, Juncheng Hu, Qian Yu, Sheng Wang, Ziteng Xue.

Figure 1. AmodalSVG vectorizes a raster image into semantically decoupled and amodally complete vector layers (e.g., the unoccluded lighthouse and its shadow on the top left, and the full giraffe on the top right). The resulting vector layers enable object-level editing, such as Reposition, Recolor, Replace, Resize, Reorder, and Remove, directly in the vector domain while maintaining global visual consistency (b…
Figure 2. Overview of AmodalSVG. Our framework adopts a two-stage pipeline: raster-level semantic decoupling and completion, followed by layer-wise vectorization. (Left) For semantic decoupling and completion, Semantic Layer Peeling (SLP) iteratively identifies entities via VLMs to guide SAM segmentation. A hybrid inpainting mechanism then reconstitutes occluded regions to produce structurally integral raster l…
Figure 3. Qualitative comparison of SVG reconstruction fidelity.
Figure 4. Qualitative comparison of SVG layering quality.
Figure 5. Ablation of Semantic Layer Peeling (SLP) strategies.
Figure 6. Qualitative ablation of ALV. Content-aware initialization prevents convergence failures and facilitates structural recovery. Primitive addition helps capture high-frequency details (e.g., melon net), and primitive pruning removes redundant primitives for efficiency. Our full ALV achieves superior visual fidelity with an optimized primitive budget.
Figure 7. Structural editability and downstream applications.
Original abstract

We introduce AmodalSVG, a new framework for amodal image vectorization that produces semantically organized and geometrically complete SVG representations from natural images. Existing vectorization methods operate under a modal paradigm: tracing only visible pixels and disregarding occlusion. Consequently, the resulting SVGs are semantically entangled and geometrically incomplete, limiting SVG's structural editability. In contrast, AmodalSVG reconstructs full object geometries, including occluded regions, into independent, editable vector layers. To achieve this, AmodalSVG reformulates image vectorization as a two-stage framework, performing semantic decoupling and completion in the raster domain to produce amodally complete semantic layers, which are then independently vectorized. In the first stage, we introduce Semantic Layer Peeling (SLP), a VLM-guided strategy that progressively decomposes an image into semantically coherent layers. By hybrid inpainting, SLP recovers complete object appearances under occlusions, enabling explicit semantic decoupling. To vectorize these layers efficiently, we propose Adaptive Layered Vectorization (ALV), which dynamically modulates the primitive budget via an error-budget-driven adjustment mechanism. Extensive experiments demonstrate that AmodalSVG significantly outperforms prior methods in visual fidelity. Moreover, the resulting amodal layers enable object-level editing directly in the vector domain, capabilities not supported by existing vectorization approaches. Code will be released upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AmodalSVG, a two-stage framework for amodal image vectorization from natural images. It first applies Semantic Layer Peeling (SLP), a VLM-guided progressive decomposition with hybrid inpainting to recover complete object appearances under occlusion and produce semantically decoupled raster layers; these are then independently vectorized via Adaptive Layered Vectorization (ALV), which modulates primitive budgets using an error-budget threshold. The result is a set of independent, geometrically complete SVG layers that support object-level editing in the vector domain, unlike existing modal vectorization methods.

Significance. If the central claims hold, the work would be significant for extending vector graphics beyond visible surfaces, enabling new editing workflows in design and graphics applications. The SLP and ALV components introduce a novel pipeline for semantic decoupling and adaptive vectorization; the planned code release would further strengthen reproducibility.

major comments (2)
  1. [Abstract] Abstract: the claim that AmodalSVG 'significantly outperforms prior methods in visual fidelity' and enables 'capabilities not supported by existing vectorization approaches' is unsupported by any quantitative metrics, baselines, ablation results, or dataset details, which are load-bearing for the central claim of amodal completeness and editability.
  2. [SLP stage] SLP stage: the VLM-guided semantic decoupling and hybrid inpainting for occluded regions lacks any reported metrics isolating completion quality (e.g., amodal mask IoU, perceptual consistency of inpainted textures, or error propagation into ALV), leaving the independence and geometric completeness of the resulting layers under-supported given known VLM hallucination risks under heavy occlusion.
minor comments (1)
  1. Define all acronyms (VLM, SLP, ALV) at first use and ensure consistent notation for the error-budget threshold across the method description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on where the supporting evidence appears in the paper and noting where revisions will be made to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that AmodalSVG 'significantly outperforms prior methods in visual fidelity' and enables 'capabilities not supported by existing vectorization approaches' is unsupported by any quantitative metrics, baselines, ablation results, or dataset details, which are load-bearing for the central claim of amodal completeness and editability.

    Authors: The abstract is a concise summary of the paper's contributions and conclusions. The quantitative support for these claims—including comparisons to prior vectorization methods on visual fidelity metrics (e.g., PSNR, SSIM, LPIPS), ablation studies on the SLP and ALV components, baseline implementations, and dataset details—is provided in full in Section 4 (Experiments) and Section 3 (Datasets and Implementation). Editability is demonstrated through both qualitative object-level editing examples and a user study in Section 4.3. We do not believe the abstract itself requires expansion given typical length constraints, as the load-bearing evidence resides in the body of the paper. revision: no

  2. Referee: [SLP stage] SLP stage: the VLM-guided semantic decoupling and hybrid inpainting for occluded regions lacks any reported metrics isolating completion quality (e.g., amodal mask IoU, perceptual consistency of inpainted textures, or error propagation into ALV), leaving the independence and geometric completeness of the resulting layers under-supported given known VLM hallucination risks under heavy occlusion.

    Authors: We acknowledge that isolated metrics for the SLP stage would provide stronger support. The current manuscript reports end-to-end results and qualitative layer visualizations in Section 4.2, which show improved semantic independence and geometric completeness relative to modal baselines. However, we did not report standalone amodal completion metrics such as mask IoU or texture perceptual scores, partly because standard datasets lack amodal ground truth. We will add a dedicated ablation subsection with perceptual metrics (e.g., FID for inpainted regions) and explicit discussion of hallucination mitigation via the hybrid inpainting strategy. This revision will be incorporated. revision: yes

Circularity Check

0 steps flagged

No circularity in the proposed AmodalSVG pipeline

Full rationale

The paper describes a two-stage algorithmic framework (SLP for VLM-guided semantic decomposition plus hybrid inpainting, followed by ALV for error-budget-driven vectorization) rather than any mathematical derivation chain, first-principles result, or prediction. No equations, fitted parameters renamed as outputs, self-definitional constructs, or load-bearing self-citations appear in the abstract or described method. The central claims rest on the empirical performance of the pipeline components, which are presented as novel but externally motivated techniques, not reductions to their own inputs by construction. This is a standard non-circular method paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The framework rests on the assumption that current VLMs and inpainting techniques can produce sufficiently accurate semantic layers and completions; no free parameters are explicitly named in the abstract, but the primitive budget modulation in ALV implies tunable thresholds.

free parameters (1)
  • error budget threshold
    Used to dynamically adjust primitive count during vectorization of each layer.
axioms (1)
  • domain assumption: VLMs can guide accurate progressive semantic decomposition and occlusion recovery via hybrid inpainting
    Invoked as the core mechanism of Semantic Layer Peeling.
invented entities (2)
  • Semantic Layer Peeling (SLP) no independent evidence
    purpose: Progressively decompose image into semantically coherent layers while recovering occluded regions
    New VLM-guided strategy introduced for the first stage.
  • Adaptive Layered Vectorization (ALV) no independent evidence
    purpose: Vectorize each amodal layer with dynamic primitive budget control
    New mechanism for the second stage to balance fidelity and complexity.

pith-pipeline@v0.9.0 · 5557 in / 1262 out tokens · 33403 ms · 2026-05-10T16:32:43.477412+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Reference graph

Works this paper leans on

60 extracted references · 9 canonical work pages · cited by 1 Pith paper · 3 internal anchors
