
arxiv: 2605.20941 · v1 · pith:XCXBJATDnew · submitted 2026-05-20 · 💻 cs.CV · cs.GR · cs.HC

PaintCopilot: Modeling Painting as Autonomous Artistic Continuation

Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.HC
keywords neural painting · autoregressive generation · co-creative AI · brushstroke prediction · flow matching · generative models · interactive art · computer vision

The pith

PaintCopilot models painting as an open-ended autoregressive process that generates the next brushstroke from the current canvas and stroke history without a target image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces PaintCopilot, a neural system that treats the act of painting as a sequence of decisions made one stroke at a time, much like generating the next word in a sentence. The models learn artistic patterns from how canvases evolve and what strokes have come before, rather than trying to match a fixed reference picture. A key part is using a vision transformer to guess what the artist is aiming for from the unfinished work, then using that to produce coherent next strokes with flow matching. Professional artists tested the tool in interactive sessions where control passes back and forth between human and AI. If this holds, it opens a path for AI to participate in creative processes as a partner that adapts to the direction the work is taking rather than enforcing an end goal.

Core claim

The paper establishes that painting can be modeled as autonomous artistic continuation by predicting future strokes directly from learned artistic dynamics conditioned on evolving canvas states and prior brushstroke history. It does this through three models: a ViT-based Target Predictor that infers artist intent from partial observations, an autoregressive Next Stroke Predictor using flow matching to generate temporally coherent brushstrokes, and a VAE-based Region Sampler for on-demand localized sequences. This enables four interactive workflows—Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush—demonstrated in case studies with professional artists to support fluid,

What carries the argument

The autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, conditioned on the ViT-based Target Predictor output and the history of canvas states plus prior strokes.
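
As a rough illustration of how a flow-matching sampler produces one stroke, the sketch below integrates a velocity field from Gaussian noise at t = 0 to a stroke vector at t = 1 with fixed Euler steps. Everything named here is an assumption made for illustration: `STROKE_DIM`, `toy_velocity`, and `sample_stroke` are hypothetical, the page does not give the paper's stroke parameterization, and the closed-form velocity stands in for a trained network. In the toy field the conditioning vector is itself the target stroke, which is not how a learned model would work but makes the result easy to check.

```python
import numpy as np

STROKE_DIM = 8  # hypothetical number of stroke parameters (position, size, color, ...)


def toy_velocity(x, t, cond):
    """Stand-in for a learned velocity network v(x, t | cond).

    On the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries a point x at time t to x1 at t = 1 is (x1 - x) / (1 - t).
    ASSUMPTION: cond plays the role of x1 here so the output is checkable.
    """
    return (cond - x) / max(1.0 - t, 1e-3)


def sample_stroke(velocity_fn, cond, steps=20, rng=None):
    """Draw one stroke vector by Euler-integrating dx/dt = v(x, t | cond)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(cond.shape)  # start from Gaussian noise at t = 0
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x


cond = np.linspace(0.0, 1.0, STROKE_DIM)  # hypothetical conditioning vector
stroke = sample_stroke(toy_velocity, cond, rng=np.random.default_rng(0))
```

With a trained model, `velocity_fn` would be a network that takes the Target Predictor output and the stroke history as conditioning; the integration loop itself would stay the same.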

If this is right

  • The system supports an Optimize History workflow that refines earlier brushstroke decisions based on later canvas state.
  • Stroke Completion allows the model to extend an interrupted sequence of brushstrokes while preserving temporal coherence.
  • Region Inpainting lets the VAE-based sampler synthesize new stroke sequences in user-specified canvas areas on demand.
  • Dynamic Brush mode enables real-time switching among Hard Round, Brush Tip, and 2D Gaussian representations during co-creation.
  • Case studies show artists and the AI can alternate control throughout an entire painting session.
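
The alternating-control idea in the list above can be sketched as a shared stroke history that both the artist and the model append to. This is a minimal sketch under stated assumptions, not the system's interface: `Session`, `continue_strokes`, and `toy_predictor` are hypothetical names, strokes are reduced to 2D points, and the predictor is a simple linear extrapolation standing in for the learned Next Stroke Predictor.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Stroke = Tuple[float, float]  # ASSUMPTION: a stroke is reduced to a 2D point


@dataclass
class Session:
    """Shared history that records each stroke and who made it."""
    history: List[Stroke] = field(default_factory=list)
    authors: List[str] = field(default_factory=list)

    def add(self, stroke: Stroke, author: str) -> None:
        self.history.append(tuple(stroke))
        self.authors.append(author)


def continue_strokes(session: Session,
                     predict_next: Callable[[List[Stroke]], Stroke],
                     n: int) -> None:
    """Autoregressive continuation: each new stroke sees all earlier strokes."""
    for _ in range(n):
        session.add(predict_next(session.history), "ai")


def toy_predictor(history: List[Stroke]) -> Stroke:
    """Placeholder model: repeat the last displacement (linear extrapolation)."""
    if len(history) < 2:
        return history[-1] if history else (0.0, 0.0)
    (x0, y0), (x1, y1) = history[-2], history[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


session = Session()
session.add((0.0, 0.0), "human")   # the artist paints two strokes
session.add((1.0, 1.0), "human")
continue_strokes(session, toy_predictor, 3)  # the model takes over
session.add((4.0, 6.0), "human")   # the artist takes control back
```

Because the model only reads `session.history`, it makes no difference who produced the earlier strokes, which is what lets control pass back and forth at any point.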

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same autoregressive framing could extend to other time-based creative domains such as musical composition or sequential sculpture.
  • Accumulating longer interaction histories might let the models adapt to an individual artist's personal style over multiple sessions.
  • This approach could shift digital art tools from single-prompt generation toward sustained collaborative processes used in education or therapy.
  • Training on larger collections of recorded artist painting sessions would be a direct way to test and improve the Target Predictor's accuracy.

Load-bearing premise

The ViT-based Target Predictor can reliably infer artist intent from partial canvas observations to condition the autoregressive Next Stroke Predictor.

What would settle it

If artists using the system in repeated sessions consistently find that the generated strokes do not match their evolving intent even after several turns of interaction, that would show the intent inference step fails to capture artistic dynamics.

Figures

Figures reproduced from arXiv: 2605.20941 by Paul Pu Liang, Yuancheng Shen, Yunge Wen.

Figure 1: PaintCopilot is a co-creative neural painting system that models painting as autonomous artistic continuation conditioned on evolving canvas states
Figure 2: Statistical analysis of human painting behavior collected from real digital painting sessions. (a) Example brushstroke trajectories recorded from
Figure 3: Brush representations supported by PaintCopilot. The system in
Figure 4: Example portrait paintings from the curated training dataset. Portrait
Figure 5: Stroke-sequence generation pipeline for dataset construction. Se
Figure 6: Overview of the PaintCopilot architecture. The Target Predictor estimates the artist’s intended outcome from the evolving canvas state. The Next
Figure 7: User interface of PaintCopilot during interactive painting. Artists
Figure 8: Dynamic target-intent prediction during painting progression. As
Figure 9: Target Predictor evaluation on 100 held-out paintings. Prediction
Figure 10: Autoregressive Stroke Predictor per-dimension prediction error.
Original abstract

We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.
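
To make the idea of a smooth brush representation concrete, the sketch below renders a single 2D Gaussian stroke as an opacity map and alpha-composites it onto a canvas. The parameterization (center, two standard deviations, rotation angle, opacity) and the function names `gaussian_stroke_alpha` and `composite` are assumptions for illustration; the abstract names the 2D Gaussian brush but this page does not give its definition. NumPy is used for readability; the same expression written in an autodiff framework would be differentiable with respect to every stroke parameter.

```python
import numpy as np


def gaussian_stroke_alpha(h, w, cx, cy, sx, sy, theta, opacity=1.0):
    """Opacity map of one anisotropic Gaussian stroke on an h x w canvas.

    ASSUMPTION: the stroke is parameterized by center (cx, cy), standard
    deviations (sx, sy) along its own axes, and rotation theta in radians.
    """
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    dx, dy = xs - cx, ys - cy
    c, s = np.cos(theta), np.sin(theta)
    u = c * dx + s * dy    # pixel coordinate along the stroke's major axis
    v = -s * dx + c * dy   # pixel coordinate along the stroke's minor axis
    return opacity * np.exp(-0.5 * ((u / sx) ** 2 + (v / sy) ** 2))


def composite(canvas, alpha, color):
    """Alpha-blend a stroke of the given RGB color over the canvas."""
    a = alpha[..., None]  # broadcast the opacity map over the RGB channels
    return (1.0 - a) * canvas + a * np.asarray(color, dtype=float)


canvas = np.ones((32, 32, 3))  # white canvas
alpha = gaussian_stroke_alpha(32, 32, cx=16, cy=16, sx=4.0, sy=2.0, theta=0.0)
painted = composite(canvas, alpha, color=(0.8, 0.1, 0.1))
```

Every pixel value is a smooth function of the stroke parameters, so gradients of an image-space loss can flow back to the stroke.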

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. It uses a ViT-based Target Predictor to infer artist intent, an autoregressive Next Stroke Predictor with flow matching for generating brushstrokes, and a VAE-based Region Sampler for localized sequences. The system supports four interactive workflows and is demonstrated via case studies with professional artists.

Significance. If the results hold, this approach represents a significant shift from target-driven reconstruction to autonomous continuation in neural painting, drawing an analogy to LLM text generation. This could open new avenues for co-creative tools in computer vision and digital art, with practical support for interactive workflows using differentiable brush models.

major comments (1)
  1. [Abstract and §4 (Evaluation)] The manuscript relies solely on qualitative case studies with professional artists and supplies no quantitative metrics, ablation studies, or baseline comparisons. This leaves the reliability of the ViT-based Target Predictor for inferring evolving artist intent from partial observations untested, which is load-bearing for the central autoregressive continuation claim without a target image.
minor comments (1)
  1. [§3.1] Specify the precise integration of the Target Predictor output into the flow-matching Next Stroke Predictor (e.g., concatenation, cross-attention, or conditioning vector) to clarify the autoregressive mechanism.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address the major comment on evaluation below.

Point-by-point responses
  1. Referee: [Abstract and §4 (Evaluation)] The manuscript relies solely on qualitative case studies with professional artists and supplies no quantitative metrics, ablation studies, or baseline comparisons. This leaves the reliability of the ViT-based Target Predictor for inferring evolving artist intent from partial observations untested, which is load-bearing for the central autoregressive continuation claim without a target image.

    Authors: We acknowledge that the current evaluation relies on qualitative case studies with professional artists rather than quantitative metrics or ablations. This design choice reflects the open-ended, subjective nature of autonomous artistic continuation, where no canonical ground-truth target or continuation exists, rendering standard reconstruction metrics inapplicable. The ViT Target Predictor is assessed indirectly via its role in enabling coherent interactive workflows, as confirmed by artist feedback on intent alignment and stroke plausibility. We agree that explicit ablations would strengthen the presentation. In revision we will add component ablations (e.g., full model versus variants without the Target Predictor) using proxy measures such as stroke-sequence consistency on held-out artist sessions and user preference ratings. Direct baselines are limited because prior neural painting methods require target images; we will expand the related-work discussion to clarify this distinction while retaining the qualitative expert validation as primary evidence.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; framework uses standard learned components without self-referential reductions

full rationale

The derivation chain models painting as autoregressive continuation conditioned on canvas state and stroke history, using a ViT-based Target Predictor to infer intent, flow-matching Next Stroke Predictor, and VAE Region Sampler. These are trained neural modules built on established architectures (ViT, flow matching, VAE) and differentiable brush representations. No equations or claims reduce a prediction to its own fitted inputs by construction, nor does any load-bearing step rely on self-citation chains or imported uniqueness theorems. The open-ended behavior without a target image is achieved through learned dynamics rather than definitional equivalence, making the central claim self-contained with independent empirical content from the proposed workflows and case studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract provides limited technical detail; the core modeling choice is treated as a domain assumption rather than derived from first principles.

axioms (1)
  • Domain assumption: Painting can be modeled as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history.
    This is the foundational premise stated directly in the abstract as the basis for the entire framework.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear
    Paper passage: "a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
