PaintCopilot: Modeling Painting as Autonomous Artistic Continuation
Pith reviewed 2026-05-21 05:26 UTC · model grok-4.3
The pith
PaintCopilot models painting as an open-ended autoregressive process that generates the next brushstroke from the current canvas and stroke history without a target image.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that painting can be modeled as autonomous artistic continuation by predicting future strokes directly from learned artistic dynamics conditioned on evolving canvas states and prior brushstroke history. It does this through three models: a ViT-based Target Predictor that infers artist intent from partial observations, an autoregressive Next Stroke Predictor using flow matching to generate temporally coherent brushstrokes, and a VAE-based Region Sampler for on-demand localized sequences. This enables four interactive workflows (Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush), demonstrated in case studies with professional artists to support fluid co-creative painting in which artists and AI alternate control.
What carries the argument
The autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, conditioned on the ViT-based Target Predictor output and the history of canvas states plus prior strokes.
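To make the mechanism concrete, here is a minimal sketch of autoregressive stroke sampling with flow matching, in Python with NumPy. Everything named here is an assumption for illustration: the 5-dimensional stroke vector, the `toy_intent` rule that stands in for the learned conditioning, and the closed-form `toy_velocity` field that stands in for a trained network. The paper's actual parameterization is not given in this summary.

```python
import numpy as np

STROKE_DIM = 5  # hypothetical layout: x, y, radius, opacity, gray level


def toy_intent(history):
    """Stand-in for the learned conditioning: where the next stroke should land.

    Assumption: continue from the last stroke with a fixed offset in x.
    """
    if len(history) == 0:
        return np.full(STROKE_DIM, 0.5)
    target = history[-1].copy()
    target[0] = min(target[0] + 0.1, 1.0)
    return target


def toy_velocity(x, t, target):
    """Closed-form velocity for the straight path x_t = (1 - t) * x0 + t * target.

    A trained flow-matching network would replace this function.
    """
    return (target - x) / (1.0 - t)


def sample_stroke(history, rng, steps=8):
    """Draw noise, then integrate dx/dt = v(x, t) from t = 0 to t = 1 with Euler steps."""
    target = toy_intent(history)
    x = rng.standard_normal(STROKE_DIM)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t never reaches 1, so the division in toy_velocity is safe
        x = x + dt * toy_velocity(x, t, target)
    return x


def continue_painting(history, n_new, seed=0):
    """Autoregressive loop: each sampled stroke is appended to the history."""
    rng = np.random.default_rng(seed)
    history = [h.copy() for h in history]
    for _ in range(n_new):
        history.append(sample_stroke(history, rng))
    return history
```

Because the toy field points at a single target, every sampled stroke lands on `toy_intent(history)` up to floating-point rounding. A learned velocity field would instead transport noise to a distribution over plausible next strokes.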
If this is right
- The system supports an Optimize History workflow that refines earlier brushstroke decisions based on later canvas state.
- Stroke Completion allows the model to extend an interrupted sequence of brushstrokes while preserving temporal coherence.
- Region Inpainting lets the VAE-based sampler synthesize new stroke sequences in user-specified canvas areas on demand.
- Dynamic Brush mode enables real-time switching among Hard Round, Brush Tip, and 2D Gaussian representations during co-creation.
- Case studies show artists and the AI can alternate control throughout an entire painting session.
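The brush representations named in the Dynamic Brush item can be illustrated with a small sketch. The code below is an assumption-laden toy, not the paper's renderer: it shows one plausible way to write a 2D Gaussian footprint and a Hard Round footprint as smooth functions of position and size, and to blend either onto a grayscale canvas. The function names, the sigmoid edge, and the normalized coordinate convention are all hypothetical. The Brush Tip representation is omitted because the summary gives no detail about it.

```python
import numpy as np


def pixel_grid(size):
    """Pixel-center coordinates normalized to [0, 1]."""
    axis = (np.arange(size) + 0.5) / size
    return np.meshgrid(axis, axis, indexing="xy")


def gaussian_alpha(size, cx, cy, sigma):
    """Isotropic 2D Gaussian footprint with peak value 1 at (cx, cy)."""
    xs, ys = pixel_grid(size)
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def hard_round_alpha(size, cx, cy, radius, edge=0.01):
    """Disk footprint with a narrow sigmoid edge, so the mask varies smoothly
    with cx, cy, and radius instead of jumping between 0 and 1."""
    xs, ys = pixel_grid(size)
    dist = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    return 1.0 / (1.0 + np.exp((dist - radius) / edge))


def composite(canvas, alpha, color, opacity=1.0):
    """Standard 'over' blend of a single-color stroke onto a grayscale canvas."""
    a = opacity * alpha
    return (1.0 - a) * canvas + a * color


# Usage: paint a dark Gaussian dab, then a hard round dab, on a white canvas.
canvas = np.ones((65, 65))
canvas = composite(canvas, gaussian_alpha(65, 0.5, 0.5, 0.1), color=0.0)
canvas = composite(canvas, hard_round_alpha(65, 0.25, 0.25, 0.1), color=0.3)
```

Switching brushes here only means swapping the footprint function, since both share the same `composite` step. Whether the paper's system is organized this way is not stated in the summary.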
Where Pith is reading between the lines
- The same autoregressive framing could extend to other time-based creative domains such as musical composition or sequential sculpture.
- Accumulating longer interaction histories might let the models adapt to an individual artist's personal style over multiple sessions.
- This approach could shift digital art tools from single-prompt generation toward sustained collaborative processes used in education or therapy.
- Training on larger collections of recorded artist painting sessions would be a direct way to test and improve the Target Predictor's accuracy.
Load-bearing premise
The ViT-based Target Predictor can reliably infer artist intent from partial canvas observations to condition the autoregressive Next Stroke Predictor.
What would settle it
If artists using the system in repeated sessions consistently find that the generated strokes do not match their evolving intent even after several turns of interaction, that would show the intent inference step fails to capture artistic dynamics.
Original abstract
We present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. Unlike existing neural painting methods that frame painting as pixel reconstruction toward a predefined reference, PaintCopilot predicts future strokes directly from learned artistic dynamics, analogous to how large language models continue text sequences from prior context. The framework proposes three complementary models: a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler that synthesizes semantically localized stroke sequences on demand. Built on three differentiable brush representations (Hard Round, Brush Tip, and 2D Gaussian), the system supports four interactive workflows: Optimize History, Stroke Completion, Region Inpainting, and Dynamic Brush. Through case studies with professional artists, we demonstrate that PaintCopilot enables fluid co-creative painting workflows in which artists and AI continuously alternate control throughout the creative process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to present PaintCopilot, a co-creative neural painting assistant that models painting as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history, without requiring a target image. It uses a ViT-based Target Predictor to infer artist intent, an autoregressive Next Stroke Predictor with flow matching for generating brushstrokes, and a VAE-based Region Sampler for localized sequences. The system supports four interactive workflows and is demonstrated via case studies with professional artists.
Significance. If the results hold, this approach represents a significant shift from target-driven reconstruction to autonomous continuation in neural painting, drawing an analogy to LLM text generation. This could open new avenues for co-creative tools in computer vision and digital art, with practical support for interactive workflows using differentiable brush models.
major comments (1)
- [Abstract and §4] Abstract and §4 (Evaluation): The manuscript relies solely on qualitative case studies with professional artists and supplies no quantitative metrics, ablation studies, or baseline comparisons. This leaves the reliability of the ViT-based Target Predictor for inferring evolving artist intent from partial observations untested, which is load-bearing for the central autoregressive continuation claim without a target image.
minor comments (1)
- [§3.1] §3.1: Specify the precise integration of the Target Predictor output into the flow-matching Next Stroke Predictor (e.g., concatenation, cross-attention, or conditioning vector) to clarify the autoregressive mechanism.
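The three integration options the referee lists differ in how the intent signal reaches the stroke tokens. The sketch below contrasts them in Python with NumPy. It is purely illustrative: the function names, the single-head attention, the residual connection, and all shapes are assumptions, and nothing here reflects which option the paper actually uses.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def condition_by_concat(stroke_tokens, intent_vec):
    """Option 1: append the same pooled intent vector to every stroke token."""
    n = stroke_tokens.shape[0]
    tiled = np.broadcast_to(intent_vec, (n, intent_vec.shape[0]))
    return np.concatenate([stroke_tokens, tiled], axis=1)


def condition_by_cross_attention(stroke_tokens, intent_tokens, w_q, w_k, w_v):
    """Option 2: single-head cross-attention.

    Stroke tokens act as queries; intent patch tokens act as keys and values.
    Returns the residual-updated stroke tokens and the attention weights.
    """
    q = stroke_tokens @ w_q
    k = intent_tokens @ w_k
    v = intent_tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)
    return stroke_tokens + weights @ v, weights


def condition_by_modulation(stroke_tokens, intent_vec, w_scale, w_shift):
    """Option 3: a conditioning vector that scales and shifts every stroke token."""
    scale = intent_vec @ w_scale
    shift = intent_vec @ w_shift
    return stroke_tokens * (1.0 + scale) + shift
```

Concatenation and modulation both need a single pooled intent vector, while cross-attention can use per-patch tokens and so keeps spatial detail. That difference is one reason the referee's request for specifics matters.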
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address the major comment on evaluation below.
Referee: [Abstract and §4] Abstract and §4 (Evaluation): The manuscript relies solely on qualitative case studies with professional artists and supplies no quantitative metrics, ablation studies, or baseline comparisons. This leaves the reliability of the ViT-based Target Predictor for inferring evolving artist intent from partial observations untested, which is load-bearing for the central autoregressive continuation claim without a target image.
Authors: We acknowledge that the current evaluation relies on qualitative case studies with professional artists rather than quantitative metrics or ablations. This design choice reflects the open-ended, subjective nature of autonomous artistic continuation, where no canonical ground-truth target or continuation exists, rendering standard reconstruction metrics inapplicable. The ViT Target Predictor is assessed indirectly via its role in enabling coherent interactive workflows, as confirmed by artist feedback on intent alignment and stroke plausibility. We agree that explicit ablations would strengthen the presentation. In revision we will add component ablations (e.g., full model versus variants without the Target Predictor) using proxy measures such as stroke-sequence consistency on held-out artist sessions and user preference ratings. Direct baselines are limited because prior neural painting methods require target images; we will expand the related-work discussion to clarify this distinction while retaining the qualitative expert validation as primary evidence.
Revision: partial
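The rebuttal does not define "stroke-sequence consistency", so the following is only one hypothetical way such a proxy could be computed: the mean distance between each predicted next stroke and the stroke the artist actually made in a held-out session. The function names, the L2 distance, and the array layout are all assumptions made for this sketch.

```python
import numpy as np


def next_stroke_error(predict, session):
    """Mean L2 distance between predicted and recorded next strokes.

    `predict` maps a history array of shape (k, d) to one stroke of shape (d,).
    `session` is an array of shape (n, d) with one recorded stroke per row.
    """
    errors = []
    for k in range(1, len(session)):
        predicted = predict(session[:k])
        errors.append(np.linalg.norm(predicted - session[k]))
    return float(np.mean(errors))


def compare_variants(variants, sessions):
    """Average the error of each named predictor over all held-out sessions."""
    return {
        name: float(np.mean([next_stroke_error(fn, s) for s in sessions]))
        for name, fn in variants.items()
    }


# Usage with two toy predictors standing in for model variants.
session = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
variants = {
    "repeat_last": lambda h: h[-1],
    "step_right": lambda h: h[-1] + np.array([1.0, 0.0]),
}
scores = compare_variants(variants, [session])
```

An ablation of the kind the authors promise would pass the full model and the variant without the Target Predictor as the two entries in `variants`. A point-estimate distance like this penalizes any valid continuation that differs from the recorded one, which is the limitation the authors themselves raise about ground truth.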
Circularity Check
No significant circularity; framework uses standard learned components without self-referential reductions
The derivation chain models painting as autoregressive continuation conditioned on canvas state and stroke history, using a ViT-based Target Predictor to infer intent, flow-matching Next Stroke Predictor, and VAE Region Sampler. These are trained neural modules built on established architectures (ViT, flow matching, VAE) and differentiable brush representations. No equations or claims reduce a prediction to its own fitted inputs by construction, nor does any load-bearing step rely on self-citation chains or imported uniqueness theorems. The open-ended behavior without a target image is achieved through learned dynamics rather than definitional equivalence, making the central claim self-contained with independent empirical content from the proposed workflows and case studies.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Painting can be modeled as an open-ended autoregressive artistic behavior conditioned on evolving canvas states and prior brushstroke history.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "a ViT-based Target Predictor that infers artist intent from partial canvas observations, an autoregressive Next Stroke Predictor that generates temporally coherent brushstrokes via flow matching, and a VAE-based Region Sampler"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.