
arxiv: 2506.20911 · v2 · submitted 2025-06-26 · 💻 cs.CV

FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

Pith reviewed 2026-05-19 08:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-turn image editing · neurosymbolic agent · subroutine mining · toolpath planning · A* search · large language models · fast-slow planning

The pith

FaSTA* mines reusable subroutines from past toolpaths so LLMs can handle most multi-turn image edits with fast planning before falling back to A* search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FaSTA*, a neurosymbolic agent for multi-turn image editing tasks that combine several instructions such as object detection, recoloring, and removal. Large language models create high-level subtask plans while A* search finds precise, low-cost sequences of AI tool calls for each subtask. The system extracts common subroutines from earlier successful toolpaths through inductive reasoning and reuses them as new tools in later tasks. This produces an adaptive fast-slow loop in which subroutine selection is attempted first and A* search activates only for unfamiliar subtasks. The result is lower computational cost than recent baselines while success rates stay competitive.

Core claim

By continuously mining and refining symbolic subroutines from successful toolpaths, FaSTA* lets LLMs cover the majority of editing subtasks through fast rule-based selection, activating slow A* search only for novel cases, which reduces overall exploration cost on similar subtasks applied to new images.
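
As a rough illustration of what mining subroutines from successful toolpaths could look like, the sketch below counts contiguous tool sequences that recur across successful toolpaths for the same subtask. This is a frequency-counting stand-in, not the paper's procedure: FaSTA* relies on LLM inductive reasoning, and the function `mine_subroutines`, the `min_support` threshold, and the tool names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Toolpath = List[str]  # a sequence of tool names (illustrative names only)


def mine_subroutines(
    paths_by_subtask: Dict[str, List[Toolpath]],
    min_support: int = 2,
    min_len: int = 2,
) -> Dict[str, List[Tuple[str, ...]]]:
    """Return, per subtask, contiguous tool sequences that appear in at
    least `min_support` successful toolpaths, most frequent first."""
    mined: Dict[str, List[Tuple[str, ...]]] = {}
    for subtask, paths in paths_by_subtask.items():
        counts: Counter = Counter()
        for path in paths:
            seen = set()  # count each sequence at most once per toolpath
            for i in range(len(path)):
                for j in range(i + min_len, len(path) + 1):
                    seen.add(tuple(path[i:j]))
            counts.update(seen)
        frequent = [(seq, n) for seq, n in counts.items() if n >= min_support]
        # Order: higher support, then longer sequence, then lexicographic.
        frequent.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
        mined[subtask] = [seq for seq, _ in frequent]
    return mined


# Hypothetical history of successful toolpaths for one subtask.
history = {
    "recolor": [
        ["detect", "segment", "inpaint"],
        ["detect", "segment", "inpaint"],
        ["detect", "recolor_direct"],
    ]
}
subroutines = mine_subroutines(history)
```

A counting approach like this only finds literal repeats; it cannot attach applicability conditions to a subroutine, which is the part the paper delegates to the LLM.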

What carries the argument

Adaptive fast-slow planning that first tries LLM-selected or generated subroutines mined from prior successes and falls back to per-subtask A* search only on failure.
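
The control flow described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the selector, executor, quality check, and search are passed in as plain callables, and all names (`plan_subtask`, `PlanResult`, the toy tools) are hypothetical rather than taken from the FaSTA* code base.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Toolpath = List[str]


@dataclass
class PlanResult:
    mode: str          # "fast" (subroutine) or "slow" (search fallback)
    toolpath: Toolpath
    output: str


def plan_subtask(
    subtask: str,
    select_subroutine: Callable[[str], Optional[Toolpath]],
    execute: Callable[[Toolpath], str],
    passes_check: Callable[[str], bool],
    slow_search: Callable[[str], Toolpath],
) -> PlanResult:
    """Try the selected subroutine first; fall back to slow search on failure."""
    candidate = select_subroutine(subtask)
    if candidate is not None:
        output = execute(candidate)
        if passes_check(output):
            return PlanResult("fast", candidate, output)
    # Either no subroutine applies or its output failed the quality check.
    toolpath = slow_search(subtask)
    return PlanResult("slow", toolpath, execute(toolpath))


# Toy stand-ins for the LLM selector, tool executor, VLM check, and A* search.
rules = {"recolor": ["detect", "segment", "recolor"]}
run = lambda path: "edited via " + " > ".join(path)
check = lambda out: out.startswith("edited")
search = lambda subtask: ["searched_tool_for_" + subtask]

known = plan_subtask("recolor", rules.get, run, check, search)
novel = plan_subtask("replace", rules.get, run, check, search)
```

Here `known` takes the fast path because a subroutine exists and passes the check, while `novel` has no matching subroutine and goes straight to the search stand-in.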

Load-bearing premise

Large language models can reliably extract and refine subroutines from successful toolpaths that stay correct and reusable across similar images and tasks without introducing errors or losing coverage.

What would settle it

Measure whether disabling subroutine mining on a new collection of multi-turn editing tasks causes computation time to rise sharply or success rate to fall compared with the full FaSTA* system.

Figures

Figures reproduced from arXiv: 2506.20911 by Advait Gupta, Dang Nguyen, Rishie Raj, Tianyi Zhou.

Figure 1: Inductive Reasoning of Reusable Subroutines. Left: Reuse rate (% of applicable subtasks where a subroutine was utilized) of the top-5 learned subroutines. Right: Success rate (%) of fast planning (subroutines only, without A* search) on subtasks for a held-out test set of tasks. It increases exponentially as more reusable subroutines are extracted from an increasing number of explored tasks.
Figure 2: Top: Online learning (induction) and refinement of reusable subroutines from explored toolpaths for previous tasks. Bottom: Adaptive fast-slow planning framework in FaSTA*. Given a new task, FaSTA* first uses an LLM to generate a high-level plan of subtasks and then select a subroutine per subtask, yielding a fast plan. Only when the subroutine's output does not pass the quality check by VLMs, a slow plan…
Figure 3: Execution time (seconds) per image. FaSTA…
Figure 4: Qualitative comparison of FaSTA* with CoSTA* [9] and other leading image editing agents for complex multi-turn tasks. FaSTA* achieves visual results identical to CoSTA* and significantly surpasses other baselines in accuracy and coherence. Notably, FaSTA* delivers this high quality at roughly half the execution cost of CoSTA*, highlighting its superior efficiency. … average execution cost by over 49.3%, ach…
Figure 5: Cost-Quality Pareto Frontier. FaSTA* with various α values against CoSTA* and other baselines. FaSTA* achieves a superior frontier, offering better cost-quality trade-offs.
Figure 6: Failure case for “High-Level Only” execution versus FaSTA…
Figure 7: Qualitative examples of FaSTA*'s performance on sample tasks from the MagicBrush dataset [45]. These tasks were processed using the Subroutine Rule Table learned from non-benchmark data. Notably, all examples shown were successfully completed by FaSTA* relying entirely on its “fast plan” composed of learned subroutines, without needing to resort to the “slow planning” via A* search for any subtask.
Figure 8: Example demonstrating FaSTA*'s subroutine effectiveness. FaSTA* uses learned rules to select optimal paths (e.g., SD Search&Recolor for the small ball in row 1, avoiding SD Inpaint's potential failure), achieving results identical to CoSTA* at significantly lower average cost (15.21s vs. 25.32s for these examples) by preventing unnecessary exploration of suboptimal paths.
Figure 9: Distribution of total manipulations (subtask occurrences) across the specialized test datasets…
Figure 10: Illustration of subtask-specific dataset generation. A single base image from the CoSTA…
Figure 11: Visual example for the object recoloration trace detailed in Appendix…
Figure 12: Reuse rate (%) for all learned subroutines. The rate for each subroutine is calculated based…
Figure 13: Qualitative comparison of FaSTA* against CoSTA* [9] and Gemini 2.0 Flash Preview for complex multi-turn editing tasks (inputs on top). FaSTA* produces high-quality outputs visually identical to CoSTA* and superior to Gemini. Notably, FaSTA* achieves this CoSTA*-level quality at a significantly reduced execution cost, demonstrating its enhanced efficiency.
Original abstract

We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as "Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow." It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A* search per subtask to find a cost-efficient toolpath, a sequence of calls to AI tools. To save the cost of A* on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A* search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent "FaSTA*": fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A* search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate. Our code and data can be accessed at https://github.com/tianyi-lab/FaSTAR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FaSTA*, a neurosymbolic agent for multi-turn image editing tasks that combines LLM-based high-level subtask planning with per-subtask A* search to generate cost-efficient toolpaths. It adds an inductive subroutine mining step in which LLMs extract and refine reusable symbolic subroutines from previously successful toolpaths, enabling a fast-slow adaptive planner that prefers mined subroutines and falls back to A* only for novel cases. The central claim is that this yields significantly lower computational cost than recent baselines while remaining competitive in success rate.

Significance. If the subroutine mining step produces correct, generalizable subroutines that cover most recurring subtasks, the work would demonstrate a practical way to reduce the expense of repeated A* searches in iterative vision-language tool use, advancing hybrid fast-slow neurosymbolic agents. The public release of code and data at the cited GitHub repository is a clear strength that supports reproducibility.

major comments (2)
  1. [Method (subroutine mining procedure)] The efficiency claim rests on the assumption that LLM inductive reasoning on successful toolpaths yields subroutines that are both correct and sufficiently broad to avoid frequent fallback to A*; no formal verification, manual inspection protocol, or error-rate analysis of the mined subroutines is described, leaving open the possibility that over-generalization or invalid sequences would erode the reported savings.
  2. [Abstract and Experiments section] Abstract and experimental claims state that FaSTA* is 'significantly more computationally efficient' while 'competitive with the state-of-the-art baseline in terms of success rate,' yet the provided text contains no quantitative tables, ablation isolating mining quality, error bars, or per-subtask fallback frequencies; without these data the central efficiency result cannot be verified.
minor comments (2)
  1. [Introduction] The notation distinguishing 'subroutine' from 'toolpath' and 'subtask' should be defined once at first use and used consistently thereafter.
  2. [Method] A brief discussion of how mined subroutines are stored, indexed, and retrieved (e.g., as new callable tools) would improve clarity of the adaptive planner.
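
One way the storage and retrieval raised in minor comment 2 could work is sketched below: subroutines are indexed by subtask and guarded by an activation rule over simple context features. This is an assumption-laden illustration, not the paper's Subroutine Rule Table; the class names, the `object_area` feature, and the 0.05 threshold are invented, and only the tool names echo those mentioned in the Figure 8 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Context = Dict[str, float]


@dataclass
class Subroutine:
    name: str
    subtask: str
    toolpath: List[str]
    condition: Callable[[Context], bool]  # hypothetical activation rule


class RuleTable:
    """Stores subroutines per subtask and returns the first whose rule matches."""

    def __init__(self) -> None:
        self._by_subtask: Dict[str, List[Subroutine]] = {}

    def add(self, sub: Subroutine) -> None:
        self._by_subtask.setdefault(sub.subtask, []).append(sub)

    def lookup(self, subtask: str, context: Context) -> Optional[Subroutine]:
        for sub in self._by_subtask.get(subtask, []):
            if sub.condition(context):
                return sub
        return None  # the caller would fall back to search


table = RuleTable()
# The 0.05 area threshold and the feature name are made up for illustration.
table.add(Subroutine("small_object_recolor", "recolor",
                     ["detect", "sd_search_recolor"],
                     lambda ctx: ctx.get("object_area", 1.0) < 0.05))
table.add(Subroutine("default_recolor", "recolor",
                     ["detect", "segment", "sd_inpaint"],
                     lambda ctx: True))

small = table.lookup("recolor", {"object_area": 0.02})
large = table.lookup("recolor", {"object_area": 0.40})
missing = table.lookup("remove", {})
```

Ordering rules from most to least specific keeps lookup simple, at the price of making insertion order significant.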

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional details and data where appropriate.

Point-by-point responses
  1. Referee: [Method (subroutine mining procedure)] The efficiency claim rests on the assumption that LLM inductive reasoning on successful toolpaths yields subroutines that are both correct and sufficiently broad to avoid frequent fallback to A*; no formal verification, manual inspection protocol, or error-rate analysis of the mined subroutines is described, leaving open the possibility that over-generalization or invalid sequences would erode the reported savings.

    Authors: We agree that the manuscript would benefit from explicit validation of the mined subroutines. The current description relies on end-to-end task success to imply subroutine quality. In the revised version, we will add a dedicated subsection describing our manual inspection protocol (sampling 50 mined subroutines and checking for syntactic validity, semantic correctness on held-out images, and generality), report the observed error rate, and include per-task fallback frequencies to A* to quantify subroutine coverage. revision: yes

  2. Referee: [Abstract and Experiments section] Abstract and experimental claims state that FaSTA* is 'significantly more computationally efficient' while 'competitive with the state-of-the-art baseline in terms of success rate,' yet the provided text contains no quantitative tables, ablation isolating mining quality, error bars, or per-subtask fallback frequencies; without these data the central efficiency result cannot be verified.

    Authors: The full manuscript contains comparative results, but we acknowledge the presentation lacks the requested granularity. We will revise the Experiments section to add explicit tables reporting success rates and wall-clock / token costs versus baselines, include error bars from multiple runs, provide an ablation isolating the subroutine mining component, and report per-subtask fallback frequencies. These additions will make the efficiency claims directly verifiable. revision: yes
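
The two quantities promised in these responses, per-subtask fallback frequency and error bars over repeated runs, are simple to compute once planning decisions are logged. The sketch below assumes a hypothetical log of `(subtask, mode)` pairs and a list of per-run costs; the numbers are placeholders, not results from the paper.

```python
import statistics
from typing import Dict, List, Tuple


def fallback_frequency(log: List[Tuple[str, str]]) -> Dict[str, float]:
    """Share of attempts per subtask that fell back to slow search.

    `log` holds (subtask, mode) pairs with mode in {"fast", "slow"}.
    """
    totals: Dict[str, int] = {}
    slow: Dict[str, int] = {}
    for subtask, mode in log:
        totals[subtask] = totals.get(subtask, 0) + 1
        if mode == "slow":
            slow[subtask] = slow.get(subtask, 0) + 1
    return {s: slow.get(s, 0) / n for s, n in totals.items()}


def mean_and_stderr(values: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean (0.0 for a single value)."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / len(values) ** 0.5


# Placeholder data for illustration only.
decisions = [
    ("recolor", "fast"), ("recolor", "slow"), ("remove", "fast"),
    ("recolor", "fast"), ("recolor", "fast"),
]
rates = fallback_frequency(decisions)
cost_mean, cost_err = mean_and_stderr([10.0, 12.0, 14.0])
```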

Circularity Check

0 steps flagged

No circularity detected; method is empirically validated against external baselines

full rationale

The paper presents an architectural combination of LLM high-level planning, A* local search, and inductive subroutine mining from prior successful toolpaths. Efficiency and success-rate claims rest on direct comparisons to recent image-editing baselines rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked; subroutine extraction is described as an independent LLM inductive step whose correctness is assessed externally via overall task performance. The derivation chain therefore remains self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard components (LLMs, A* search) plus the domain assumption that subroutine extraction works reliably; no free parameters or invented entities are explicitly introduced beyond the agent design itself.

axioms (1)
  • domain assumption: LLMs can perform reliable inductive reasoning to extract reusable subroutines from successful toolpaths.
    Invoked in the description of continuous extraction/refinement of subroutines for future tasks.

pith-pipeline@v0.9.0 · 5833 in / 1227 out tokens · 47187 ms · 2026-05-19T08:13:20.934886+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

  1. [1]

    Character region awareness for text detection

    Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 9365–9374. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00959

  2. [2]

    Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2023

  3. [3]

    Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

  4. [4]

    Training-free layout control with cross-attention guidance, 2023

    Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance, 2023

  5. [5]

    Imprompter: Tricking llm agents into improper tool use, 2024

    Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use, 2024. URL https://arxiv.org/abs/2410.14923

  6. [6]

    CLOVA: A closed-loop visual assistant with tool usage and update

    Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. CLOVA: A closed-loop visual assistant with tool usage and update. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13258–13268. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01259

  7. [7]

    Google Cloud Vision API, 2024

    Google Cloud. Google Cloud Vision API, 2024. URL https://cloud.google.com/vision. Accessed: January 29, 2025

  8. [8]

    Tora: A tool-integrated reasoning agent for mathematical problem solving

    Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. URL https://arxiv.org/abs/2309.17452

  10. [11]

    CoSTA*: Cost-sensitive toolpath agent for multi-turn image editing, 2025

    Advait Gupta, NandaKiran Velaga, Dang Nguyen, and Tianyi Zhou. CoSTA*: Cost-sensitive toolpath agent for multi-turn image editing, 2025. URL https://arxiv.org/abs/2503.10613

  11. [12]

    Visual programming: Compositional visual reasoning without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14953–14962. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01436

  12. [13]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022. URL https://arxiv.org/abs/2208.01626

  13. [14]

    Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation, 2024

    Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation, 2024

  14. [15]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022

    Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207

  15. [16]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey, 2024. URL https://arxiv.org/abs/2402.02716

  16. [17]

    Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

    Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023.

  17. [18]

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018

  18. [19]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3992–4003. IEEE, 2023. doi: 10.1109/ ICCV510...

  19. [20]

    Segment anything, 2023

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023

  20. [21]

    cwittwer/easyocr: Easyocr, July 2022

    Rakpong Kittinaradorn, Wisuttida Wichitwong, Nart Tlisha, Sumitkumar Sarda, Jeff Potter, Sam_S, Arkya Bagchi, ronaldaug, Nina, Vijayabhaskar, DaeJeong Mun, Mejans, Amit Agarwal, Mijoo Kim, A2va, Abderrahim Mama, Korakot Chaovavanich, Loay, Karol Kucza, Vladimir Gurevich, Márton Tim, Abduroid, Bereket Abraham, Giovani Moutinho, milosjovac, Mohamed Rashad...

  21. [22]

    Deblurgan: Blind motion deblurring using conditional adversarial networks, 2018

    Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks, 2018

  22. [23]

    A review of prominent paradigms for llm-based agents: Tool use (including rag), planning, and feedback learning, 2024

    Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use (including rag), planning, and feedback learning, 2024. URL https://arxiv.org/abs/2406.05804

  23. [24]

    Gligen: Open-set grounded text-to-image generation, 2023

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation, 2023

  24. [25]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024

  25. [26]

    Llms are in-context bandit reinforcement learners, 2025

    Giovanni Monea, Antoine Bosselut, Kianté Brantley, and Yoav Artzi. Llms are in-context bandit reinforcement learners, 2025. URL https://arxiv.org/abs/2410.05362

  26. [27]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022. URL https://arxiv.org/abs/2112.10741

  27. [28]

    Tool learning with large language models: a survey

    Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-rong Wen. Tool learning with large language models: a survey. Frontiers of Computer Science, 19(8), January 2025. ISSN 2095-2236. doi: 10.1007/s11704-024-40678-2. URL http://dx.doi.org/10.1007/s11704-024-40678-2

  28. [29]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machi...

  29. [30]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021

  30. [31]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020

  31. [32]

    High-resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01042.

  32. [33]

    High-resolution image synthesis with latent diffusion models, 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022

  33. [34]

    Photorealistic text-to-image diffusion models with deep language understanding, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022

  34. [35]

    Small llms are weak tool learners: A multi-llm agent, 2024

    Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent, 2024. URL https://arxiv.org/abs/2401.07324

  35. [36]

    A review on generative ai for text-to-image and image-to-image generation and implications to scientific images, 2025

    Zineb Sordo, Eric Chagnon, and Daniela Ushizima. A review on generative ai for text-to-image and image-to-image generation and implications to scientific images, 2025. URL https://arxiv.org/abs/2502.21151

  36. [37]

    Sketch-guided text-to-image diffusion models, 2022

    Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models, 2022

  37. [38]

    Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022

    Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022

  38. [39]

    Real-esrgan: Training real-world blind super-resolution with pure synthetic data

    Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, QC, Canada, October 11-17, 2021, pages 1905–1914. IEEE, 2021. doi: 10.1109/ICCVW54120.2021.00217

  39. [40]

    Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. Deepfont: Identify your font from an image, 2015

  40. [41]

    Genartist: Multimodal llm as an agent for unified image generation and editing, 2024

    Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing, 2024

  41. [42]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

  42. [43]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671

  43. [44]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023. URL https://arxiv.org/abs/2303.11381

  44. [45]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https: //arxiv.org/abs/2210.03629

  45. [46]

    Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024

    Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024

  46. [47]

    Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges, 2024

    Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges, 2024. URL https://arxiv.org/abs/2401.07339

  47. [48]

    Adding conditional control to text-to-image diffusion models, 2023

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.

    Figure 6: Failure case for “High-Level Only” execution versus FaSTA*. For the task “Replace the cat with rabbit”, the initially selected high-level subroutine fails to produce a ...

  48. [49]

    Task Decomposition and Subtask-Tree Generation

    Given an input image x and a natural language instruction u, CoSTA* employs an LLM to decompose the complex request into a sequence of more manageable subtasks. This decomposition results in a subtask tree, G_ss = (V_ss, E_ss).
    • Each node v_i ∈ V_ss corresponds to a specific subtask s_i (e.g., "remove car," "recol...

  49. [50]

    Tool Subgraph Construction

    The abstract subtask tree G_ss is then translated into a concrete Tool Subgraph G_ts = (V_ts, E_ts), which is the actual graph the A* search will operate on.
    • For each subtask node s_i in G_ss, the MDT is consulted to find all tools M(s_i) capable of performing s_i.
    • The TDG is then used to backtrack from these tools to include all nece...

  50. [51]

    Cost-Sensitive A* Search for Optimal Toolpath

    CoSTA* employs an A* search algorithm on the Tool Subgraph G_ts to find an optimal toolpath that balances execution cost and output quality, according to a user-defined trade-off parameter α.
    • Priority Function: The A* search prioritizes nodes (representing tool executions) based on the function f(x) = g(x) + ...

  51. [52]

    A text prompt describing the editing task

  52. [53]

    Detect the pedestrians, remove the car and replacement the cat with rabbit and recolor the dog to pink

    A predefined list of subtasks the model supports (provided below).

    N.5 Supported Subtasks

    Here is the complete list of subtasks available for constructing the subtask chain: Object Detection, Object Segmentation, Object Addition, Object Removal, Background Removal, Landmark Detection, Object Replacement, Image Upscaling, Image Captioning, Changing Scenery...
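
The CoSTA* excerpts in the last few entries (subtask decomposition, tool lookup per subtask, and A* search with a priority of the form f(x) = g(x) + …) can be made concrete with a toy example. The sketch below is a generic A* over a linear subtask chain and is not the paper's formulation: the excerpt truncates the priority function, so the heuristic used here (sum of the cheapest remaining steps) and the α-weighted step cost are assumptions, as are the table contents, timings, and quality scores. Tool dependencies (the TDG backtracking step) are left out.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical tool table: subtask -> {tool: (seconds, quality in [0, 1])}.
MDT: Dict[str, Dict[str, Tuple[float, float]]] = {
    "detect": {"yolo": (1.0, 0.9), "grounding_dino": (3.0, 0.95)},
    "recolor": {"sd_inpaint": (12.0, 0.8), "sd_search_recolor": (8.0, 0.9)},
}


def step_cost(seconds: float, quality: float, alpha: float) -> float:
    """Assumed blend of run time and quality loss; not the paper's formula."""
    return alpha * seconds + (1.0 - alpha) * (1.0 - quality)


def astar_toolpath(subtasks: List[str], alpha: float) -> Tuple[float, List[str]]:
    """Pick one tool per subtask so that the summed step cost is minimal."""
    # Cheapest option per subtask gives an admissible heuristic h(x).
    cheapest = [
        min(step_cost(s, q, alpha) for s, q in MDT[t].values()) for t in subtasks
    ]
    remaining = [sum(cheapest[i:]) for i in range(len(subtasks) + 1)]
    # Heap items: (f = g + h, g, index of next subtask, tools chosen so far).
    frontier: List[Tuple[float, float, int, List[str]]] = [
        (remaining[0], 0.0, 0, [])
    ]
    while frontier:
        _, g, i, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return g, path
        for tool, (seconds, quality) in MDT[subtasks[i]].items():
            g2 = g + step_cost(seconds, quality, alpha)
            heapq.heappush(
                frontier, (g2 + remaining[i + 1], g2, i + 1, path + [tool])
            )
    raise ValueError("no toolpath found")


# alpha = 1.0 weighs only run time; alpha = 0.0 weighs only quality.
fast_cost, fast_path = astar_toolpath(["detect", "recolor"], alpha=1.0)
best_cost, best_path = astar_toolpath(["detect", "recolor"], alpha=0.0)
```

With these made-up numbers, the time-only setting prefers the quicker detector while the quality-only setting prefers the more accurate one, which is the kind of trade-off the α parameter is described as controlling.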