FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing

arxiv: 2506.20911 · v2 · submitted 2025-06-26 · 💻 cs.CV


Advait Gupta, Rishie Raj, Dang Nguyen, Tianyi Zhou

Pith reviewed 2026-05-19 08:13 UTC · model grok-4.3


keywords multi-turn image editing · neurosymbolic agent · subroutine mining · toolpath planning · A* search · large language models · fast-slow planning


The pith

FaSTA* mines reusable subroutines from past toolpaths so LLMs can handle most multi-turn image edits with fast planning before falling back to A* search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FaSTA*, a neurosymbolic agent for multi-turn image editing tasks that combine several instructions such as object detection, recoloring, and removal. Large language models create high-level subtask plans while A* search finds precise, low-cost sequences of AI tool calls for each subtask. The system extracts common subroutines from earlier successful toolpaths through inductive reasoning and reuses them as new tools in later tasks. This produces an adaptive fast-slow loop in which subroutine selection is attempted first and A* search activates only for unfamiliar subtasks. The result is lower computational cost than recent baselines while success rates stay competitive.

Core claim

By continuously mining and refining symbolic subroutines from successful toolpaths, FaSTA* lets LLMs cover the majority of editing subtasks through fast rule-based selection, activating slow A* search only for novel cases, which reduces overall exploration cost on similar subtasks applied to new images.

What carries the argument

Adaptive fast-slow planning that first tries LLM-selected or generated subroutines mined from prior successes and falls back to per-subtask A* search only on failure.
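The fast-slow control flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `fast_slow_plan`, `passes_check`, `slow_search`, and the tool names are hypothetical placeholders, not the authors' code, and the sketch tries stored candidates in order rather than asking an LLM to pick one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Toolpath = List[str]  # a sequence of tool names (hypothetical)


@dataclass
class PlanResult:
    subtask: str
    toolpath: Toolpath
    mode: str  # "fast" if a stored subroutine passed the check, else "slow"


def fast_slow_plan(
    subtasks: List[str],
    subroutines: Dict[str, List[Toolpath]],
    passes_check: Callable[[str, Toolpath], bool],
    slow_search: Callable[[str], Toolpath],
) -> List[PlanResult]:
    """Try stored subroutines first; call the slow search only if none passes."""
    results: List[PlanResult] = []
    for subtask in subtasks:
        chosen: Optional[Toolpath] = None
        for candidate in subroutines.get(subtask, []):
            if passes_check(subtask, candidate):  # stand-in for a quality check
                chosen = candidate
                break
        if chosen is not None:
            results.append(PlanResult(subtask, chosen, "fast"))
        else:
            results.append(PlanResult(subtask, slow_search(subtask), "slow"))
    return results


# Toy usage: one known subroutine for "recolor", none for "remove".
subroutines = {"recolor": [["detect", "segment", "recolor_tool"]]}
plan = fast_slow_plan(
    ["recolor", "remove"],
    subroutines,
    passes_check=lambda subtask, path: subtask == "recolor",
    slow_search=lambda subtask: ["detect", "inpaint"],
)
print([(r.subtask, r.mode) for r in plan])
```

In this toy run the first subtask is served by the stored subroutine and the second falls through to the slow search, which is the behavior the paragraph describes.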

Load-bearing premise

Large language models can reliably extract and refine subroutines from successful toolpaths that stay correct and reusable across similar images and tasks without introducing errors or losing coverage.
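To make the premise concrete, here is a deliberately simple stand-in for subroutine mining. It is not the paper's LLM-based inductive reasoning: it only counts contiguous tool subsequences that recur across successful toolpaths, and the function name, thresholds, and tool names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def mine_subroutines(
    toolpaths: List[List[str]],
    min_len: int = 2,
    max_len: int = 3,
    min_support: int = 2,
) -> Dict[Tuple[str, ...], int]:
    """Return contiguous tool subsequences seen in at least `min_support` toolpaths."""
    counts: Counter = Counter()
    for path in toolpaths:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(path) - n + 1):
                seen.add(tuple(path[i:i + n]))
        counts.update(seen)  # each subsequence counts once per toolpath
    return {seq: c for seq, c in counts.items() if c >= min_support}


# Hypothetical successful toolpaths from earlier tasks.
paths = [
    ["detect", "segment", "recolor"],
    ["detect", "segment", "inpaint"],
    ["caption", "detect", "segment", "recolor"],
]
mined = mine_subroutines(paths)
for seq, support in sorted(mined.items(), key=lambda kv: -kv[1]):
    print(support, " -> ".join(seq))
```

A frequency count like this cannot tell whether a recurring sequence is actually correct on a new image, which is exactly the risk the premise points at.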

What would settle it

Measure whether disabling subroutine mining on a new collection of multi-turn editing tasks causes computation time to rise sharply or success rate to fall compared with the full FaSTA* system.
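One way to organize such a comparison is sketched below. The `Run` record, the helper names, and the numbers are placeholders invented for illustration; they are not results from the paper.

```python
from statistics import mean
from typing import Dict, List, NamedTuple


class Run(NamedTuple):
    seconds: float  # wall-clock time for one task
    success: bool   # whether the edit was judged successful


def summarize(runs: List[Run]) -> Dict[str, float]:
    """Mean time and success rate over a list of runs."""
    return {
        "mean_seconds": mean(r.seconds for r in runs),
        "success_rate": sum(r.success for r in runs) / len(runs),
    }


def compare(full: List[Run], ablated: List[Run]) -> Dict[str, float]:
    """Time ratio above 1 means the ablated system is slower than the full one."""
    f, a = summarize(full), summarize(ablated)
    return {
        "time_ratio": a["mean_seconds"] / f["mean_seconds"],
        "success_delta": a["success_rate"] - f["success_rate"],
    }


# Placeholder numbers only, not measurements.
full_runs = [Run(10.0, True), Run(20.0, True)]
ablated_runs = [Run(30.0, True), Run(30.0, False)]
report = compare(full_runs, ablated_runs)
print(report)
```

A time ratio well above 1 or a clearly negative success delta on held-out tasks would support the claim that the mined subroutines carry the savings.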

Figures

Figures reproduced from arXiv: 2506.20911 by Advait Gupta, Dang Nguyen, Rishie Raj, Tianyi Zhou.

**Figure 1.** Inductive Reasoning of Reusable Subroutines. Left: Reuse rate (% of applicable subtasks where a subroutine was utilized) of the top-5 learned subroutines. Right: Success rate (%) of fast planning (subroutines only, without A* search) on subtasks for a held-out test set of tasks. It increases exponentially as more reusable subroutines are extracted from an increasing number of explored tasks.

**Figure 2.** Top: Online learning (induction) and refinement of reusable subroutines from explored toolpaths for previous tasks. Bottom: Adaptive fast-slow planning framework in FaSTA*. Given a new task, FaSTA* first uses an LLM to generate a high-level plan of subtasks and then select a subroutine per subtask, yielding a fast plan. Only when the subroutine's output does not pass the quality check by VLMs, a slow plan…

**Figure 3.** Execution time (seconds) per image. FaSTA*…

**Figure 4.** Qualitative comparison of FaSTA* with CoSTA* [9] and other leading image editing agents for complex multi-turn tasks. FaSTA* achieves visual results identical to CoSTA* and significantly surpasses other baselines in accuracy and coherence. Notably, FaSTA* delivers this high quality at roughly half the execution cost of CoSTA*, highlighting its superior efficiency.

**Figure 5.** Cost-Quality Pareto Frontier. FaSTA* with various α values against CoSTA* and other baselines. FaSTA* achieves a superior frontier, offering better cost-quality trade-offs.

**Figure 6.** Failure case for “High-Level Only” execution versus FaSTA*. For the task “Replace the cat with rabbit”, the initially selected high-level subroutine fails to produce a…

**Figure 7.** Qualitative examples of FaSTA*'s performance on sample tasks from the MagicBrush dataset [45]. These tasks were processed using the Subroutine Rule Table learned from non-benchmark data. Notably, all examples shown were successfully completed by FaSTA* relying entirely on its “fast plan” composed of learned subroutines, without needing to resort to the “slow planning” via A* search for any subtask.

**Figure 8.** Example demonstrating FaSTA*'s subroutine effectiveness. FaSTA* uses learned rules to select optimal paths (e.g., SD Search&Recolor for the small ball in row 1, avoiding SD Inpaint's potential failure), achieving results identical to CoSTA* at significantly lower average cost (15.21s vs. 25.32s for these examples) by preventing unnecessary exploration of suboptimal paths.

**Figure 9.** Distribution of total manipulations (subtask occurrences) across the specialized test datasets

**Figure 10.** Illustration of subtask-specific dataset generation. A single base image from the CoSTA*…

**Figure 11.** Visual example for the object recoloration trace detailed in Appendix…

**Figure 12.** Reuse rate (%) for all learned subroutines. The rate for each subroutine is calculated based…

**Figure 13.** Qualitative comparison of FaSTA* against CoSTA* [9] and Gemini 2.0 Flash Preview for complex multi-turn editing tasks (inputs on top). FaSTA* produces high-quality outputs visually identical to CoSTA* and superior to Gemini. Notably, FaSTA* achieves this CoSTA*-level quality at a significantly reduced execution cost, demonstrating its enhanced efficiency.
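The cost-quality Pareto frontier named in Figure 5 can be illustrated with a short sketch. The system names and (cost, quality) values below are made-up placeholders, not numbers from the paper; the only assumption about the metric is that lower cost and higher quality are better.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (name, cost, quality)


def pareto_frontier(points: List[Point]) -> List[str]:
    """Names of points not dominated by any other (lower cost, higher quality)."""
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder systems, for illustration only.
points = [
    ("A", 10.0, 0.90),
    ("B", 20.0, 0.95),
    ("C", 25.0, 0.80),
    ("D", 10.0, 0.85),
]
print(pareto_frontier(points))
```

Here C and D are dominated by A, while A and B each offer a trade-off the other does not, so both stay on the frontier.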

read the original abstract

We develop a cost-efficient neurosymbolic agent to address challenging multi-turn image editing tasks such as “Detect the bench in the image while recoloring it to pink. Also, remove the cat for a clearer view and recolor the wall to yellow.” It combines the fast, high-level subtask planning by large language models (LLMs) with the slow, accurate, tool-use, and local A* search per subtask to find a cost-efficient toolpath -- a sequence of calls to AI tools. To save the cost of A* on similar subtasks, we perform inductive reasoning on previously successful toolpaths via LLMs to continuously extract/refine frequently used subroutines and reuse them as new tools for future tasks in an adaptive fast-slow planning, where the higher-level subroutines are explored first, and only when they fail, the low-level A* search is activated. The reusable symbolic subroutines considerably save exploration cost on the same types of subtasks applied to similar images, yielding a human-like fast-slow toolpath agent “FaSTA*”: fast subtask planning followed by rule-based subroutine selection per subtask is attempted by LLMs at first, which is expected to cover most tasks, while slow A* search is only triggered for novel and challenging subtasks. By comparing with recent image editing approaches, we demonstrate FaSTA* is significantly more computationally efficient while remaining competitive with the state-of-the-art baseline in terms of success rate. Our code and data can be accessed at https://github.com/tianyi-lab/FaSTAR.
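The abstract pairs LLM planning with a local A* search per subtask. A minimal sketch of A* over a small tool graph may help fix the idea. The graph, tool names, edge costs, and the `a_star` helper are all hypothetical and are not taken from the paper or its repository. With no heuristic supplied the search reduces to uniform-cost search, which is still a valid, if uninformed, instance of A*.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(next_node, edge_cost)]


def a_star(
    graph: Graph,
    start: str,
    goal: str,
    heuristic: Optional[Dict[str, float]] = None,
) -> Optional[Tuple[List[str], float]]:
    """Return (path, cost) of a cheapest path, or None if the goal is unreachable."""
    h = heuristic or {}  # missing entries default to 0, which is admissible
    frontier = [(h.get(start, 0.0), 0.0, start, [start])]  # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if g > best_g.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, cost in graph.get(node, []):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                f = new_g + h.get(nxt, 0.0)
                heapq.heappush(frontier, (f, new_g, nxt, path + [nxt]))
    return None


# Hypothetical tool graph for an object-removal subtask; costs are arbitrary.
graph: Graph = {
    "input": [("detector", 1.0)],
    "detector": [("segmenter", 2.0), ("inpainter_a", 9.0)],
    "segmenter": [("inpainter_a", 5.0), ("inpainter_b", 3.0)],
    "inpainter_a": [("done", 0.0)],
    "inpainter_b": [("done", 0.0)],
}
result = a_star(graph, "input", "done")
print(result)
```

The returned path is the toolpath in the abstract's sense: an ordered sequence of tool calls with the lowest accumulated cost under the assumed edge costs.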

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

FaSTA* mixes LLM subtask planning with A* toolpath search and LLM-mined subroutines to cut compute in multi-turn image editing, but the mining step's reliability is the part that needs checking.

read the letter

This paper's main contribution is a neurosymbolic setup for multi-turn image editing agents that uses LLMs for quick subtask planning and subroutine mining, with A* search as a fallback for harder cases. The integration of fast LLM planning, slow per-subtask A* search, and continuous extraction of subroutines from toolpaths stands out as a practical synthesis. It aims to mimic human-like efficiency by reusing common sequences instead of searching from scratch every time. The authors back this with comparisons showing lower compute costs and competitive success rates against recent methods. Releasing code and data is a plus for reproducibility.

The weakest part is the assumption that LLM-based inductive reasoning on past toolpaths will produce reliable, reusable subroutines. LLMs might overfit to seen cases or hallucinate invalid sequences, which could erode the efficiency gains or reduce performance on new tasks. The provided abstract lacks specific numbers on subroutine usage frequency, error rates in mining, or ablations isolating that component, so the central claim is difficult to assess fully from the summary alone.

This work targets researchers in computer vision and AI agents focused on tool use and cost reduction. A reader interested in hybrid planning methods would get useful ideas here. It deserves serious peer review to verify the implementation and results in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FaSTA*, a neurosymbolic agent for multi-turn image editing tasks that combines LLM-based high-level subtask planning with per-subtask A* search to generate cost-efficient toolpaths. It adds an inductive subroutine mining step in which LLMs extract and refine reusable symbolic subroutines from previously successful toolpaths, enabling a fast-slow adaptive planner that prefers mined subroutines and falls back to A* only for novel cases. The central claim is that this yields significantly lower computational cost than recent baselines while remaining competitive in success rate.

Significance. If the subroutine mining step produces correct, generalizable subroutines that cover most recurring subtasks, the work would demonstrate a practical way to reduce the expense of repeated A* searches in iterative vision-language tool use, advancing hybrid fast-slow neurosymbolic agents. The public release of code and data at the cited GitHub repository is a clear strength that supports reproducibility.

major comments (2)

[Method (subroutine mining procedure)] The efficiency claim rests on the assumption that LLM inductive reasoning on successful toolpaths yields subroutines that are both correct and sufficiently broad to avoid frequent fallback to A*; no formal verification, manual inspection protocol, or error-rate analysis of the mined subroutines is described, leaving open the possibility that over-generalization or invalid sequences would erode the reported savings.
[Abstract and Experiments section] Abstract and experimental claims state that FaSTA* is 'significantly more computationally efficient' while 'competitive with the state-of-the-art baseline in terms of success rate,' yet the provided text contains no quantitative tables, ablation isolating mining quality, error bars, or per-subtask fallback frequencies; without these data the central efficiency result cannot be verified.

minor comments (2)

[Introduction] The notation distinguishing 'subroutine' from 'toolpath' and 'subtask' should be defined once at first use and used consistently thereafter.
[Method] A brief discussion of how mined subroutines are stored, indexed, and retrieved (e.g., as new callable tools) would improve clarity of the adaptive planner.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional details and data where appropriate.

read point-by-point responses

Referee: [Method (subroutine mining procedure)] The efficiency claim rests on the assumption that LLM inductive reasoning on successful toolpaths yields subroutines that are both correct and sufficiently broad to avoid frequent fallback to A*; no formal verification, manual inspection protocol, or error-rate analysis of the mined subroutines is described, leaving open the possibility that over-generalization or invalid sequences would erode the reported savings.

Authors: We agree that the manuscript would benefit from explicit validation of the mined subroutines. The current description relies on end-to-end task success to imply subroutine quality. In the revised version, we will add a dedicated subsection describing our manual inspection protocol (sampling 50 mined subroutines and checking for syntactic validity, semantic correctness on held-out images, and generality), report the observed error rate, and include per-task fallback frequencies to A* to quantify subroutine coverage. revision: yes
Referee: [Abstract and Experiments section] Abstract and experimental claims state that FaSTA* is 'significantly more computationally efficient' while 'competitive with the state-of-the-art baseline in terms of success rate,' yet the provided text contains no quantitative tables, ablation isolating mining quality, error bars, or per-subtask fallback frequencies; without these data the central efficiency result cannot be verified.

Authors: The full manuscript contains comparative results, but we acknowledge the presentation lacks the requested granularity. We will revise the Experiments section to add explicit tables reporting success rates and wall-clock / token costs versus baselines, include error bars from multiple runs, provide an ablation isolating the subroutine mining component, and report per-subtask fallback frequencies. These additions will make the efficiency claims directly verifiable. revision: yes

Circularity Check

0 steps flagged

No circularity detected; method is empirically validated against external baselines

full rationale

The paper presents an architectural combination of LLM high-level planning, A* local search, and inductive subroutine mining from prior successful toolpaths. Efficiency and success-rate claims rest on direct comparisons to recent image-editing baselines rather than any mathematical derivation, fitted parameter renamed as prediction, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked; subroutine extraction is described as an independent LLM inductive step whose correctness is assessed externally via overall task performance. The derivation chain therefore remains self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard components (LLMs, A* search) plus the domain assumption that subroutine extraction works reliably; no free parameters or invented entities are explicitly introduced beyond the agent design itself.

axioms (1)

domain assumption: LLMs can perform reliable inductive reasoning to extract reusable subroutines from successful toolpaths.
Invoked in the description of continuous extraction/refinement of subroutines for future tasks.

pith-pipeline@v0.9.0 · 5833 in / 1227 out tokens · 47187 ms · 2026-05-19T08:13:20.934886+00:00 · methodology


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 9365–9374. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00959
[2] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2023
[3] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023
[4] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance, 2023
[5] Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use, 2024. URL https://arxiv.org/abs/2410.14923
[6] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. CLOVA: A closed-loop visual assistant with tool usage and update. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13258–13268. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01259
[7] Google Cloud. Google Cloud Vision API, 2024. URL https://cloud.google.com/vision. Accessed: January 29, 2025
[8] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. URL https://arxiv.org/abs/2309.17452
[11] Advait Gupta, NandaKiran Velaga, Dang Nguyen, and Tianyi Zhou. Costa*: Cost-sensitive toolpath agent for multi-turn image editing, 2025. URL https://arxiv.org/abs/2503.10613
[12] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14953–14962. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01436
[13] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022. URL https://arxiv.org/abs/2208.01626
[14] Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation, 2024
[15] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207
[16] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey, 2024. URL https://arxiv.org/abs/2402.02716
[17] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023
[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018
[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3992–4003. IEEE, 2023. doi: 10.1109/ICCV510...
[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023
[21] Rakpong Kittinaradorn, Wisuttida Wichitwong, Nart Tlisha, Sumitkumar Sarda, Jeff Potter, Sam_S, Arkya Bagchi, ronaldaug, Nina, Vijayabhaskar, DaeJeong Mun, Mejans, Amit Agarwal, Mijoo Kim, A2va, Abderrahim Mama, Korakot Chaovavanich, Loay, Karol Kucza, Vladimir Gurevich, Márton Tim, Abduroid, Bereket Abraham, Giovani Moutinho, milosjovac, Mohamed Rashad... cwittwer/easyocr: Easyocr, July 2022
[22] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks, 2018
[23] Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use (including rag), planning, and feedback learning, 2024. URL https://arxiv.org/abs/2406.05804
[24] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation, 2023
[25] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024
[26] Giovanni Monea, Antoine Bosselut, Kianté Brantley, and Yoav Artzi. Llms are in-context bandit reinforcement learners, 2025. URL https://arxiv.org/abs/2410.05362
[27] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022. URL https://arxiv.org/abs/2112.10741
[28] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-rong Wen. Tool learning with large language models: a survey. Frontiers of Computer Science, 19(8), January 2025. ISSN 2095-2236. doi: 10.1007/s11704-024-40678-2. URL http://dx.doi.org/10.1007/s11704-024-40678-2
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machi...
[30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021
[31] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020
[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01042
[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022
[34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022
[35] Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent, 2024. URL https://arxiv.org/abs/2401.07324
[36] Zineb Sordo, Eric Chagnon, and Daniela Ushizima. A review on generative ai for text-to-image and image-to-image generation and implications to scientific images, 2025. URL https://arxiv.org/abs/2502.21151
[37] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models, 2022
[38] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022
[39] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, QC, Canada, October 11-17, 2021, pages 1905–1914. IEEE, 2021. doi: 10.1109/ICCVW54120.2021.00217
[40] Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. Deepfont: Identify your font from an image, 2015
[41] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing, 2024
[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903
[43] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671
[44] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023. URL https://arxiv.org/abs/2303.11381
[45] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models, 2023. URL https://arxiv.org/abs/2210.03629
[46] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing, 2024
[47] Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. Codeagent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges, 2024. URL https://arxiv.org/abs/2401.07339
[48] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023
[49]

remove car,

Task Decomposition and Subtask-Tree Generation Given an input image x and a natural language instruction u, CoSTA∗ employs an LLM to decompose the complex request into a sequence of more manageable subtasks. This decomposition results in a subtask tree, Gss = (Vss, Ess). • Each node vi ∈ Vss corresponds to a specific subtask si (e.g., "remove car," "recol...

work page
[50]

• For each subtask nodesi in Gss, the MDT is consulted to find all toolsM(si) capable of performing si

Tool Subgraph Construction The abstract subtask tree Gss is then translated into a concrete Tool Subgraph Gts = (Vts, Ets), which is the actual graph the A∗ search will operate on. • For each subtask nodesi in Gss, the MDT is consulted to find all toolsM(si) capable of performing si. • The TDG is then used to backtrack from these tools to include all nece...

Cost-Sensitive A∗ Search for Optimal Toolpath

CoSTA∗ employs an A∗ search algorithm on the Tool Subgraph $G_{ts}$ to find an optimal toolpath that balances execution cost and output quality, according to a user-defined trade-off parameter $\alpha$.

• Priority Function: The A∗ search prioritizes nodes (representing tool executions) based on the function f(x) = g(x) + ...
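A generic A∗ loop over a small tool graph can illustrate how a trade-off parameter steers the chosen toolpath. Everything specific here is an assumption for illustration: the edge cost `alpha * time + (1 - alpha) * (1 - quality)` is a hypothetical blend and not the paper's cost function (which is truncated above), the tool names and numbers are invented, and the heuristic defaults to zero, which makes the search behave like uniform-cost search.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool graph: node -> list of (next_node, time_cost, quality in [0, 1]).
Graph = Dict[str, List[Tuple[str, float, float]]]


def edge_cost(time_cost: float, quality: float, alpha: float) -> float:
    """Assumed blend of execution cost and quality loss (not the paper's formula)."""
    return alpha * time_cost + (1.0 - alpha) * (1.0 - quality)


def a_star(
    graph: Graph,
    start: str,
    goal: str,
    alpha: float,
    heuristic: Callable[[str], float] = lambda node: 0.0,
) -> Optional[List[str]]:
    """Return the lowest-cost path from start to goal, or None if unreachable."""
    # Each frontier entry is (f = g + h, g, path so far).
    frontier: List[Tuple[float, float, List[str]]] = [(heuristic(start), 0.0, [start])]
    best_g: Dict[str, float] = {start: 0.0}
    while frontier:
        _, g, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:
            return path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was already found
        for nxt, time_cost, quality in graph.get(node, []):
            new_g = g + edge_cost(time_cost, quality, alpha)
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + heuristic(nxt), new_g, path + [nxt]))
    return None


TOOLS: Graph = {
    "input": [("fast_detector", 1.0, 0.7), ("accurate_detector", 3.0, 0.9)],
    "fast_detector": [("inpainter", 2.0, 0.8)],
    "accurate_detector": [("inpainter", 2.0, 0.8)],
}

cheap_path = a_star(TOOLS, "input", "inpainter", alpha=1.0)    # cost only
quality_path = a_star(TOOLS, "input", "inpainter", alpha=0.0)  # quality only
```

With `alpha=1.0` the search routes through `fast_detector`, and with `alpha=0.0` it routes through `accurate_detector`, which is the kind of cost-versus-quality switch the trade-off parameter is meant to control.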

A text prompt describing the editing task

Detect the pedestrians, remove the car and replacement the cat with rabbit and recolor the dog to pink

A predefined list of subtasks the model supports (provided below).

N.5 Supported Subtasks

Here is the complete list of subtasks available for constructing the subtask chain: Object Detection, Object Segmentation, Object Addition, Object Removal, Background Removal, Landmark Detection, Object Replacement, Image Upscaling, Image Captioning, Changing Scenery...

[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. Character region awareness for text detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 9365–9374. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00959

[2] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions, 2023

[3] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023

[4] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance, 2023

[5] Xiaohan Fu, Shuheng Li, Zihan Wang, Yihao Liu, Rajesh K. Gupta, Taylor Berg-Kirkpatrick, and Earlence Fernandes. Imprompter: Tricking llm agents into improper tool use, 2024. URL https://arxiv.org/abs/2410.14923

[6] Zhi Gao, Yuntao Du, Xintong Zhang, Xiaojian Ma, Wenjuan Han, Song-Chun Zhu, and Qing Li. CLOVA: A closed-loop visual assistant with tool usage and update. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 13258–13268. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01259

[7] Google Cloud. Google Cloud Vision API, 2024. URL https://cloud.google.com/vision. Accessed: January 29, 2025

[8] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. Tora: A tool-integrated reasoning agent for mathematical problem solving. URL https://arxiv.org/abs/2309.17452

[11] Advait Gupta, NandaKiran Velaga, Dang Nguyen, and Tianyi Zhou. Costa∗: Cost-sensitive toolpath agent for multi-turn image editing, 2025. URL https://arxiv.org/abs/2503.10613

[12] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 14953–14962. IEEE, 2023. doi: 10.1109/CVPR52729.2023.01436

[13] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022. URL https://arxiv.org/abs/2208.01626

[14] Minbin Huang, Yanxin Long, Xinchi Deng, Ruihang Chu, Jiangfeng Xiong, Xiaodan Liang, Hong Cheng, Qinglin Lu, and Wei Liu. Dialoggen: Multi-modal interactive dialogue system for multi-turn text-to-image generation, 2024

[15] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207

[16] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey, 2024. URL https://arxiv.org/abs/2402.02716

[17] Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, and Ying Shan. Smartedit: Exploring complex instruction-based image editing with multimodal large language models, 2023

[18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks, 2018

[19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloé Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3992–4003. IEEE, 2023. doi: 10.1109/ICCV510...

[20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023

[21] Rakpong Kittinaradorn, Wisuttida Wichitwong, Nart Tlisha, Sumitkumar Sarda, Jeff Potter, Sam_S, Arkya Bagchi, ronaldaug, Nina, Vijayabhaskar, DaeJeong Mun, Mejans, Amit Agarwal, Mijoo Kim, A2va, Abderrahim Mama, Korakot Chaovavanich, Loay, Karol Kucza, Vladimir Gurevich, Márton Tim, Abduroid, Bereket Abraham, Giovani Moutinho, milosjovac, Mohamed Rashad... cwittwer/easyocr: Easyocr, July 2022

[22] Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. Deblurgan: Blind motion deblurring using conditional adversarial networks, 2018

[23] Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use (including rag), planning, and feedback learning, 2024. URL https://arxiv.org/abs/2406.05804

[24] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation, 2023

[25] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2024

[26] Giovanni Monea, Antoine Bosselut, Kianté Brantley, and Yoav Artzi. Llms are in-context bandit reinforcement learners, 2025. URL https://arxiv.org/abs/2410.05362

[27] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models, 2022. URL https://arxiv.org/abs/2112.10741

[28] Changle Qu, Sunhao Dai, Xiaochi Wei, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Jun Xu, and Ji-rong Wen. Tool learning with large language models: a survey. Frontiers of Computer Science, 19(8), January 2025. ISSN 2095-2236. doi: 10.1007/s11704-024-40678-2. URL http://dx.doi.org/10.1007/s11704-024-40678-2

[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machi...

[30] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. CoRR, abs/2102.12092, 2021

[31] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer, 2020

[32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10674–10685. IEEE, 2022. doi: 10.1109/CVPR52688.2022.01042

[33] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022

[34] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022

[35] Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent, 2024. URL https://arxiv.org/abs/2401.07324

[36] Zineb Sordo, Eric Chagnon, and Daniela Ushizima. A review on generative ai for text-to-image and image-to-image generation and implications to scientific images, 2025. URL https://arxiv.org/abs/2502.21151

[37] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. Sketch-guided text-to-image diffusion models, 2022

[38] Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022

[39] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW 2021, Montreal, QC, Canada, October 11-17, 2021, pages 1905–1914. IEEE, 2021. doi: 10.1109/ICCVW54120.2021.00217

[40] Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. Deepfont: Identify your font from an image, 2015

[41] Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing, 2024

[42] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023. URL https://arxiv.org/abs/2201.11903

[43] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671

[44] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023. URL https://arxiv.org/abs/2303.11381
