Recognition: 2 theorem links · Lean Theorem
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Pith reviewed 2026-05-13 22:46 UTC · model grok-4.3
The pith
Visual ChatGPT lets users chat with images by linking ChatGPT to visual foundation models through prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By designing prompts that inject information about multiple visual foundation models into ChatGPT, the system enables users to send images, pose complex visual questions, issue multi-step editing instructions, and receive corrected results through iterative feedback, all while ChatGPT orchestrates the collaboration of the visual models.
What carries the argument
A series of prompts that describe each visual model's inputs, outputs, and feedback requirements so ChatGPT can select and sequence them for multi-step visual tasks.
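The mechanism can be sketched as a small tool registry rendered into a system prompt. This is a minimal illustration, not the paper's actual prompts: the `VisualTool` fields, tool names, and wording below are hypothetical stand-ins for the capability descriptions the paper injects.

```python
from dataclasses import dataclass

@dataclass
class VisualTool:
    """Hypothetical description of one visual foundation model."""
    name: str         # tool identifier ChatGPT can name in a plan
    inputs: str       # what the tool consumes
    outputs: str      # what the tool produces
    when_to_use: str  # selection hint for the language model

def build_system_prompt(tools):
    """Render tool capabilities as text ChatGPT can reason over."""
    lines = [
        "You can call the following visual tools.",
        "Pick one tool per step; chain steps for multi-step edits.",
    ]
    for t in tools:
        lines.append(f"- {t.name}: takes {t.inputs}, returns {t.outputs}. "
                     f"Use when {t.when_to_use}.")
    return "\n".join(lines)

tools = [
    VisualTool("ImageCaptioner", "an image path", "a text caption",
               "the user asks what an image shows"),
    VisualTool("InstructPix2Pix", "an image path and an edit instruction",
               "an edited image path", "the user asks to change an image"),
]
prompt = build_system_prompt(tools)
```

The key design point carried by the paper is that everything after the first two lines is data, so new visual models can be added by registering another description rather than retraining anything.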
If this is right
- Users can iterate on images by sending feedback and receiving revised outputs in the same conversation.
- Complex visual questions become solvable by breaking them into steps that use different models for generation, detection, or editing.
- The same prompt approach can be applied to other visual models as they are developed.
- The conversation can return images produced by the visual models, so ChatGPT is no longer limited to describing them.
Where Pith is reading between the lines
- The method could generalize to audio or video models by writing similar capability descriptions.
- It suggests prompt engineering might serve as a lightweight way to add new modalities to existing language models.
- Future systems might combine this with user-provided examples to further reduce selection mistakes.
Load-bearing premise
ChatGPT will correctly decompose visual requests and pick the right sequence of models from the prompt descriptions without frequent errors.
What would settle it
A set of ten multi-step visual editing instructions: if the system repeatedly chooses the wrong model or fails to chain steps correctly, the prompt injection does not produce reliable collaboration.
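Such a test could be scored with a simple selection-error harness. Everything here is illustrative: `toy_planner`, the tool names, and the planner interface are hypothetical stand-ins, not the paper's system.

```python
def selection_error_rate(orchestrate, cases):
    """Fraction of instructions where the planned tool chain deviates
    from the expected sequence. `orchestrate` maps an instruction to
    an ordered list of tool names (hypothetical interface)."""
    errors = sum(1 for instruction, expected in cases
                 if orchestrate(instruction) != expected)
    return errors / len(cases)

def toy_planner(instruction):
    """Keyword-based stand-in for ChatGPT's tool selection."""
    plan = []
    if "describe" in instruction:
        plan.append("ImageCaptioner")
    if "remove" in instruction or "replace" in instruction:
        plan.append("InstructPix2Pix")
    return plan

cases = [
    ("describe the photo", ["ImageCaptioner"]),
    ("remove the dog", ["InstructPix2Pix"]),
    ("describe the photo then remove the dog",
     ["ImageCaptioner", "InstructPix2Pix"]),
]
rate = selection_error_rate(toy_planner, cases)  # 0.0 on this toy set
```

A real version of the test would replace `toy_planner` with the deployed system and expand `cases` to the ten multi-step editing instructions described above.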
read the original abstract
ChatGPT is attracting cross-field interest as it provides a language interface with remarkable conversational competency and reasoning capabilities across many domains. However, since ChatGPT is trained on language, it is currently not capable of processing or generating images from the visual world. At the same time, Visual Foundation Models, such as Visual Transformers or Stable Diffusion, although showing great visual understanding and generation capabilities, are only experts on specific tasks with one-round fixed inputs and outputs. To this end, we build a system called Visual ChatGPT, incorporating different Visual Foundation Models, to enable the user to interact with ChatGPT by 1) sending and receiving not only languages but also images, 2) providing complex visual questions or visual editing instructions that require the collaboration of multiple AI models with multi-steps, and 3) providing feedback and asking for corrected results. We design a series of prompts to inject the visual model information into ChatGPT, considering models of multiple inputs/outputs and models that require visual feedback. Experiments show that Visual ChatGPT opens the door to investigating the visual roles of ChatGPT with the help of Visual Foundation Models. Our system is publicly available at https://github.com/microsoft/visual-chatgpt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Visual ChatGPT, a system integrating ChatGPT with multiple Visual Foundation Models (e.g., for image understanding and generation) via prompt engineering. This enables users to send/receive images, pose complex multi-step visual queries requiring model collaboration, and provide feedback for iterative corrections. The work is demonstrated through qualitative interaction examples and released as open-source code.
Significance. If the system performs as described, it provides a practical demonstration of extending conversational language models to visual domains through orchestration of existing foundation models. This could accelerate research on multimodal interfaces. The public code release supports reproducibility and further experimentation, which is a clear strength for a system paper.
major comments (1)
- [Experiments] Experiments section: claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.
minor comments (1)
- [Method] The description of input/output typing and feedback loops in the prompt design could be expanded with a concrete example or pseudocode to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive summary of our work and the recommendation for minor revision. We address the single major comment below.
read point-by-point responses
-
Referee: [Experiments] Experiments section: claims of effective multi-step collaboration and reliable model orchestration rest entirely on qualitative examples. No quantitative metrics, ablation studies on prompt variants, error rates for task decomposition, or failure analysis are provided, which limits assessment of the prompt-injection approach's robustness.
Authors: We agree that the current manuscript presents its results exclusively through qualitative interaction examples. Visual ChatGPT is positioned as a system demonstration paper whose primary contribution is the prompt-engineering framework that enables ChatGPT to orchestrate multiple visual foundation models for multi-turn, multi-modal tasks. Because the tasks are open-ended and conversational, standard quantitative metrics (e.g., task-completion accuracy or error rates) are difficult to define without introducing arbitrary task taxonomies that would themselves require extensive validation. We therefore did not include ablations or numerical benchmarks in the original submission. In the revised version we will add a dedicated subsection under Experiments that (1) enumerates representative failure modes we observed during development, (2) describes the prompt-design heuristics we adopted and why certain variants were rejected, and (3) provides a qualitative robustness analysis based on the released code. This addition will make the limitations of the prompt-injection approach more transparent while preserving the paper's system-oriented scope.
Revision: yes
Circularity Check
No significant circularity
full rationale
The paper presents an engineering system for integrating existing visual foundation models with ChatGPT through prompt injection and multi-step orchestration. No mathematical derivations, fitted parameters, or predictions appear in the text. The core contribution is the prompt design and architecture, supported by qualitative examples and public code release rather than any self-referential theorem, uniqueness claim, or input-to-output reduction. All components rely on external pre-trained models without load-bearing self-citations or ansatz smuggling.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ChatGPT can effectively follow complex instructions and coordinate tool use via prompts
Forward citations
Cited by 25 Pith papers
-
Cross-Modal Backdoors in Multimodal Large Language Models
Poisoning a single connector in MLLMs establishes a reusable latent backdoor pathway that transfers across modalities with over 95% attack success rate under bounded perturbations.
-
Probing Visual Planning in Image Editing Models
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
CAMEO: A Conditional and Quality-Aware Multi-Agent Image Editing Orchestrator
CAMEO uses coordinated agents for planning, prompting, generation, and quality feedback to achieve higher structural reliability in conditional image editing than single-step models.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models
RaTA-Tool retrieves suitable external tools for multimodal queries by matching generated task descriptions against tool metadata, supported by a new Hugging Face-derived dataset and DPO optimization.
-
ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution
ToolOmni combines supervised fine-tuning on a cold-start multi-turn dataset with Decoupled Multi-Objective GRPO to enable proactive retrieval and grounded execution, yielding +10.8% higher end-to-end tool-use success ...
-
Towards Long-horizon Agentic Multimodal Search
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp a...
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
MiniGPT-4 shows that aligning a frozen vision encoder to Vicuna via one projection layer plus a second-stage detailed-description fine-tune produces GPT-4-like vision-language abilities including detailed captions, cr...
-
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
-
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
-
MIRAGE: A Micro-Interaction Relational Architecture for Grounded Exploration in Multi-Figure Artworks
MIRAGE improves VLM analysis of multi-figure art by inserting a verifiable structured representation of micro-interactions between spatial grounding and narrative output.
-
Self-Reasoning Agentic Framework for Narrative Product Grid-Collage Generation
A self-reasoning agentic framework constructs a Product Narrative Framework, generates constraint-aware unified grid collages, and refines outputs via failure attribution to improve narrative coherence and aesthetics ...
-
Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
-
Understanding the planning of LLM agents: A survey
A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[2]
Vqa: Visual question answering
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015
work page 2015
-
[3]
Vlmo: Unified vision-language pre-training with mixture-of-modality-experts
Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021
-
[4]
Instructpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
work page 2020
-
[6]
Realtime multi-person 2d pose estimation using part affinity fields
Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017
work page 2017
- [7]
-
[8]
Visualgpt: Data-efficient adaptation of pretrained language models for image captioning
Jun Chen, Han Guo, Kai Yi, Boyang Li, and Mohamed Elhoseiny. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18030–18040, 2022
work page 2022
-
[9]
Uniter: Universal image-text representation learning
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020
work page 2020
-
[10]
Per-pixel classification is not all you need for semantic segmentation
Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021
work page 2021
-
[11]
Commonsense reasoning and commonsense knowledge in artificial intelligence
Ernest Davis and Gary Marcus. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM, 58(9):92–103, 2015
work page 2015
-
[12]
Magma–multimodal augmentation of generative models through adapter-based finetuning
Constantin Eichenberg, Sidney Black, Samuel Weinbach, Letitia Parcalabescu, and Anette Frank. Magma–multimodal augmentation of generative models through adapter-based finetuning. arXiv preprint arXiv:2112.05253, 2021
-
[13]
Violet: End-to-end video-language transformers with masked visual-token modeling
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021
-
[14]
Large-scale adversarial training for vision-and-language representation learning
Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020
work page 2020
-
[15]
Semantic compositional networks for visual captioning
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. Semantic compositional networks for visual captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5630–5639, 2017
work page 2017
-
[16]
Towards light-weight and real-time line segment detection
Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards light-weight and real-time line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 726–734, 2022
work page 2022
-
[17]
Parameter-efficient transfer learning for nlp
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019
work page 2019
-
[18]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017
work page 2017
-
[19]
Bert: Pre-training of deep bidirectional transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019
work page 2019
-
[20]
Large language models are zero-shot reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[21]
Manigan: Text-guided image manipulation
Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7880–7889, 2020
work page 2020
-
[22]
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023
work page 2023
-
[23]
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022
work page 2022
-
[24]
Uniformer: Unifying convolution and self-attention for visual recognition
Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, and Yu Qiao. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv preprint arXiv:2201.09450, 2022
-
[25]
Oscar: Object-semantics aligned pre-training for vision-language tasks
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer, 2020
work page 2020
-
[26]
Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm
Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. In International Conference on Learning Representations
-
[27]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014
work page 2014
-
[28]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[29]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022
work page 2022
-
[30]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[31]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners
-
[32]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020
work page 2020
-
[33]
Vision transformers for dense prediction
René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021
work page 2021
-
[34]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020
work page 2020
-
[35]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022
work page 2022
-
[36]
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022
work page 2022
-
[37]
Learning to summarize with human feedback
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020
work page 2020
-
[38]
Multimodal few-shot learning with frozen language models
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200–212, 2021
work page 2021
-
[39]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[40]
Show and tell: A neural image caption generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015
work page 2015
-
[41]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
work page 2022
-
[42]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022
work page 2022
-
[43]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-t...
work page 2020
-
[44]
Holistically-nested edge detection
Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015
work page 2015
-
[45]
Canny edge detection based on open cv
Zhao Xu, Xu Baojie, and Wu Guoxin. Canny edge detection based on open cv. In 2017 13th IEEE international conference on electronic measurement & instruments (ICEMI), pages 53–56. IEEE, 2017
work page 2017
-
[46]
An empirical study of gpt-3 for few-shot knowledge-based vqa
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022
work page 2022
-
[47]
Star: Bootstrapping reasoning with reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems
-
[48]
From recognition to cognition: Visual commonsense reasoning
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6720–6731, 2019
work page 2019
-
[49]
Merlot reserve: Neural script knowledge through vision and language and sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387, 2022
work page 2022
-
[50]
Socratic models: Composing zero-shot multimodal reasoning with language
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022
-
[51]
Scaling vision transformers
Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022
work page 2022
-
[52]
Lit: Zero-shot transfer with locked-image text tuning
Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022
work page 2022
-
[53]
Adding conditional control to text-to-image diffusion models
Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023
-
[54]
Vinvl: Making visual representations matter in vision-language models
Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Making visual representations matter in vision-language models. arXiv preprint arXiv:2101.00529, 1(6):8, 2021
-
[55]
Text as neural operator: Image manipulation by text instruction
Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, and Irfan Essa. Text as neural operator: Image manipulation by text instruction. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1893–1902, 2021
work page 2021
-
[56]
Automatic chain of thought prompting in large language models
Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022
-
[57]
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023
work page 2023
-
[58]
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022
work page 2022