
arxiv: 2303.11381 · v1 · submitted 2023-03-20 · 💻 cs.CV · cs.CL · cs.LG

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Ce Liu, Ehsan Azarnasab, Faisal Ahmed, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Michael Zeng, Zhengyuan Yang, Zicheng Liu

Pith reviewed 2026-05-14 01:12 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.LG
keywords multimodal reasoning · ChatGPT · vision experts · textual prompt design · zero-shot learning · visual understanding · multimodal AI · reasoning and action

The pith

A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MM-REACT, a system that connects ChatGPT to a pool of vision experts through a specific textual prompt format. This format converts dense visual inputs such as images and videos into text descriptions, spatial coordinates, and aligned file names so the language model can accept and reason over them. The approach targets advanced vision tasks that current models struggle with, including complex reasoning over visual scenes. Zero-shot experiments show the combined system performs effectively across varied scenarios requiring deep visual understanding. The paper positions this prompting method as an alternative to training new multimodal models end-to-end.

Core claim

MM-REACT introduces a textual prompt design that represents text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. This design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts for tasks that exceed the capabilities of existing vision and vision-language models.

What carries the argument

The textual prompt design that encodes visual signals as text descriptions, textualized spatial coordinates, and aligned file names, enabling ChatGPT to integrate and reason with outputs from specialized vision experts.
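
To make the encoding concrete, here is a minimal sketch of how outputs from vision experts might be rendered as text for a language model. The field layout, the `(x1, y1, x2, y2)` pixel convention, the `Detection` class, and the file name are assumptions made for illustration; the paper's actual serialization format is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One object reported by a hypothetical detection expert."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; an assumed convention


def serialize_observation(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Render expert outputs as plain text that a language model can read."""
    lines = [f"Image file: {file_name}", f"Caption: {caption}"]
    if detections:
        lines.append("Objects (label, x1, y1, x2, y2):")
        for det in detections:
            x1, y1, x2, y2 = det.box
            lines.append(f"- {det.label}, {x1}, {y1}, {x2}, {y2}")
    else:
        lines.append("Objects: none detected")
    return "\n".join(lines)


# Made-up example values, not taken from the paper.
observation = serialize_observation(
    "images/kitchen_001.jpg",
    "a person slicing tomatoes on a wooden board",
    [Detection("person", (12, 30, 410, 620)), Detection("tomato", (220, 415, 290, 470))],
)
print(observation)
```

Keeping the file name in the text lets the model refer back to the same image in later turns, which is the role the aligned file names play in the description above.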

If this is right

  • Advanced vision tasks that exceed single-model limits become solvable in zero-shot fashion.
  • The system applies across many real-world scenarios that demand combined visual perception and language reasoning.
  • Synergetic use of existing models replaces the need for joint finetuning of language models on multimodal data.
  • Prompt-based integration supports flexible swapping of different vision experts without retraining the core language model.
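
The last point, swapping experts without touching the language model, can be sketched as a small registry plus a dispatcher. Everything here is hypothetical: the `CALL <expert> <file path>` request syntax is invented for this sketch, and the experts are stubs that return fixed strings rather than running real models.

```python
import re
from typing import Callable, Dict, Optional

# Hypothetical registry: expert name -> function taking a file path, returning text.
Expert = Callable[[str], str]


def caption_expert(path: str) -> str:
    return f"Caption for {path}: a placeholder description"  # stub, no real model


def ocr_expert(path: str) -> str:
    return f"Text found in {path}: (none)"  # stub, no real model


EXPERTS: Dict[str, Expert] = {"caption": caption_expert, "ocr": ocr_expert}

# Assumed request syntax, invented for this sketch: "CALL <expert> <file path>".
REQUEST = re.compile(r"^CALL (\w+) (\S+)$", re.MULTILINE)


def dispatch(llm_output: str, experts: Dict[str, Expert]) -> Optional[str]:
    """Run the expert named in the model output, or return None if no call is requested."""
    match = REQUEST.search(llm_output)
    if match is None:
        return None
    name, path = match.groups()
    if name not in experts:
        return f"Unknown expert: {name}"
    return experts[name](path)


print(dispatch("I need to read the sign.\nCALL ocr images/sign.png", EXPERTS))

# Swapping an expert changes only the registry entry; dispatch() is untouched.
EXPERTS["ocr"] = lambda path: f"Text found in {path}: EXIT"
print(dispatch("CALL ocr images/sign.png", EXPERTS))
```

In this arrangement the language model only ever sees text, so replacing an expert is a one-line change to the registry.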

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompt structure could extend to other large language models beyond ChatGPT for similar multimodal gains.
  • Testing information loss in the text encoding step would reveal the practical limits of this approach on high-detail imagery.
  • The method suggests a modular route for building AI agents that dynamically combine reasoning and perception components.
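
One narrow way to probe the information-loss question is to round-trip a bounding box through its text form and compare the result with the original. The sketch below does that under stated assumptions: the `(x1, y1, x2, y2)` text format and the rounding `step` are invented here, and the box values are made up. It measures only coordinate quantization, not the much larger loss from summarizing pixels as captions.

```python
import re
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_to_text(box: Box, step: int = 1) -> str:
    """Textualize a box, rounding each coordinate to a multiple of `step` pixels."""
    rounded = [int(round(v / step) * step) for v in box]
    return "({}, {}, {}, {})".format(*rounded)


def text_to_box(text: str) -> Box:
    """Parse the text form back into numbers."""
    x1, y1, x2, y2 = (float(n) for n in re.findall(r"-?\d+", text))
    return (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


original: Box = (103.4, 57.8, 148.9, 96.2)  # made-up small object
for step in (1, 10, 50):
    recovered = text_to_box(box_to_text(original, step))
    print(step, box_to_text(original, step), round(iou(original, recovered), 3))
```

Coarser rounding shortens the prompt but lowers the overlap with the original box, and the effect is strongest for small objects.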

Load-bearing premise

The textual prompt design can faithfully represent dense visual signals such as images and videos without losing critical information that language models need for accurate reasoning.

What would settle it

An experiment that measures MM-REACT performance on fine-grained spatial tasks such as precise object counting or layout description; if results match ChatGPT alone with no gain from the vision-expert prompts, the prompt design adds no value.
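
The comparison described above could be organized as a small harness that scores the same answering function with and without the expert observation in the prompt. This is a sketch only: the items are made-up placeholders, `stub_model` is a toy stand-in rather than ChatGPT, and the numbers it produces say nothing about MM-REACT itself.

```python
from typing import Callable, Dict, List

# Each item: a question, the expert observation text, and the gold count.
# All values are made-up placeholders, not results from the paper.
ITEMS: List[Dict[str, str]] = [
    {"question": "How many cups are on the table?", "observation": "Objects: cup, cup, cup", "gold": "3"},
    {"question": "How many dogs are in the image?", "observation": "Objects: dog", "gold": "1"},
]

Answerer = Callable[[str], str]


def accuracy(answer: Answerer, items: List[Dict[str, str]], with_observation: bool) -> float:
    """Exact-match accuracy of `answer` with or without the expert observation in the prompt."""
    correct = 0
    for item in items:
        prompt = item["question"]
        if with_observation:
            prompt = item["observation"] + "\n" + prompt
        if answer(prompt).strip() == item["gold"]:
            correct += 1
    return correct / len(items)


def stub_model(prompt: str) -> str:
    """Toy stand-in for a language model: counts listed objects, otherwise guesses 1."""
    if prompt.startswith("Objects:"):
        listed = prompt.splitlines()[0][len("Objects:"):]
        return str(len(listed.split(",")))
    return "1"


gain = accuracy(stub_model, ITEMS, True) - accuracy(stub_model, ITEMS, False)
print(f"accuracy gain from expert observations: {gain:.2f}")
```

A gain near zero on a real benchmark, with a real model in place of the stub, would be the outcome that counts against the prompt design.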

Original abstract

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MM-REACT, a prompting-based system that integrates ChatGPT with a pool of vision experts via a textual prompt design encoding text descriptions, spatial coordinates, and file names for images/videos. It defines advanced vision tasks exceeding current models and claims that zero-shot experiments show the approach enables effective multimodal reasoning and action across diverse scenarios.

Significance. If substantiated, the work would be moderately significant by demonstrating a practical, training-free way to combine LLMs with specialized vision modules for tasks requiring dense visual reasoning. The availability of code and demos is a positive for reproducibility, but the absence of any quantitative metrics or baselines limits the ability to assess whether the prompting strategy truly preserves necessary visual information or outperforms alternatives.

major comments (2)
  1. [Abstract] The central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.
  2. [Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.
minor comments (2)
  1. [Discussion] The manuscript would benefit from explicit comparison tables or figures contrasting MM-REACT against joint-finetuning baselines mentioned in the discussion.
  2. [Method] Notation for how vision-expert outputs are serialized into the prompt (e.g., coordinate formats, file-name alignment) should be formalized with an example in a dedicated subsection for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Our responses to the major comments are provided below. The manuscript presents MM-REACT as an exploratory system paradigm with qualitative demonstrations on novel advanced vision tasks; we address the evaluation concerns by clarifying scope and outlining additions.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.

    Authors: We agree that quantitative metrics, baselines, and formal error analysis are absent and would strengthen verification. The zero-shot experiments consist of diverse qualitative case studies on tasks explicitly defined to exceed current vision and vision-language models, where no established benchmarks exist. This makes standard quantitative comparison difficult. The contribution centers on the prompting paradigm enabling synergy rather than claiming superiority via metrics. In revision we will add a dedicated limitations and evaluation discussion section, including error analysis on the provided examples and suggestions for future quantitative protocols.

    Revision: partial

  2. Referee: [Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.

    Authors: The prompt design is an empirical encoding strategy shown to work in practice for the demonstrated reasoning and action tasks; we do not assert lossless preservation of all pixel-level or temporal details. No information-theoretic argument or dedicated ablation is included because the work prioritizes system-level demonstration over theoretical analysis of the representation. We will expand the prompt design section with explicit discussion of potential information loss, add comparisons to available direct multimodal models on overlapping subtasks, and note the heuristic nature of the approach.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: prompting system is self-contained engineering proposal

Full rationale

The paper introduces MM-REACT as a prompting-based integration of ChatGPT with external vision experts, relying on a textual representation of images/videos (descriptions, coordinates, file names) and zero-shot experiments. No equations, fitted parameters, or derivations are present. The central claims rest on empirical demonstration and the assumption that the prompt format preserves necessary signals, without any self-citation load-bearing steps, self-definitional reductions, or renaming of known results. The approach is a practical system design rather than a closed mathematical chain, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that existing vision experts can be effectively orchestrated via text prompts without introducing new fitted parameters or postulated entities.

axioms (1)
  • domain assumption: Language models can reliably interpret and act on textualized visual signals, including spatial coordinates and file references.
    Invoked in the description of the prompt design that enables multimodal processing.

pith-pipeline@v0.9.0 · 5518 in / 1087 out tokens · 55688 ms · 2026-05-14T01:12:43.672888+00:00 · methodology

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.

  2. AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation

    cs.CV 2026-05 conditional novelty 7.0

    AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...

  3. V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.

  4. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  5. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  6. Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.

  7. OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

    cs.CV 2026-03 conditional novelty 7.0

    OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.

  8. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  9. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    cs.CV 2023-10 accept novelty 7.0

    Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.

  10. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  11. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  12. PresentAgent-2: Towards Generalist Multimodal Presentation Agents

    cs.CV 2026-05 unverdicted novelty 6.0

    PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.

  13. Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...

  14. Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning

    cs.CV 2026-05 unverdicted novelty 6.0

    HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.

  15. AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

    cs.CV 2026-05 unverdicted novelty 6.0

    AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.

  16. DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates

    cs.CV 2026-04 unverdicted novelty 6.0

    DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...

  17. Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.

  18. Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

    cs.CV 2023-11 unverdicted novelty 6.0

    Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

  19. Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

    cs.CV 2026-05 unverdicted novelty 5.0

    MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

  20. LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People

    cs.AI 2026-04 unverdicted novelty 5.0

    A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.

  21. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

    cs.CV 2023-12 unverdicted novelty 5.0

    InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.

  22. Materials Informatics Across the Length Scales

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 2.0

    A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.
