arxiv: 2303.11381 · v1 · submitted 2023-03-20 · 💻 cs.CV · cs.CL · cs.LG

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Ce Liu, Ehsan Azarnasab, Faisal Ahmed, Jianfeng Wang, Kevin Lin, Lijuan Wang, Linjie Li, Michael Zeng, Zhengyuan Yang, Zicheng Liu

Pith reviewed 2026-05-14 01:12 UTC · model grok-4.3

keywords multimodal reasoning · ChatGPT · vision experts · textual prompt design · zero-shot learning · visual understanding · multimodal AI · reasoning and action

The pith

A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MM-REACT as a system that connects ChatGPT to a pool of vision experts through a specific textual prompt format. This format converts dense visual inputs such as images and videos into text descriptions, spatial coordinates, and aligned file names so the language model can accept and reason over them. The approach targets advanced vision tasks that current models struggle with, including complex reasoning over visual scenes. Zero-shot experiments show the combined system performs effectively across varied scenarios requiring deep visual understanding. It positions this prompting method as an alternative to training new multimodal models end-to-end.

Core claim

MM-REACT introduces a textual prompt design that represents text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. This design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts for tasks that exceed the capabilities of existing vision and vision-language models.

What carries the argument

The textual prompt design that encodes visual signals as text descriptions, textualized spatial coordinates, and aligned file names, enabling ChatGPT to integrate and reason with outputs from specialized vision experts.
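As a rough illustration of what such an encoding might look like, the sketch below turns one image's expert outputs into a block of text. The `Detection` type, the field layout, the coordinate convention, and the file path are all assumptions made for illustration; the paper's actual prompt format may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One object reported by a hypothetical detection expert."""
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels; assumed layout


def serialize_observation(file_name: str, caption: str, detections: List[Detection]) -> str:
    """Render expert outputs as plain text that a language model can read."""
    lines = [f"Image file: {file_name}", f"Caption: {caption}", "Objects:"]
    for det in detections:
        x1, y1, x2, y2 = det.box
        # Coordinates are written out as text so the model can refer to regions.
        lines.append(f"- {det.label} at (x1={x1}, y1={y1}, x2={x2}, y2={y2})")
    return "\n".join(lines)


prompt_block = serialize_observation(
    "images/example.jpg",  # hypothetical path
    "a dog sitting next to a bicycle",
    [Detection("dog", (34, 120, 210, 330)), Detection("bicycle", (190, 80, 480, 340))],
)
print(prompt_block)
```

Keeping the file name in the text is what would let a later turn ask for another expert to be run on the same image.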

If this is right

Advanced vision tasks that exceed single-model limits become solvable in zero-shot fashion.
The system applies across many real-world scenarios that demand combined visual perception and language reasoning.
Synergetic use of existing models replaces the need for joint finetuning of language models on multimodal data.
Prompt-based integration supports flexible swapping of different vision experts without retraining the core language model.
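The last point, swapping experts without retraining, can be sketched as a registry lookup. Everything here is hypothetical: the expert names, the stub functions, and `run_expert` are illustrative stand-ins, not the system's actual dispatch mechanism.

```python
from typing import Callable, Dict


# Hypothetical experts: each maps a file name to a textual observation.
def caption_expert(file_name: str) -> str:
    return f"Caption for {file_name}: (stub output)"


def ocr_expert(file_name: str) -> str:
    return f"Text found in {file_name}: (stub output)"


EXPERTS: Dict[str, Callable[[str], str]] = {
    "caption": caption_expert,
    "ocr": ocr_expert,
}


def run_expert(name: str, file_name: str,
               experts: Dict[str, Callable[[str], str]] = EXPERTS) -> str:
    """Look up an expert by name and return its textual observation."""
    if name not in experts:
        return f"Unknown expert '{name}'. Available: {sorted(experts)}"
    return experts[name](file_name)


# Swapping an expert only changes the registry; the language model is untouched.
def better_caption_expert(file_name: str) -> str:
    return f"Detailed caption for {file_name}: (stub output)"


swapped = {**EXPERTS, "caption": better_caption_expert}
print(run_expert("caption", "images/example.jpg"))
print(run_expert("caption", "images/example.jpg", swapped))
```

Because every expert returns text, the language model sees the same interface regardless of which model sits behind a given name.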

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompt structure could extend to other large language models beyond ChatGPT for similar multimodal gains.
Testing information loss in the text encoding step would reveal the practical limits of this approach on high-detail imagery.
The method suggests a modular route for building AI agents that dynamically combine reasoning and perception components.

Load-bearing premise

The textual prompt design can faithfully represent dense visual signals such as images and videos without losing critical information that language models need for accurate reasoning.
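One small, measurable instance of this concern is the precision lost when coordinates are written out as text. The sketch below assumes a simple rounding-based encoding, which is an illustrative choice rather than the paper's format, and measures the round-trip error on toy values.

```python
import re


def textualize_box(box, decimals=0):
    """Write a box as text, rounding each coordinate to `decimals` places."""
    return "(" + ", ".join(f"{v:.{decimals}f}" for v in box) + ")"


def parse_box(text):
    """Recover the numbers from the textual form."""
    return tuple(float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", text))


def round_trip_error(box, decimals=0):
    """Largest absolute coordinate change caused by the text encoding."""
    recovered = parse_box(textualize_box(box, decimals))
    return max(abs(a - b) for a, b in zip(box, recovered))


box = (34.4, 120.6, 210.25, 330.9)  # toy values, not taken from the paper
print(textualize_box(box))  # (34, 121, 210, 331)
print(round_trip_error(box))
print(round_trip_error(box, decimals=2))
```

Coordinate rounding is only the easiest loss to quantify; details such as texture or timing that never reach the text at all would need a task-level test.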

What would settle it

An experiment that measures MM-REACT performance on fine-grained spatial tasks such as precise object counting or layout description; if results match ChatGPT alone with no gain from the vision-expert prompts, the prompt design adds no value.
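The comparison described above could be scored along the following lines. This is a minimal sketch assuming exact-match accuracy on an object-counting task; the function names are hypothetical and the numbers are placeholders that only show the call shape, not results.

```python
def accuracy(predictions, labels):
    """Fraction of exact matches between predicted and true counts."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def expert_gain(with_experts, without_experts, labels):
    """Accuracy difference between the two conditions (with minus without)."""
    return accuracy(with_experts, labels) - accuracy(without_experts, labels)


# Placeholder values only; these are not measurements from any system.
labels = [3, 5, 2, 7]
with_experts = [3, 5, 2, 6]
without_experts = [3, 4, 1, 6]
print(expert_gain(with_experts, without_experts, labels))
```

A gain near zero on real data would be the outcome that counts against the prompt design.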

Original abstract

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MM-REACT shows a workable prompting setup to link ChatGPT to vision experts via text-encoded coordinates and files, but its zero-shot claims rest on demos without metrics or ablations.

The paper's main contribution is a prompt format that packs text descriptions, spatial coordinates, and file references into something ChatGPT can use to call separate vision models. This lets the system tackle tasks like complex scene reasoning or action planning that standard vision-language models handle poorly, all without joint finetuning. The authors lay out a list of advanced capabilities and demonstrate the orchestration in zero-shot examples across different scenarios. Releasing code, a demo, and visualizations makes the approach easy to inspect and build on. That concrete design choice is the part worth paying attention to, as it gives a practical route for combining existing components rather than training new ones from scratch.

The evaluation stays qualitative. The abstract and body rely on example outputs to claim effectiveness, with no reported metrics, baselines, or error analysis to show gains over using the vision experts alone or other prompting methods. The central assumption, that turning dense visual signals into text preserves the information needed for accurate reasoning, also lacks any ablation or test for information loss on fine details like texture or timing. Without those checks, it's hard to know where the system actually breaks.

This is aimed at people working on multimodal agents, robotics interfaces, or content tools who want to leverage large language models quickly with off-the-shelf vision components. It deserves a serious referee because the prompt design is a clear, implementable idea that engages real system-level questions, even if the experiments need more rigor to stand up to scrutiny. Send it for review and ask for quantitative comparisons and failure analysis in revisions.

Referee Report

2 major / 2 minor

Summary. The paper proposes MM-REACT, a prompting-based system that integrates ChatGPT with a pool of vision experts via a textual prompt design encoding text descriptions, spatial coordinates, and file names for images/videos. It defines advanced vision tasks exceeding current models and claims that zero-shot experiments show the approach enables effective multimodal reasoning and action across diverse scenarios.

Significance. If substantiated, the work would be moderately significant by demonstrating a practical, training-free way to combine LLMs with specialized vision modules for tasks requiring dense visual reasoning. The availability of code and demos is a positive for reproducibility, but the absence of any quantitative metrics or baselines limits the ability to assess whether the prompting strategy truly preserves necessary visual information or outperforms alternatives.

major comments (2)

[Abstract] Abstract: the central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.
[Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.

minor comments (2)

[Discussion] The manuscript would benefit from explicit comparison tables or figures contrasting MM-REACT against joint-finetuning baselines mentioned in the discussion.
[Method] Notation for how vision-expert outputs are serialized into the prompt (e.g., coordinate formats, file-name alignment) should be formalized with an example in a dedicated subsection for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. Our responses to the major comments are provided below. The manuscript presents MM-REACT as an exploratory system paradigm with qualitative demonstrations on novel advanced vision tasks; we address the evaluation concerns by clarifying scope and outlining additions.

Referee: [Abstract] Abstract: the central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.

Authors: We agree that quantitative metrics, baselines, and formal error analysis are absent and would strengthen verification. The zero-shot experiments consist of diverse qualitative case studies on tasks explicitly defined to exceed current vision and vision-language models, where no established benchmarks exist. This makes standard quantitative comparison difficult. The contribution centers on the prompting paradigm enabling synergy rather than claiming superiority via metrics. In revision we will add a dedicated limitations and evaluation discussion section, including error analysis on the provided examples and suggestions for future quantitative protocols.
Revision: partial

Referee: [Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.

Authors: The prompt design is an empirical encoding strategy shown to work in practice for the demonstrated reasoning and action tasks; we do not assert lossless preservation of all pixel-level or temporal details. No information-theoretic argument or dedicated ablation is included because the work prioritizes system-level demonstration over theoretical analysis of the representation. We will expand the prompt design section with explicit discussion of potential information loss, add comparisons to available direct multimodal models on overlapping subtasks, and note the heuristic nature of the approach.
Revision: partial

Circularity Check

0 steps flagged

No circularity: prompting system is self-contained engineering proposal

The paper introduces MM-REACT as a prompting-based integration of ChatGPT with external vision experts, relying on a textual representation of images/videos (descriptions, coordinates, file names) and zero-shot experiments. No equations, fitted parameters, or derivations are present. The central claims rest on empirical demonstration and the assumption that the prompt format preserves necessary signals, without any self-citation load-bearing steps, self-definitional reductions, or renaming of known results. The approach is a practical system design rather than a closed mathematical chain, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that existing vision experts can be effectively orchestrated via text prompts without introducing new fitted parameters or postulated entities.

axioms (1)

Domain assumption: Language models can reliably interpret and act on textualized visual signals, including spatial coordinates and file references. Invoked in the description of the prompt design that enables multimodal processing.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action... MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos.
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear
Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
cs.CV 2026-05 unverdicted novelty 7.0

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
cs.CV 2026-05 conditional novelty 7.0

AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL 2026-05 unverdicted novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
cs.CV 2026-03 conditional novelty 7.0

OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
Visual Instruction Tuning
cs.CV 2023-04 unverdicted novelty 7.0

LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
cs.CV 2026-05 unverdicted novelty 6.0

PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
cs.CV 2026-05 unverdicted novelty 6.0

AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
cs.CV 2026-04 unverdicted novelty 6.0

DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
cs.CV 2026-04 unverdicted novelty 6.0

MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
cs.CV 2026-05 unverdicted novelty 5.0

MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
cs.AI 2026-04 unverdicted novelty 5.0

A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
Materials Informatics Across the Length Scales
cond-mat.mtrl-sci 2026-04 unverdicted novelty 2.0

A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · cited by 22 Pith papers · 13 internal anchors

[1]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Cheb- otar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a vi- sual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[3]

Lan- guage models are few-shot learners

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners. In NeurIPS, 2020. 2

work page 2020
[4]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. In ECCV, 2020. 2

work page 2020
[5]

Langchain

Harrison Chase. Langchain. https://langchain. readthedocs.io/, 2023. 4 6 Figure 4. Case studies of MM-R EACT’s capabilities and application scenarios: visual math and text reasoning. 7

work page 2023
[6]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Nus-wide: a real-world web im- age database from national university of singapore

Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhip- ing Luo, and Yantao Zheng. Nus-wide: a real-world web im- age database from national university of singapore. In Pro- ceedings of the ACM international conference on image and video retrieval, pages 1–9, 2009. 2

work page 2009
[8]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023. 2

work page arXiv 2023
[9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[10]

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duck- worth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

Pal: Program-aided language models,

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022. 2

work page arXiv 2022
[12]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR,

work page
[13]

Promptcap: Prompt-guided task- aware image captioning

Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task- aware image captioning. arXiv preprint arXiv:2211.09699,

work page arXiv
[14]

Language is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language mod- els. arXiv preprint arXiv:2302.14045, 2023. 2, 3

work page arXiv 2023
[15]

Language models as zero-shot planners: Extract- ing actionable knowledge for embodied agents

Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extract- ing actionable knowledge for embodied agents. In Interna- tional Conference on Machine Learning , pages 9118–9147. PMLR, 2022. 2

work page 2022
[16]

The mir ﬂickr retrieval evaluation

Mark J Huiskes and Michael S Lew. The mir ﬂickr retrieval evaluation. In Proceedings of the 1st ACM international con- ference on Multimedia information retrieval , pages 39–43,

work page
[17]

Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2

work page 2014
[20]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015. 2

work page 2015
[21]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022. 3

work page arXiv 2022
[22]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021. 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[23]

Gpt-4 technical report

OpenAI. Gpt-4 technical report. 2023. 2, 4

work page 2023
[24]

Toolformer: Language Models Can Teach Themselves to Use Tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dess `ı, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Can- cedda, and Thomas Scialom. Toolformer: Language mod- els can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Vipergpt: Visual inference via python execution for reasoning

D ´ıdac Sur´ıs, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 3

work page arXiv 2023
[26]

innocent until proven guilty

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of- thought reasoning for knowledge-intensive multi-step ques- tions. arXiv preprint arXiv:2212.10509, 2022. 2

work page arXiv 2022
[27]

Multimodal few- shot learning with frozen language models

Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, SM Es- lami, Oriol Vinyals, and Felix Hill. Multimodal few- shot learning with frozen language models. arXiv preprint arXiv:2106.13884, 2021. 2

work page arXiv 2021
[28]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2

work page 2017
[29]

arXiv preprint arXiv:2205.14100 , year=

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 2

work page arXiv 2022
[30]

Language models with im- age descriptors are strong few-shot video-language learners

Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chen- guang Zhu, Derek Hoiem, et al. Language models with im- age descriptors are strong few-shot video-language learners. In Advances in Neural Information Processing Systems. 2

work page
[31]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 2, 3 8

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

An empirical study of gpt-3 for few-shot knowledge-based vqa

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yu- mao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artiﬁcial Intelligence, volume 36, pages 3081–3089, 2022. 2

work page 2022
[34]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Florence: A new foundation model for computer vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 2

[36]

Socratic models: Composing zero-shot multimodal reasoning with language

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 2

[37]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023. 3

[38]

Generalized decoding for pixel, image and language

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image and language. 2022. 6, 30

Figure text (Visual-conditioned Joke/Meme, truncated): "This image is a cartoon illustration of a squirrel holding a camera" ... "Why would a person find this..."
Figure text (video step localization and shortening example)

Steps with time segments:

1. Heat a little olive oil in a pan and grill the tomatoes: 00:00:22 - 00:00:40
2. Cook the bacon in a frying pan: 00:00:51 - 00:01:41
3. Place the cooked bacon on a paper towel: 00:01:39 - 00:01:49
4. Spread mayo and add Worcestershire sauce: 00:01:55 - 00:02:16
5. Assemble the sandwich (lettuce, tomatoes, bacon): 00:02:22 - 00:03:11
   a. Layer lettuce: 00:02:44 - 00:02:49
   b. Add grilled tomatoes: 00:02:55 - 00:03:01
   c. Place bacon on top: 00:03:15 - 00:03:27
6. Sprinkle salt and pepper: (time segment missing from the extracted text)
7. Top the sandwich with the other slice of bread: 00:03:29 - 00:03:34

User: I am trying to create a short version of the video, can you suggest a shorter segment for each step?

Response: Here's a suggested shorter segment for each step:

1. Heat a little olive oil in a pan and grill the tomatoes: 00:00:24 - 00:00:30
2. Cook the bacon in a frying pan: 00:00:53 - 00:00:59
3. Place the cooked bacon on a paper towel: 00:01:39 - 00:01:41
4. Spread mayo and add Worcestershire sauce: 00:02:00 - 00:02:08
5. Assemble the sandwich (lettuce, tomatoes, bacon): 00:02:44 - 00:03:01
   a. Layer lettuce: 00:02:44 - 00:02:47
   b. Add grilled tomatoes: 00:02:55 - 00:02:59
   c. Place bacon on top: 00:03:15 - 00:03:20
6. Sprinkle salt and pepper: 00:03:08 - 00:03:11
7. Top the sandwich with the other slice of bread: 00:03:29 - 00:03:34

Thumbnail labels: 1. Grill tomato 2. Fry Bacon 3. Drain Bacon 4. Spread Source 5a. Layer Lettuce 5b. Add Tomato 5c. Place Bacon 6. Sprinkle Salt/Pepper

User: Which segment is best for video thumbnail to attract more audience?

Response: For a video thumbnail that attracts more audience, the segment showing the assembled sandwich with all its layers would be the most appealing. This image will show...
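The time segments listed above lend themselves to a quick consistency check: how long each original and suggested segment is, and whether each suggested segment falls inside its original. The following is a minimal sketch, not part of the paper. The helper names (`to_seconds`, `parse_segment`, `summarize`) are hypothetical, the step labels are abbreviated, and only a subset of the steps is included.

```python
def to_seconds(stamp):
    """Convert an HH:MM:SS timestamp into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in stamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_segment(text):
    """Split 'HH:MM:SS - HH:MM:SS' into (start, end) in seconds."""
    start, end = (to_seconds(piece.strip()) for piece in text.split(" - "))
    return start, end


# (abbreviated step label, original segment, suggested shorter segment),
# copied from the figure text; this is a subset chosen for illustration.
steps = [
    ("Grill the tomatoes", "00:00:22 - 00:00:40", "00:00:24 - 00:00:30"),
    ("Cook the bacon", "00:00:51 - 00:01:41", "00:00:53 - 00:00:59"),
    ("Drain the bacon", "00:01:39 - 00:01:49", "00:01:39 - 00:01:41"),
    ("Spread mayo", "00:01:55 - 00:02:16", "00:02:00 - 00:02:08"),
    ("Top with bread", "00:03:29 - 00:03:34", "00:03:29 - 00:03:34"),
]


def summarize(rows):
    """Report durations and whether each shorter segment lies inside the original."""
    report = []
    for name, original, shorter in rows:
        o_start, o_end = parse_segment(original)
        s_start, s_end = parse_segment(shorter)
        report.append({
            "step": name,
            "original_s": o_end - o_start,
            "shorter_s": s_end - s_start,
            "inside_original": o_start <= s_start and s_end <= o_end,
        })
    return report


report = summarize(steps)
for row in report:
    print(row)
```

For the steps included here, every suggested segment stays within its original segment, and the last step is left unchanged.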
Figure 17. Comparison of MM-REACT with PaLM-E [10] on illustrated capabilities. We empirically show that text prompts are as effective as expensive joint fine-tuning in solving complicated vision problems.

Figure text (Physical Prediction panel):

Prompt: When a human asks me to do a task, I will respond with the sequence of actions I would do to accomplish the task with only the items I see. Human: Use all of the ingredients you see to make a cake batter.

PaLM-E: 1. crack egg. 2. put egg in bowl. 3. put... (truncated in extraction) ... clean up.

Ours:

1. Gather all of the bags of flour and empty them into a large mixing bowl.
2. Add any other dry ingredients, such as sugar or baking powder, to the bowl.
3. Mix the dry ingredients together until they are well combined.
4. Add any liquids, such as milk or eggs, to the bowl and mix until the batter is smooth.
5. Pour the batter into a greased cake pan and bake according to the recipe instructions.

Figure text (unfolded reasoning and action steps, truncated):

Convert image to <ImagePath> and send to ChatGPT
Thought 1 ChatGPT: This is an image.
Action 1 Assistant, what objects do you see i...
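The "Action 1 Assistant, ..." line in the trace above suggests that a reply addressed to "Assistant," is treated as a request for a vision expert, while any other reply goes back to the user. The sketch below illustrates that routing idea only, under that assumption. The regular expression, the `describe_objects` stub, the `route` function, and the example request string are all hypothetical and are not taken from the paper's implementation.

```python
import re

# Hypothetical watchword pattern, inferred from the "Assistant, ..." line in the trace.
WATCHWORD = re.compile(r"Assistant,\s*(?P<request>.+)", re.DOTALL)


def describe_objects(image_path):
    """Stand-in for a real vision expert; returns canned text for illustration."""
    return f"Objects detected in {image_path}: (placeholder output)"


def route(reply, image_path):
    """Return an observation string if the reply asks for an expert, else None."""
    match = WATCHWORD.search(reply)
    if match is None:
        # No watchword: the reply is treated as the final answer to the user.
        return None
    request = match.group("request").strip()
    # A real system would pick an expert based on the request; this sketch has one.
    return f"Request: {request}\nObservation: {describe_objects(image_path)}"


# Made-up request string for illustration; it is not the truncated text from the figure.
observation = route("Assistant, describe the objects in <ImagePath>", "<ImagePath>")
final = route("This is an image of a sandwich.", "<ImagePath>")
print(observation)
print(final)
```

In this sketch the observation is plain text, so it can be appended to the conversation history and sent back to the language model for the next thought.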
Figure text (truncated): ... “Start” label and ends with the “Stop” ...

($130.96 + $82.56 + $70.55) / 3 = $101.07. Therefore, the average charging cost is $82.41 cheaper than the average fueling cost. This is calculated by subtracting the average charging cost from the average fueling cost. $101.07 - $18.66 = $82.41.

Convert image to <ImagePath> and send to ChatGPT

Figure 19. Unfolded multimodal reasoning and action steps for...
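A quick way to sanity-check the averaging quoted in the figure text above is to redo it from the three fueling costs as they appear in this extraction. The sketch below assumes those three dollar amounts were transcribed correctly, which may not be the case, and the variable names are illustrative. Under that assumption the mean comes to $94.69 rather than the quoted $101.07, so either one of the extracted amounts or the arithmetic shown in the figure is off. The subtraction step, $101.07 - $18.66 = $82.41, is internally consistent.

```python
from decimal import Decimal

# Fueling costs exactly as they appear in the extracted figure text
# (assumed, not verified, to be transcribed correctly).
fueling_costs = [Decimal("130.96"), Decimal("82.56"), Decimal("70.55")]

# Values quoted in the figure text; the inputs behind the charging
# average are not visible in this excerpt.
quoted_avg_fueling = Decimal("101.07")
quoted_avg_charging = Decimal("18.66")


def mean(values):
    """Arithmetic mean rounded to cents."""
    return (sum(values) / len(values)).quantize(Decimal("0.01"))


recomputed_avg_fueling = mean(fueling_costs)
quoted_difference = quoted_avg_fueling - quoted_avg_charging
recomputed_difference = recomputed_avg_fueling - quoted_avg_charging

print("recomputed average fueling cost:", recomputed_avg_fueling)
print("difference using quoted average:", quoted_difference)
print("difference using recomputed average:", recomputed_difference)
```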