Recognition: 2 theorem links · Lean Theorem
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Pith reviewed 2026-05-14 01:12 UTC · model grok-4.3
The pith
A textual prompt design lets ChatGPT collaborate with vision experts to handle advanced multimodal reasoning and action in zero-shot settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MM-REACT introduces a textual prompt design that represents text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. This design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts for tasks that exceed the capabilities of existing vision and vision-language models.
What carries the argument
The textual prompt design that encodes visual signals as text descriptions, textualized spatial coordinates, and aligned file names, enabling ChatGPT to integrate and reason with outputs from specialized vision experts.
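To make that carrier concrete, the sketch below shows one way a ReAct-style dispatch loop could feed textualized vision-expert outputs back to a chat model. The expert names, the ACTION/OBSERVATION syntax, and the call_llm placeholder are illustrative assumptions, not the paper's released implementation.

```python
from typing import Callable, Dict

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API (e.g., ChatGPT); plug in a real client."""
    raise NotImplementedError

# Hypothetical vision experts: each maps an image file name to a textual observation.
VISION_EXPERTS: Dict[str, Callable[[str], str]] = {
    "image_captioner": lambda path: f"Caption for {path}: a person assembling a sandwich",
    "object_detector": lambda path: f"Objects in {path}: tomato at (120, 80, 180, 140); bacon at (200, 90, 310, 160)",
}

def mm_react_loop(user_request: str, image_path: str, max_steps: int = 5) -> str:
    """ReAct-style loop: the LLM either asks an expert ('ACTION: <expert>') or answers ('FINAL: ...')."""
    transcript = f"Image file: {image_path}\nUser: {user_request}\n"
    for _ in range(max_steps):
        reply = call_llm(transcript)
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        if reply.startswith("ACTION:"):
            expert_name = reply[len("ACTION:"):].strip()
            expert = VISION_EXPERTS.get(expert_name)
            observation = expert(image_path) if expert else f"Unknown expert '{expert_name}'"
            transcript += f"{reply}\nOBSERVATION: {observation}\n"
        else:
            transcript += f"THOUGHT: {reply}\n"
    return "No final answer within the step budget."

# Example usage (requires a real call_llm implementation):
# answer = mm_react_loop("How many ingredients are visible?", "kitchen_001.png")
```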
If this is right
- Advanced vision tasks that exceed single-model limits become solvable in zero-shot fashion.
- The system applies across many real-world scenarios that demand combined visual perception and language reasoning.
- Synergetic use of existing models replaces the need for joint finetuning of language models on multimodal data.
- Prompt-based integration supports flexible swapping of different vision experts without retraining the core language model.
Where Pith is reading between the lines
- The same prompt structure could extend to other large language models beyond ChatGPT for similar multimodal gains.
- Testing information loss in the text encoding step would reveal the practical limits of this approach on high-detail imagery.
- The method suggests a modular route for building AI agents that dynamically combine reasoning and perception components.
Load-bearing premise
The textual prompt design can faithfully represent dense visual signals such as images and videos without losing critical information that language models need for accurate reasoning.
What would settle it
An experiment that measures MM-REACT performance on fine-grained spatial tasks such as precise object counting or layout description; if results match ChatGPT alone with no gain from the vision-expert prompts, the prompt design adds no value.
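A minimal harness for that test might look like the following sketch, assuming a generic call_llm completion client, a small labeled counting set, and an expert_text field holding the textualized detector output; none of these names come from the paper.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion client; replace with a real API call."""
    raise NotImplementedError

def counting_accuracy(examples, use_expert_text: bool) -> float:
    """Accuracy on exact object counts, with or without the vision-expert prompt prefix.

    Each example is assumed to be {"question": str, "expert_text": str, "answer": int}.
    """
    correct = 0
    for ex in examples:
        prompt = (ex["expert_text"] + "\n" if use_expert_text else "") + ex["question"]
        reply = call_llm(prompt)
        digits = [int(tok) for tok in reply.replace(".", " ").split() if tok.isdigit()]
        correct += int(bool(digits) and digits[0] == ex["answer"])
    return correct / max(len(examples), 1)

# If counting_accuracy(data, True) is not meaningfully above counting_accuracy(data, False),
# the textual encoding of the detector output carries no usable fine-grained signal.
```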
read the original abstract
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action. In this paper, we define and explore a comprehensive list of advanced vision tasks that are intriguing to solve, but may exceed the capabilities of existing vision and vision-language models. To achieve such advanced visual intelligence, MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos. MM-REACT's prompt design allows language models to accept, associate, and process multimodal information, thereby facilitating the synergetic combination of ChatGPT and various vision experts. Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding. Furthermore, we discuss and compare MM-REACT's system paradigm with an alternative approach that extends language models for multimodal scenarios through joint finetuning. Code, demo, video, and visualization are available at https://multimodal-react.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MM-REACT, a prompting-based system that integrates ChatGPT with a pool of vision experts via a textual prompt design encoding text descriptions, spatial coordinates, and file names for images/videos. It defines advanced vision tasks exceeding current models and claims that zero-shot experiments show the approach enables effective multimodal reasoning and action across diverse scenarios.
Significance. If substantiated, the work would be moderately significant by demonstrating a practical, training-free way to combine LLMs with specialized vision modules for tasks requiring dense visual reasoning. The availability of code and demos is a positive for reproducibility, but the absence of any quantitative metrics or baselines limits the ability to assess whether the prompting strategy truly preserves necessary visual information or outperforms alternatives.
major comments (2)
- [Abstract] The central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.
- [Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.
minor comments (2)
- [Discussion] The manuscript would benefit from explicit comparison tables or figures contrasting MM-REACT against joint-finetuning baselines mentioned in the discussion.
- [Method] Notation for how vision-expert outputs are serialized into the prompt (e.g., coordinate formats, file-name alignment) should be formalized with an example in a dedicated subsection for clarity.
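In the spirit of that request, one possible notation is sketched below; the pixel-space (x1, y1, x2, y2) convention, the field layout, and serialize_detection are illustrative assumptions rather than the paper's actual format.

```python
def serialize_detection(file_name: str, label: str, box: tuple) -> str:
    """One possible textual notation for a single detector output.

    Assumed convention: <file_name>: <label> (<x1>, <y1>, <x2>, <y2>),
    with coordinates in integer pixel units of the named image.
    """
    x1, y1, x2, y2 = box
    return f"{file_name}: {label} ({x1}, {y1}, {x2}, {y2})"

# Example: two detections aligned to the same image file.
print(serialize_detection("receipt_003.jpg", "total_amount_text", (412, 980, 560, 1012)))
print(serialize_detection("receipt_003.jpg", "logo", (40, 30, 200, 110)))
```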
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. Our responses to the major comments are provided below. The manuscript presents MM-REACT as an exploratory system paradigm with qualitative demonstrations on novel advanced vision tasks; we address the evaluation concerns by clarifying scope and outlining additions.
read point-by-point responses
-
Referee: [Abstract] The central claim that 'zero-shot experiments demonstrate MM-REACT's effectiveness' is unsupported because the manuscript supplies no quantitative metrics, baselines, error analysis, or ablation studies on information loss in the textual prompt representation. This directly undermines verification of the synergetic combination with vision experts.
Authors: We agree that quantitative metrics, baselines, and formal error analysis are absent and would strengthen verification. The zero-shot experiments consist of diverse qualitative case studies on tasks explicitly defined to exceed current vision and vision-language models, where no established benchmarks exist. This makes standard quantitative comparison difficult. The contribution centers on the prompting paradigm enabling synergy rather than claiming superiority via metrics. In revision we will add a dedicated limitations and evaluation discussion section, including error analysis on the provided examples and suggestions for future quantitative protocols. revision: partial
-
Referee: [Prompt Design section (inferred from abstract description)] The textual prompt design (text descriptions + textualized spatial coordinates + aligned file names) is presented as sufficient to let ChatGPT faithfully process dense visual signals without critical loss, yet no information-theoretic argument, ablation on pixel-level or temporal detail preservation, or comparison to direct multimodal models is provided to support this assumption for tasks exceeding existing vision models.
Authors: The prompt design is an empirical encoding strategy shown to work in practice for the demonstrated reasoning and action tasks; we do not assert lossless preservation of all pixel-level or temporal details. No information-theoretic argument or dedicated ablation is included because the work prioritizes system-level demonstration over theoretical analysis of the representation. We will expand the prompt design section with explicit discussion of potential information loss, add comparisons to available direct multimodal models on overlapping subtasks, and note the heuristic nature of the approach. revision: partial
Circularity Check
No circularity: the prompting system is a self-contained engineering proposal
full rationale
The paper introduces MM-REACT as a prompting-based integration of ChatGPT with external vision experts, relying on a textual representation of images/videos (descriptions, coordinates, file names) and zero-shot experiments. No equations, fitted parameters, or derivations are present. The central claims rest on empirical demonstration and the assumption that the prompt format preserves necessary signals, without any self-citation load-bearing steps, self-definitional reductions, or renaming of known results. The approach is a practical system design rather than a closed mathematical chain, so no load-bearing step reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Language models can reliably interpret and act on textualized visual signals, including spatial coordinates and file references
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.RealityFromDistinction · reality_from_one_distinction (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action... MM-REACT introduces a textual prompt design that can represent text descriptions, textualized spatial coordinates, and aligned file names for dense visual signals such as images and videos.
-
IndisputableMonolith.Cost.FunctionalEquation · washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Zero-shot experiments demonstrate MM-REACT's effectiveness in addressing the specified capabilities of interests and its wide application in different scenarios that require advanced visual understanding.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 24 Pith papers
-
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domai...
-
V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Tarot-SAM3 delivers a training-free pipeline for segmenting images from arbitrary referring expressions via expression reasoning prompts and DINOv3-based mask self-refinement.
-
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
VideoChat: Chat-Centric Video Understanding
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
PresentAgent-2 generates query-driven multimodal presentation videos with research grounding, supporting single-speaker, multi-speaker discussion, and interactive question-answering modes.
-
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
-
DenTab: A Dataset for Table Recognition and Visual QA on Real-World Dental Estimates
DenTab provides 2,000 annotated dental table images and 2,208 questions to benchmark 16 systems on table structure recognition and VQA, revealing that strong layout recovery does not ensure reliable multi-step arithme...
-
Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization
MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
-
LLM-Guided Agentic Floor Plan Parsing for Accessible Indoor Navigation of Blind and Low-Vision People
A self-correcting multi-agent LLM pipeline parses floor plans into graphs and generates accessible routes, outperforming single LLM calls with success rates up to 92% on short paths in a real university building.
-
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
-
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
LLaMA-Adapter V2 achieves open-ended visual instruction following in LLMs by unlocking more parameters, early fusion of visual tokens, and joint training on disjoint parameter groups with only 14M added parameters.
-
Materials Informatics Across the Length Scales
A survey of data-driven methods for materials modeling at nanoscale, mesoscale, and micro-to-continuum scales that identifies established capabilities, data quality issues, and obstacles to cross-scale integration.
Reference graph
Works this paper leans on
-
[1]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
Language models are few-shot learners
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2
work page 2020
-
[4]
End-to-end object detection with transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020. 2
work page 2020
- [5]
-
[6]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[7]
Nus-wide: a real-world web image database from National University of Singapore
Tat-Seng Chua, Jinhui Tang, Richang Hong, Haojie Li, Zhiping Luo, and Yantao Zheng. Nus-wide: a real-world web image database from national university of singapore. In Proceedings of the ACM international conference on image and video retrieval, pages 1–9, 2009. 2
work page 2009
-
[8]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. arXiv preprint arXiv:2302.05442, 2023. 2
-
[9]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 2
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[10]
Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
PAL: Program-aided Language Models
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022. 2
work page Pith review arXiv 2022
-
[12]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-
[13]
Promptcap: Prompt-guided task-aware image captioning
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
-
[14]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023. 2, 3
-
[15]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pages 9118–9147. PMLR, 2022. 2
work page 2022
-
[16]
The mir flickr retrieval evaluation
Mark J Huiskes and Michael S Lew. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM international conference on Multimedia information retrieval, pages 39–43, 2008.
-
[17]
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[18]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[19]
Microsoft coco: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2
work page 2014
-
[20]
Deep learning face attributes in the wild
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015. 2
work page 2015
-
[21]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022. 3
-
[22]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021. 2
work page internal anchor Pith review Pith/arXiv arXiv 2021
- [23]
-
[24]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Vipergpt: Visual inference via python execution for reasoning
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 3
-
[26]
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022. 2
-
[27]
Multimodal few-shot learning with frozen language models
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. arXiv preprint arXiv:2106.13884, 2021. 2
-
[28]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2
work page 2017
-
[29]
Git: A generative image-to-text transformer for vision and language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 2
-
[30]
Language models with image descriptors are strong few-shot video-language learners
Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. In Advances in Neural Information Processing Systems, 2022. 2
-
[31]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. 2, 3, 8
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[32]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
An empirical study of gpt-3 for few-shot knowledge-based vqa
Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022. 2
work page 2022
-
[34]
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[35]
Florence: A new foundation model for computer vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 2
-
[36]
Socratic models: Composing zero-shot multimodal reasoning with language
Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022. 2
-
[37]
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[38]
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image and language. 2022. 6, 30
work page 2022
discussion (0)