Recognition: no theorem link
Multimodal Chain-of-Thought Reasoning in Language Models
Pith reviewed 2026-05-12 18:08 UTC · model grok-4.3
The pith
Multimodal-CoT separates rationale generation from answer inference to enable state-of-the-art reasoning in small language models using both text and images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By separating the generation of multimodal rationales from the subsequent answer inference step, language models can leverage richer information from images and text to produce more accurate reasoning chains, resulting in state-of-the-art performance on ScienceQA with models under 1 billion parameters.
What carries the argument
The two-stage framework separating rationale generation (using multimodal inputs) from answer inference.
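A minimal sketch of that two-stage flow, assuming hypothetical `rationale_model` and `answer_model` callables standing in for the paper's two separately fine-tuned multimodal encoder-decoders; the prompt format below is illustrative, not the paper's exact template:

```python
# Sketch of the two-stage Multimodal-CoT inference flow.
# `rationale_model` and `answer_model` are hypothetical stand-ins for two
# fine-tuned multimodal encoder-decoder models (text + vision features in,
# text out).

from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Example:
    question: str
    context: str
    options: Sequence[str]
    image_features: object  # e.g. patch embeddings from a frozen ViT encoder

def multimodal_cot(
    example: Example,
    rationale_model: Callable[[str, object], str],
    answer_model: Callable[[str, object], str],
) -> tuple[str, str]:
    """Stage 1 generates a rationale from text + vision; stage 2 infers the
    answer conditioned on that rationale plus the same multimodal inputs."""
    # Stage 1: rationale generation from the multimodal input.
    stage1_input = (
        f"Question: {example.question}\nContext: {example.context}\n"
        f"Options: {', '.join(example.options)}"
    )
    rationale = rationale_model(stage1_input, example.image_features)

    # Stage 2: answer inference, with the generated rationale appended to
    # the language input so the decoder can condition on it.
    stage2_input = stage1_input + f"\nRationale: {rationale}"
    answer = answer_model(stage2_input, example.image_features)
    return rationale, answer
```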
If this is right
- Answer inference benefits from higher-quality rationales informed by both modalities.
- The method mitigates hallucination in the generated rationales.
- Training convergence is accelerated compared to standard approaches.
- Strong results transfer to the A-OKVQA benchmark as well.
Where Pith is reading between the lines
- Similar separation of reasoning steps could improve performance on other vision-language tasks not tested here.
- Smaller models might close the gap with larger ones across more benchmarks if this two-stage pattern is adopted broadly.
- Explicit stage separation may serve as a general technique to enhance chain-of-thought reliability in multimodal settings.
Load-bearing premise
That separating rationale generation from answer inference will reliably produce higher-quality multimodal rationales without introducing new error modes or requiring task-specific tuning that offsets the gains.
What would settle it
A controlled experiment in which the two-stage Multimodal-CoT fails to outperform an equivalent single-stage multimodal prompting baseline on ScienceQA accuracy, under matched training and prompt conditions, would falsify the load-bearing premise.
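One way such a head-to-head could be scored (a sketch, not the paper's protocol): evaluate both systems on the same ScienceQA test questions and apply McNemar's exact test to the discordant pairs. All names below are illustrative.

```python
# Sketch: paired accuracy comparison of a two-stage system against a
# single-stage baseline on the same test questions (McNemar's exact test).
from math import comb

def mcnemar_exact(two_stage_correct: list, single_stage_correct: list) -> float:
    """Two-sided exact p-value for paired per-question correctness."""
    # Discordant pairs: questions where exactly one system is correct.
    b = sum(t and not s for t, s in zip(two_stage_correct, single_stage_correct))
    c = sum(s and not t for t, s in zip(two_stage_correct, single_stage_correct))
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    # Under H0 (no accuracy difference), discordant outcomes ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Usage: each list holds one bool per test question, aligned by question.
# p = mcnemar_exact(two_stage_correct, single_stage_correct)
```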
read the original abstract
Large language models (LLMs) have shown impressive performance on complex reasoning by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains as the rationale to infer the answer. However, existing CoT studies have primarily focused on the language modality. We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference. In this way, answer inference can leverage better generated rationales that are based on multimodal information. Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach. With Multimodal-CoT, our model under 1 billion parameters achieves state-of-the-art performance on the ScienceQA benchmark. Our analysis indicates that Multimodal-CoT offers the advantages of mitigating hallucination and enhancing convergence speed. Code is publicly available at https://github.com/amazon-science/mm-cot.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multimodal-CoT, a two-stage framework for chain-of-thought reasoning in language models that incorporates both text and image modalities. The first stage generates multimodal rationales, and the second stage performs answer inference using those rationales. Experiments on the ScienceQA and A-OKVQA benchmarks demonstrate performance gains, with a model under 1 billion parameters achieving state-of-the-art results on ScienceQA; additional analysis claims reduced hallucination and faster convergence. The code is released publicly.
Significance. If the empirical results hold, the work shows that separating rationale generation from answer inference can improve multimodal reasoning performance even for sub-billion-parameter models, extending CoT techniques beyond language-only settings. Public code availability aids reproducibility and enables further exploration of hallucination mitigation in vision-language tasks.
major comments (1)
- [§4 (Experiments)] The central claim that the two-stage separation produces higher-quality multimodal rationales rests on overall benchmark gains, but without full ablation tables isolating the contribution of rationale generation versus joint multimodal prompting (or single-stage baselines), it is difficult to rule out that gains arise from other factors such as prompt engineering or training details.
minor comments (2)
- [Abstract / §1] The abstract and §1 could more precisely quantify the SOTA margin on ScienceQA (e.g., absolute accuracy delta versus prior best) and specify the exact model architecture and parameter count used.
- [Figure 2] Figure 2 and associated text would benefit from clearer labeling of the two stages and explicit comparison of rationale quality metrics (e.g., human or automatic evaluation of rationale faithfulness).
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work and the constructive comment on the experimental section. We address the major comment below.
read point-by-point responses
- Referee: [§4 (Experiments)] The central claim that the two-stage separation produces higher-quality multimodal rationales rests on overall benchmark gains, but without full ablation tables isolating the contribution of rationale generation versus joint multimodal prompting (or single-stage baselines), it is difficult to rule out that gains arise from other factors such as prompt engineering or training details.
Authors: We appreciate this point and agree that stronger isolation of the two-stage design would further substantiate the central claim. The current manuscript includes ablation studies in Section 4.3 that compare the full two-stage Multimodal-CoT against single-stage multimodal baselines (direct answer inference without separate rationale generation) as well as language-only CoT variants. These results show consistent gains from the two-stage separation on ScienceQA, including lower hallucination rates. To more rigorously rule out confounds from prompt engineering or training details, however, we will add expanded ablation tables in the revision, including controlled comparisons of rationale generation versus joint multimodal prompting under matched training and prompt conditions.
revision: yes
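A matched-condition harness of the kind described in the response might look like the sketch below; `build_prompt`, `run_model`, and the model handles are assumptions for illustration, not the authors' code.

```python
# Sketch of a matched-condition ablation: the single-stage and two-stage
# variants share the same prompt builder, model runner, and examples, so
# any accuracy gap is attributable to the stage separation itself.

def evaluate(examples, mode, build_prompt, run_model,
             rationale_model=None, answer_model=None):
    """mode='single' answers directly; mode='two_stage' first generates a
    rationale and appends it to the otherwise identical prompt."""
    correct = 0
    for ex in examples:
        prompt = build_prompt(ex)  # identical base prompt in both modes
        if mode == "two_stage":
            rationale = run_model(rationale_model, prompt, ex.image_features)
            prompt = prompt + "\nRationale: " + rationale
        pred = run_model(answer_model, prompt, ex.image_features)
        correct += int(pred == ex.answer)
    return correct / len(examples)
```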
Circularity Check
No significant circularity detected
full rationale
The paper proposes Multimodal-CoT as an empirical two-stage prompting framework that separates rationale generation from answer inference to incorporate both text and image modalities. All central claims rest on reported experimental results on the held-out ScienceQA and A-OKVQA benchmarks rather than any mathematical derivation, fitted parameter, or self-referential definition. No equations, uniqueness theorems, or ansatzes are invoked that could reduce the method to its own inputs by construction; the separation of stages is presented as a design choice whose value is measured externally. The work is therefore self-contained as an engineering technique with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 28 Pith papers
- V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning · V-ABS is an action-observer beam search method with entropy-based adaptive weighting and an 80k-sample SFT dataset that delivers 19.7% average gains on visual reasoning tasks for MLLMs.
- Hybrid Latent Reasoning with Decoupled Policy Optimization · HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
- GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces · GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
- Video-R1: Reinforcing Video Reasoning in MLLMs · Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
- Visual Instruction Tuning · LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
- Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models · Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
- Can MLLMs Reason About Visual Persuasion? Evaluating the Efficacy and Faithfulness of Reasoning · Diverse teacher-generated rationales improve MLLM visual persuasiveness prediction via supervised fine-tuning, while a new three-dimensional faithfulness framework shows that prediction accuracy alone does not ensure ...
- Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning · RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, an...
- State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading · MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
- Meta-CoT: Enhancing Granularity and Generalization in Image Editing · Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
- Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models · Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...
- V-tableR1: Process-Supervised Multimodal Table Reasoning with Critic-Guided Policy Optimization · V-tableR1 uses a critic VLM for dense step-level feedback and a new PGPO algorithm to shift multimodal table reasoning from pattern matching to verifiable logical steps, achieving SOTA accuracy with a 4B open-source model.
- Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation · A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.
- From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning · EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.
- AITP: Traffic Accident Responsibility Allocation via Multimodal Large Language Models · AITP is a new multimodal large language model that uses multimodal chain-of-thought and retrieval-augmented generation of legal knowledge to achieve state-of-the-art results on traffic accident responsibility allocati...
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning · MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization · MAPO improves multimodal chain-of-thought reasoning by requiring explicit textual descriptions of visual tool results and using a novel advantage estimator that combines semantic alignment with task rewards.
- Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models · Position and step penalty plus visual reasoning guidance fix premature answering and weak visual grounding in diffusion MLLMs, delivering up to 7.5% accuracy gains and over 3x speedup.
- CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society · CAMEL proposes a role-playing framework with inception prompting that enables autonomous multi-agent cooperation among LLMs and generates conversational data for studying their behaviors.
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action · MM-REACT uses textual prompts to let ChatGPT collaborate with external vision experts for zero-shot multimodal reasoning and action on advanced visual tasks.
- Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning · A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.
- DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA · DIAGRAMS introduces a schema-driven annotation tool that proposes reasoning-level evidence regions for Diagram QA pairs and reports 85.39% precision and 75.30% recall against human final selections on six datasets.
- Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding · A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
- Towards Robust Endogenous Reasoning: Unifying Drift Adaptation in Non-Stationary Tuning · CPO++ adapts reinforcement fine-tuning of MLLMs to endogenous multi-modal concept drift through counterfactual reasoning and preference optimization, yielding better coherence and cross-domain robustness in safety-cri...
- A Progressive Training Strategy for Vision-Language Models to Counteract Spatio-Temporal Hallucinations in Embodied Reasoning · A progressive training framework using spatiotemporal chain-of-thought data reduces the forward-backward temporal query performance gap in VLMs from over 70% to 6.53%.
- Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models · Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering · MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
- Decompose, Look, and Reason: Reinforced Latent Reasoning for VLMs · DLR is a new reinforced latent reasoning method for VLMs that decomposes queries, uses continuous visual latents, and outperforms text-only and multimodal CoT baselines on vision-centric benchmarks with better interpr...
Reference graph
Works this paper leans on
- [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. CVPR 2018, pp. 6077–6086. doi: 10.1109/CVPR.2018.00636.
- [2] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv:2308.01390, 2023.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS 2020.
- [4] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ECCV 2020, pp. 213–229.
- [5] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. NeurIPS 2020.
- [6] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv:2211.12588, 2022.
- [7] Zhenfang Chen, Qinhong Zhou, Yikang Shen, Yining Hong, Hao Zhang, and Chuang Gan. See, think, confirm: Interactive prompting between vision and language models for knowledge-based visual reasoning. arXiv:2301.05226, 2023.
- [8] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with Pathways. arXiv:2204.02311, 2022.
- [9] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv:2210.11416, 2022.
- [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR 2021. Cited in the paper as Dosovitskiy et al. (2021a).
- [11] Duplicate bibliography entry for the same ViT work as [10]; cited in the paper as Dosovitskiy et al. (2021b).
- [12] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. arXiv:2210.00720, 2022.
- [13] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra- and inter-modality attention flow for visual question answering. CVPR 2019, pp. 6639–6648. doi: 10.1109/CVPR.2019.00680.
- [14] Yaru Hao, Haoyu Song, Li Dong, Shaohan Huang, Zewen Chi, Wenhui Wang, Shuming Ma, and Furu Wei. Language models are general-purpose interfaces. arXiv:2206.06336, 2022.
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
- [16] Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. arXiv:2212.10071, 2022.
- [17] Jie Huang and Kevin Chen-Chuan Chang. Towards reasoning in large language models: A survey. arXiv:2212.10403, 2022.
- [18] Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: Crossing format boundaries with a single QA system. Findings of EMNLP 2020, pp. 1896–1907. doi: 10.18653/v1/2020.findings-emnlp.171.
- [19] Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. arXiv:2210.02406, 2022.
- [20] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. NeurIPS 2018.
- [21] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. ICML 2021, PMLR 139, pp. 5583–5594.
- [22] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. arXiv:2205.11916, 2022.
- [23] Bei Li, Chuanhao Lv, Zefan Zhou, Tao Zhou, Tong Xiao, Anxiang Ma, and JingBo Zhu. On vision features in multimodal machine translation. ACL 2022, pp. 6327–6337.
- [24] Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. On the advance of making language models better reasoners. arXiv:2206.02336, 2022; Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv:2304.08485, 2023.
- [25] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. NeurIPS 35, pp. 2507–2521, 2022; Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, et al. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. arXiv:2209.14610, 2022.
- [26] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. arXiv:2212.08410, 2022.
- [27] OpenAI. GPT-4V(ision) system card, 2023.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. ICML 2021, PMLR 139.
- [29] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training Gopher. arXiv:2112.11446, 2021.
- [30] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
- [31] Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. NAACL 2022, pp. 2655–2671. doi: 10.18653/v1/2022.naacl-main.191.
- [32] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. ECCV 2022.
- [33] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 2023. https://crfm.stanford.edu/2023/03/13/alpaca.html
- [34] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. arXiv:2201.08239, 2022.
- [35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS 2017.
- [36] Boshi Wang, Xiang Deng, and Huan Sun. Iteratively prompt pre-trained language models for chain of thought. EMNLP 2022, pp. 2714–2730; Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, et al. Self-consistency improves chain of thought reasoning in language models. arXiv:2203.11171, 2022.
- [37] Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. ICML 2023, PMLR, pp. 39755–39769.
- [38] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. CVPR 2019, pp. 6281–6290. doi: 10.1109/CVPR.2019.00644.
- [39] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. arXiv:2311.16502, 2023.
- [40] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv:2303.16199, 2023; Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv:2309.01219, 2023.
- [41] Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, and Hai Zhao. Universal multimodal representation for language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–18, 2023. doi: 10.1109/TPAMI.2023.3234170; Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv:2210.03493, 2022.
- [42] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. Least-to-most prompting enables complex reasoning in large language models. arXiv:2205.10625, 2022.
The remaining scraped entries ([43]–[49]) are fragments of the paper's appendix, not cited works. The recoverable content:
- Additional examples of the baseline model generating hallucinated rationales (extending the Section 3.2 case studies), with qualitative comparisons of the two-stage framework with and without vision features.
- ScienceQA statistics: 21k multimodal multiple-choice questions across 3 subjects, 26 topics, 127 categories, and 379 skills, split into 12k training, 4k validation, and 4k test questions.
- Implementation details: vision features from a frozen ViT-large encoder (Dosovitskiy et al., 2021b); image captions generated by InstructBLIP and appended to the context following Lu et al. (2022a); experiments run on 8 NVIDIA Tesla V100 32G GPUs.
- A variant that uses large (vision-)language models for zero-shot inference to generate pseudo-rationales as training targets instead of human-annotated reasoning chains.
- An error analysis of 50 incorrectly answered samples: 80% commonsense mistakes, 14% logical mistakes, 6% others.