Pith · machine review for the scientific record

arxiv: 2407.04973 · v1 · submitted 2024-07-06 · 💻 cs.AI · cs.CL · cs.CV · cs.LG

Recognition: 2 theorem links · Lean Theorem

LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 02:42 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.CV · cs.LG
keywords LogicVista · multimodal LLMs · logical reasoning · visual contexts · benchmark · evaluation · reasoning capabilities · MLLM

The pith

LogicVista provides a benchmark of 448 visual questions to evaluate logical reasoning in multimodal LLMs across five tasks and nine capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogicVista to address the lack of systematic tests for logical reasoning in multimodal large language models when images are present. It covers five tasks that together test nine capabilities through 448 multiple-choice questions, each paired with a correct answer and human-written reasoning steps. This setup allows evaluation in both multiple-choice and open-ended formats. A sympathetic reader would care because logical reasoning combined with visuals underpins activities such as navigation and puzzle-solving, skills that current model assessments largely ignore. The authors test eight existing MLLMs on the benchmark and release the questions and annotations.

Core claim

LogicVista assesses the integrated logical reasoning capabilities of MLLMs in visual contexts across 5 logical reasoning tasks encompassing 9 different capabilities using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation.
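As a concrete illustration of the two evaluation modes, here is a minimal sketch of scoring one LogicVista-style item. The item schema, field names, and matching rules are assumptions for illustration, not the paper's actual pipeline (which pairs each question with a human-written rationale and may use a stronger judge for open-ended answers):

```python
from dataclasses import dataclass

@dataclass
class Item:
    # Hypothetical schema for one benchmark question; the real
    # LogicVista annotation format may differ.
    question: str
    choices: dict[str, str]  # option label -> option text
    answer: str              # gold option label, e.g. "C"
    reasoning: str           # human-written rationale

def score_multiple_choice(item: Item, model_choice: str) -> bool:
    # Multiple-choice mode: exact match on the chosen label.
    return model_choice.strip().upper() == item.answer

def score_open_ended(item: Item, model_answer: str) -> bool:
    # Open-ended mode, crudely approximated: does the free-form
    # answer mention the gold option's text? A real pipeline would
    # compare against the reasoning annotation more carefully.
    return item.choices[item.answer].lower() in model_answer.lower()

item = Item(
    question="Which shape completes the sequence?",
    choices={"A": "circle", "B": "square", "C": "triangle"},
    answer="C",
    reasoning="Each figure gains one side, so a triangle follows.",
)
print(score_multiple_choice(item, " c "))
print(score_open_ended(item, "The pattern continues with a triangle."))
```

The open-ended check here is a deliberately naive substring match; it stands in for whatever rationale-based grading the annotations enable.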

What carries the argument

The LogicVista benchmark, a set of 448 image-based multiple-choice questions with human reasoning annotations that test logical cognition in visual settings.

Load-bearing premise

The 448 questions and their human-written reasoning annotations accurately and comprehensively capture general logical cognition abilities in visual contexts without significant selection bias or coverage gaps.
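That premise is partially checkable by auditing the per-capability distribution of the 448 items. A minimal sketch, assuming each question carries a single capability tag; the tags, counts, and 5% threshold below are invented:

```python
from collections import Counter

def coverage_report(tags: list[str], n_capabilities: int = 9,
                    min_share: float = 0.05) -> dict[str, float]:
    # Tally each capability's share of the benchmark and flag
    # missing or under-represented capabilities.
    total = len(tags)
    shares = {cap: n / total for cap, n in Counter(tags).items()}
    if len(shares) < n_capabilities:
        print(f"gap: only {len(shares)}/{n_capabilities} capabilities present")
    for cap, share in sorted(shares.items()):
        if share < min_share:
            print(f"under-represented: {cap} ({share:.1%})")
    return shares

# Toy tags standing in for the benchmark's 448 annotated questions.
tags = ["deductive"] * 200 + ["inductive"] * 200 + ["spatial"] * 48
shares = coverage_report(tags)
```

A tally like this detects gaps and imbalance, though not selection bias within a capability, which needs the sampling-strategy detail the referee asks for below.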

What would settle it

A demonstration that high-scoring models on LogicVista fail at comparable logical tasks with new images or real-world visual scenarios would show the benchmark does not measure general visual-logic abilities.

read the original abstract

We propose LogicVista, an evaluation benchmark that assesses the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in Visual contexts. Recent advancements in MLLMs have demonstrated various fascinating abilities, from crafting poetry based on an image to performing mathematical reasoning. However, there is still a lack of systematic evaluation of MLLMs' proficiency in logical reasoning tasks, which are essential for activities like navigation and puzzle-solving. Thus we evaluate general logical cognition abilities across 5 logical reasoning tasks encompassing 9 different capabilities, using a sample of 448 multiple-choice questions. Each question is annotated with the correct answer and the human-written reasoning behind the selection, enabling both open-ended and multiple-choice evaluation. A total of 8 MLLMs are comprehensively evaluated using LogicVista. Code and Data Available at https://github.com/Yijia-Xiao/LogicVista.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LogicVista, a new benchmark for assessing the integrated logical reasoning capabilities of multimodal large language models (MLLMs) in visual contexts. It consists of 448 multiple-choice questions spanning 5 logical reasoning tasks that together cover 9 distinct capabilities. Each question includes the correct answer and human-written reasoning annotations to support both multiple-choice and open-ended evaluation. The authors evaluate 8 MLLMs on the benchmark and release the code and data.

Significance. If the questions are shown to be representative and free of major selection bias, LogicVista would fill a clear gap by providing a systematic visual-context benchmark for logical reasoning, an area where current MLLM evaluations remain limited. The release of code, data, and human reasoning annotations is a clear strength that supports reproducibility and further research.

major comments (2)
  1. [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.
  2. [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.
minor comments (2)
  1. [Abstract and §3] The abstract states 'a sample of 448 multiple-choice questions' but does not clarify whether this is the full benchmark size or a subset; this should be stated explicitly in the main text.
  2. [Figures and Tables] Figure captions and table headers should explicitly list the 5 tasks and 9 capabilities to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on LogicVista. We agree that additional methodological details are needed to support the claims of comprehensive coverage and annotation reliability. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Benchmark Construction] The central claim that the 448 questions comprehensively cover the 9 capabilities across 5 tasks without significant selection bias or gaps is not supported by sufficient methodological detail. The manuscript provides no quantitative breakdown (e.g., number of questions per capability or task), sampling strategy, visual diversity metrics, or validation against external logical-reasoning taxonomies in the benchmark-construction section.

    Authors: We acknowledge that the current manuscript lacks these details in the benchmark-construction section. In the revised version, we will expand this section to include: a table reporting the exact number of questions per task and per capability (totaling 448), a description of the sampling strategy (stratified selection to ensure balanced coverage of the 9 capabilities without over-representation), quantitative visual diversity metrics (e.g., distribution across image sources, types, and complexity levels), and explicit mapping/validation against established logical-reasoning taxonomies from cognitive science to demonstrate coverage and minimize gaps or bias. revision: yes

  2. Referee: [Annotation Process] The human-written reasoning annotations are presented as enabling reliable open-ended evaluation, yet no inter-annotator agreement statistics or validation procedure for these annotations are reported, which is load-bearing for claims about the benchmark's utility beyond multiple-choice accuracy.

    Authors: We agree that inter-annotator agreement statistics and validation details are necessary to substantiate the reliability of the human-written reasoning annotations for open-ended evaluation. In the revision, we will add a dedicated subsection describing the annotation process (including annotator qualifications and guidelines), report inter-annotator agreement metrics (e.g., Fleiss' kappa across reasoning steps), and outline the validation procedure (e.g., review rounds for consistency and accuracy). This will directly support the benchmark's utility claims. revision: yes
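The agreement statistic the rebuttal proposes can be computed without external dependencies. A minimal sketch of Fleiss' kappa with invented rating data; the categories and rater counts are assumptions, not the paper's annotation setup:

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    # counts[i][j] = number of raters assigning item i to category j;
    # every item must be rated by the same number of raters.
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_total = n_items * n_raters

    # Observed agreement: fraction of agreeing rater pairs per item,
    # averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the marginal category frequencies.
    n_cats = len(counts[0])
    p_e = sum(
        (sum(row[j] for row in counts) / n_total) ** 2
        for j in range(n_cats)
    )
    return (p_bar - p_e) / (1 - p_e)

# Three annotators label four items with one of two categories.
ratings = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```

Kappa near 1 indicates near-perfect agreement; values near 0 indicate chance-level agreement, which would undercut the open-ended evaluation claim.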

Circularity Check

0 steps flagged

No circularity: direct benchmark construction and evaluation with no derivations or self-referential steps

full rationale

The paper introduces LogicVista as a new benchmark dataset of 448 multiple-choice questions with human annotations, then directly evaluates 8 MLLMs on it across 5 tasks and 9 capabilities. No equations, parameter fitting, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The central claim rests on the explicit creation and application of the dataset rather than any reduction of results to prior fitted values or self-defined constructs. This is a standard empirical benchmark paper with no mathematical derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the chosen tasks and questions validly measure logical reasoning abilities; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The five logical reasoning tasks and nine capabilities adequately represent general logical cognition in visual contexts.
    Invoked in the abstract when defining the benchmark scope without further justification or validation details.

pith-pipeline@v0.9.0 · 5455 in / 1068 out tokens · 48400 ms · 2026-05-16T02:42:19.597729+00:00 · methodology

discussion (0)


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

    cs.CV 2026-05 unverdicted novelty 7.0

    RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

  2. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  3. DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    cs.CV 2025-05 unverdicted novelty 7.0

    DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

  4. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  5. Anisotropic Modality Align

    cs.MM 2026-05 unverdicted novelty 6.0

    Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

  6. Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

    cs.LG 2026-05 unverdicted novelty 6.0

    Uni-OPD unifies on-policy distillation across LLMs and MLLMs with dual-perspective strategies that promote student exploration and enforce order-consistent teacher supervision based on outcome rewards.

  7. Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

  8. Evian: Towards Explainable Visual Instruction-tuning Data Auditing

    cs.CV 2026-04 unverdicted novelty 6.0

    EVian decomposes vision-language model responses into three cognitive components and audits them along consistency, coherence, and accuracy axes, showing that a small curated subset outperforms much larger training sets.

  9. Visually-Guided Policy Optimization for Multimodal Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    VGPO introduces visual attention compensation and dual-grained advantage re-weighting to reinforce visual focus in VLMs, yielding better activation and performance on multimodal reasoning tasks.

  10. Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    PGPO uses KL divergence to quantify token visual dependency and reshapes advantages in RLVR to amplify signals for visually grounded tokens, yielding 18.7% average gains on seven benchmarks.

  11. MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

    cs.LG 2026-02 unverdicted novelty 6.0

    MapTab benchmark shows current MLLMs struggle with multi-criteria multimodal route planning and that combining vision and language frequently underperforms single-modality approaches.

  12. Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

    cs.CV 2026-02 unverdicted novelty 6.0

    ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.

  13. DeepEyesV2: Toward Agentic Multimodal Model

    cs.CV 2025-11 unverdicted novelty 6.0

    DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.

  14. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  15. InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    cs.CV 2025-04 conditional novelty 6.0

    InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

  16. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL 2024-11 conditional novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  17. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    cs.CV 2026-04 unverdicted novelty 5.0

    GLM-5V-Turbo integrates multimodal perception as a core part of reasoning and execution for agentic tasks, reporting strong results in visual tool use and multimodal coding while keeping text-only performance competitive.

  18. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  19. Seed1.8 Model Card: Towards Generalized Real-World Agency

    cs.AI 2026-03 unverdicted novelty 5.0

    Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.

  20. MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

    cs.LG 2025-09 unverdicted novelty 5.0

    An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.

  21. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    GLM-5V-Turbo integrates multimodal perception directly into reasoning, planning, tool use, and execution for agents, yielding strong results in multimodal coding and framework-based tasks while keeping text coding com...

  22. GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    GLM-5V-Turbo integrates multimodal perception directly into reasoning and agent workflows, reporting strong results on visual tool use, multimodal coding, and framework-based agent tasks while keeping text coding competitive.

  23. Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

    cs.CL 2026-04 unverdicted novelty 4.0

    A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 21 Pith papers

  1. [1]

    OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

  2. [2]

    Flamingo: a visual language model for few-shot learning, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski,...

  3. [4]

    Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023

  4. [5]

    A survey on multimodal large language models, 2023

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models, 2023

  5. [6]

    Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023

  6. [7]

    Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering, 2023

  7. [8]

    Vqa: Visual question answering, 2015

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015

  8. [10]

    Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023

    Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023

  9. [11]

    Logical reasoning necessary to make line graphs, 1989

    Michael J. Wavering. Logical reasoning necessary to make line graphs. Journal of Research in Science Teaching, 26(5):373–379, May 1989

  10. [12]

    Early developments in logical reasoning: Considering alternative possibilities, 1988

    Catherine Sophian and Susan C. Somerville. Early developments in logical reasoning: Considering alternative possibilities. Cognitive Development, 3(2):183–222, 1988

  11. [13]

    Logical reasoning in formal and everyday reasoning tasks - international journal of science and mathematics education, Dec 2019

    Hugo Bronkhorst, Gerrit Roorda, Cor Suhre, and Martin Goedhart. Logical reasoning in formal and everyday reasoning tasks - international journal of science and mathematics education, Dec 2019

  12. [14]

    Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024

  13. [15]

    Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR) , 2017

  14. [16]

    Microsoft COCO: Common Objects in Context, 2014

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context, page 740–755. Springer International Publishing, 2014

  15. [17]

    Textcaps: a dataset for image captioning with reading comprehension, 2020

    Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension, 2020

  16. [18]

    Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024

    Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, and Nanyun Peng. Contextual: Evaluating context-sensitive text-rich visual reasoning in large multimodal models, 2024

  17. [19]

    Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023

    Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, and Ludwig Schmidt. Visit-bench: A benchmark for vision-language instruction following inspired by real-world use, 2023

  18. [20]

    Microsoft coco captions: Data collection and evaluation server, 2015

    Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server, 2015

  19. [21]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017

  20. [22]

    Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, 2019

  21. [23]

    Uniter: Universal image-text representation learning, 2020

    Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning, 2020

  22. [24]

    Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020

    Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. Oscar: Object-semantics aligned pre-training for vision-language tasks, 2020

  23. [25]

    Vilt: Vision-and-language transformer without convolution or region supervision, 2021

    Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision-and-language transformer without convolution or region supervision, 2021

  24. [26]

    Simvlm: Simple visual language model pretraining with weak supervision, 2022

    Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision, 2022

  25. [27]

    Git: A generative image-to-text transformer for vision and language, 2022

    Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language, 2022

  26. [28]

    Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022

    Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Faisal Ahmed, Zicheng Liu, Yumao Lu, and Lijuan Wang. Unitab: Unifying text and box outputs for grounded vision-language modeling, 2022

  27. [29]

    Vision-language pre-training: Basics, recent advances, and future trends, 2022

    Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, and Jianfeng Gao. Vision-language pre-training: Basics, recent advances, and future trends, 2022

  28. [30]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  29. [31]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  30. [32]

    Llama: Open and efficient foundation language models, 2023

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023

  31. [33]

    Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models, 2021

  32. [34]

    Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodied ...

  33. [35]

    Opt: Open pre-trained transformer language models, 2022

    Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. Opt: Open pre-trained transformer language models, 2022

  34. [36]

    Instruction tuning with gpt-4, 2023

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4, 2023

  35. [37]

    Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo: An open-source framework for training large autoregressive vision-language models, 2023

  36. [38]

    Visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

  37. [39]

    Otter: A multi-modal model with in-context instruction tuning, 2023

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning, 2023

  38. [40]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

  39. [41]

    Multimodal-gpt: A vision and language model for dialogue with humans, 2023

    Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans, 2023

  40. [42]

    mplug-owl: Modularization empowers large language models with multimodality, 2023

    Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, Chenliang Li, Yuanhong Xu, Hehong Chen, Junfeng Tian, Qian Qi, Ji Zhang, and Fei Huang. mplug-owl: Modularization empowers large language models with multimodality, 2023

  41. [43]

    Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023

  42. [44]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023

  43. [45]

    Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023

    Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, and Mike Zheng Shou. Assistgpt: A general multi-modal assistant that can plan, execute, inspect, and learn, 2023

  44. [46]

    nocaps: novel object captioning at scale

    Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) . IEEE, October 2019

  45. [47]

    Towards vqa models that can read, 2019

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019

  46. [48]

    Tap: Text-aware pre-training for text-vqa and text-caption, 2020

    Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption, 2020

  47. [49]

    From recognition to cognition: Visual commonsense reasoning, 2019

    Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning, 2019

  48. [50]

    Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge, 2019

  49. [51]

    Mmbench: Is your multi-modal model an all-around player?, 2023

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023

  50. [52]

    Can large language models be an alternative to human evaluations?, 2023

    Cheng-Han Chiang and Hung yi Lee. Can large language models be an alternative to human evaluations?, 2023

  51. [53]

    G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023

  52. [54]

    Gptscore: Evaluate as you desire, 2023

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. Gptscore: Evaluate as you desire, 2023

  53. [55]

    Mm-soc: Benchmarking multimodal large language models in social media platforms

    Yiqiao Jin, Minje Choi, Gaurav Verma, Jindong Wang, and Srijan Kumar. Mm-soc: Benchmarking multimodal large language models in social media platforms. In ACL, 2024

  54. [56]

    Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities

    Mina Lee, Percy Liang, and Qian Yang. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In CHI Conference on Human Factors in Computing Systems, CHI ’22. ACM, April 2022

  55. [57]

    Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023

    Shuyin Ouyang, Jie M. Zhang, Mark Harman, and Meng Wang. Llm is like a box of chocolates: the non-determinism of chatgpt in code generation, 2023

  56. [58]

    Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

    Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024

  57. [59]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023

  58. [60]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023

    Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding, 2023