Pith · machine review for the scientific record

arXiv: 2311.07575 · v1 · submitted 2023-11-13 · 💻 cs.CV · cs.AI · cs.CL · cs.LG

Recognition: 2 Lean theorem links

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:58 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL · cs.LG
keywords multi-modal large language model · joint mixing · weight mixing · visual instruction tuning · multi-task learning · visual embeddings · high-resolution image understanding

The pith

Mixing weights from LLMs trained on real-world and synthetic data, together with varied tasks and visual embeddings, produces a single versatile multi-modal model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds SPHINX by unfreezing the language model during pre-training and directly combining its weights with those of an LLM trained on synthetic data. It then mixes multiple visual instruction tasks such as question answering, region understanding, caption grounding, document layout detection, and pose estimation, each with tailored instructions to prevent interference. Visual features are pulled from several network architectures and pre-training methods at different granularities. These three mixing steps together yield stronger alignment and broader capabilities than single-source approaches. An additional high-resolution strategy that mixes scales and sub-images further improves fine-grained visual reasoning on existing benchmarks.
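
To make the tailored-instruction idea concrete, here is a minimal sketch of per-task instruction templates; the task keys, template wording, and helper function are invented for illustration and are not SPHINX's actual prompts.

```python
# Hypothetical per-task instruction templates. The wording and keys are
# illustrative, not SPHINX's actual prompts; the idea is that a distinct
# instruction conditions the model on each task so heterogeneous data can
# be trained jointly without the tasks interfering.
TASK_INSTRUCTIONS = {
    "vqa":       "Answer the question about the image: {question}",
    "region":    "Describe the object inside the region {bbox}.",
    "grounding": "Locate the region that matches this caption: {caption}",
    "layout":    "Detect the layout elements of this document image.",
    "pose":      "Estimate the 2D keypoints of each person in the image.",
}

def build_sample(task: str, image, **fields) -> dict:
    """Attach the task-specific instruction to one training sample."""
    prompt = TASK_INSTRUCTIONS[task].format(**fields)
    return {"image": image, "prompt": prompt, "task": task}
```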

Core claim

By directly integrating weights from LLMs trained on real-world and synthetic data, jointly tuning on a curated set of visual instruction tasks with conflict-avoiding instructions, and extracting embeddings from multiple architectures and granularities, SPHINX attains superior multi-modal understanding across a wide range of applications, while an auxiliary mixing of image scales enables strong high-resolution parsing.

What carries the argument

Joint mixing of model weights, tuning tasks, and visual embeddings, which directly combines parameters, instructions, and features from different sources to build one model.
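
As a minimal sketch of what "directly combines parameters" could mean, assuming the mix is a simple parameter-wise interpolation between two checkpoints of the same architecture (the simulated rebuttal below describes it as a direct parameter-wise average); the function name, coefficient beta, and checkpoint paths are illustrative, not details from the paper.

```python
import torch

def mix_llm_weights(real_state: dict, synth_state: dict, beta: float = 0.5) -> dict:
    """Parameter-wise interpolation of two LLM checkpoints.

    `real_state` / `synth_state` are state_dicts of identically shaped
    models trained on real-world vs. synthetic data; beta = 0.5 gives a
    plain average. Both the coefficient and the paths in the usage note
    are assumptions, not values from the paper.
    """
    mixed = {}
    for name, w_real in real_state.items():
        w_synth = synth_state[name]
        assert w_real.shape == w_synth.shape, f"shape mismatch at {name}"
        mixed[name] = beta * w_real + (1.0 - beta) * w_synth
    return mixed

# usage sketch:
# real = torch.load("llm_real.pt"); synth = torch.load("llm_synth.pt")
# model.load_state_dict(mix_llm_weights(real, synth))
```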

If this is right

  • Unfreezing the LLM plus weight mixing produces stronger vision-language alignment than frozen baselines.
  • Task-specific instructions allow simultaneous training on region-level understanding, pose estimation, and document tasks without mutual degradation.
  • Diverse visual embeddings from multiple networks and pre-training regimes supply more robust image representations to the language model (see the sketch after this list).
  • Mixing image scales and high-resolution sub-images yields improved fine-grained appearance capture on existing evaluation sets.
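
The third bullet can be made concrete with a hedged sketch: features from several frozen vision encoders are concatenated channel-wise and projected into the LLM's token space. The encoder mix and the single linear projection are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class MixedVisualEmbedder(nn.Module):
    """Sketch of mixing visual embeddings from heterogeneous encoders.

    Assumes each encoder maps an image batch to (B, N, C_i) token features
    with a shared token count N; the encoder choices and the projection
    are illustrative, not SPHINX's exact design.
    """
    def __init__(self, encoders: list[nn.Module], dims: list[int], llm_dim: int):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)    # e.g. a ViT, a ConvNet, a self-supervised model
        self.proj = nn.Linear(sum(dims), llm_dim)  # fuse channel-wise, then project

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = []
        for enc in self.encoders:
            with torch.no_grad():          # encoders stay frozen in this sketch
                feats.append(enc(images))  # (B, N, C_i)
        fused = torch.cat(feats, dim=-1)   # (B, N, sum(C_i))
        return self.proj(fused)            # (B, N, llm_dim) visual tokens for the LLM
```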

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same mixing principle could be tested on language-only or audio-visual models to check whether parameter-level integration generalizes beyond vision-language pairs.
  • If weight mixing succeeds here, it raises the possibility that separate large-scale pre-training runs on different data distributions can be combined post hoc rather than retrained from scratch.
  • Future variants might add a third weight source or additional task categories to probe the limits of conflict-free mixing.

Load-bearing premise

Directly integrating weights from LLMs trained on real-world and synthetic data will incorporate diverse semantics with favorable robustness, without conflicts or performance loss.

What would settle it

If the weight-mixed model scores lower than either the real-world-only or synthetic-only LLM on standard vision-language benchmarks, the mixing step would be shown to introduce net conflicts rather than gains.

Original abstract

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and design task-specific instructions to avoid inter-task conflict. In addition to the basic visual question answering, we include more challenging tasks such as region-level understanding, caption grounding, document layout detection, and human pose estimation, contributing to mutual enhancement over different scenarios. Additionally, we propose to extract comprehensive visual embeddings from various network architectures, pre-training paradigms, and information granularity, providing language models with more robust image representations. Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications. On top of this, we further propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images. With a mixing of different scales and high-resolution sub-images, SPHINX attains exceptional visual parsing and reasoning performance on existing evaluation benchmarks. We hope our work may cast a light on the exploration of joint mixing in future MLLM research. Code is released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SPHINX, a multi-modal large language model that performs joint mixing of LLM weights (unfreezing and integrating parameters from real-world and synthetic data LLMs), a variety of visual instruction tuning tasks (including VQA, region-level understanding, caption grounding, document layout detection, and pose estimation) with task-specific instructions, and visual embeddings extracted from diverse network architectures, pre-training paradigms, and granularities. It further proposes an efficient high-resolution strategy mixing scales and sub-images, claiming superior multi-modal understanding and visual parsing on benchmarks.

Significance. If the empirical gains hold after proper controls, the joint mixing framework could offer a practical route to versatile MLLMs that combine robustness and diversity without separate expert modules; the release of code is a positive contribution for reproducibility.

major comments (3)
  1. [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.
  2. [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.
  3. [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.
minor comments (2)
  1. [visual embeddings section] Notation for the visual embedding extraction (various architectures and granularities) is introduced without a compact diagram or table summarizing the sources and dimensions, which would aid clarity.
  2. [abstract and experiments] The abstract states 'superior multi-modal understanding capabilities on a wide range of applications' but the manuscript should explicitly list the exact benchmarks and metrics used for each claim rather than referring generically to 'existing evaluation benchmarks'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed each major comment and provide point-by-point responses below, including clarifications and planned revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [weight mix strategy section] The central claim that 'directly integrating the weights from two domains' efficiently incorporates diverse semantics with favorable robustness (abstract and weight-mix description) is load-bearing for the superiority argument, yet the manuscript provides no explicit definition of the mixing operator (simple average, task-vector addition, or learned gate), no analysis of activation/gradient conflicts, and no ablation isolating the weight-mix step from task and embedding mixing.

    Authors: We agree that greater precision is needed here. The weight mixing is implemented as a direct parameter-wise average between the real-world and synthetic-data LLMs after a short alignment phase, as introduced in Section 3.2. We acknowledge the absence of an explicit formula, conflict analysis, and isolating ablation. In the revised manuscript we will add the mathematical definition of the operator, a short discussion of activation/gradient behavior, and a new ablation that holds task mixing and embedding mixing fixed while toggling only the weight-mix step. These additions will better substantiate the contribution of this component. revision: yes

  2. Referee: [joint visual instruction tuning section] The assertion that task-specific instructions alone suffice to avoid inter-task conflict during joint visual instruction tuning is presented without quantitative evidence of interference (e.g., performance drop when mixing all tasks vs. sequential) or comparison to standard multi-task baselines; this undermines the 'mutual enhancement' claim across scenarios such as region-level understanding and pose estimation.

    Authors: We appreciate this point. Task-specific instructions are used to condition the model on each task during joint training, as described in Section 4. However, we did not report a direct joint-versus-sequential comparison or a standard multi-task baseline. In the revision we will include new experiments that measure performance when all tasks are trained jointly with the proposed instructions versus sequential training and versus a vanilla multi-task baseline without task-specific prompts. These results will quantify interference (or its absence) and support the mutual-enhancement claim for tasks such as region-level understanding and pose estimation. revision: yes

  3. Referee: [high-resolution strategy section] The high-resolution strategy of mixing different scales and sub-images is claimed to attain exceptional visual parsing, but the manuscript lacks a controlled comparison showing that the gains exceed those from simply increasing input resolution or using standard multi-scale patching, making it unclear whether the mixing itself is the decisive factor.

    Authors: We thank the referee for this observation. Our high-resolution approach mixes multi-scale inputs with selected high-resolution sub-images to balance detail and efficiency. We recognize that a controlled comparison against simply raising resolution or using conventional multi-scale patching is missing. In the revised version we will add such experiments, reporting performance when using our mixing strategy versus equivalent higher-resolution inputs and versus standard multi-scale feature extraction, thereby isolating the benefit of the proposed mixing procedure. revision: yes
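
As a concrete reading of the high-resolution strategy discussed above, the sketch below pairs one downsampled global view with a grid of full-resolution sub-images, all resized to the encoder's input size. The grid layout and sizes are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def mix_scales_and_subimages(image: torch.Tensor, base: int = 224, grid: int = 2) -> torch.Tensor:
    """Sketch of the high-resolution strategy: one low-resolution global
    view plus a grid of high-resolution sub-images.

    `image` is a (C, H, W) tensor; `base` and `grid` are illustrative
    settings, not the paper's.
    """
    # global view: the whole image downsampled to the encoder's base size
    views = [F.interpolate(image.unsqueeze(0), size=(base, base),
                           mode="bilinear", align_corners=False).squeeze(0)]

    # sub-images: tile the high-resolution input into grid x grid crops,
    # each resized to the same base size the encoder expects
    _, H, W = image.shape
    h, w = H // grid, W // grid  # remainder pixels are dropped in this sketch
    for i in range(grid):
        for j in range(grid):
            crop = image[:, i * h:(i + 1) * h, j * w:(j + 1) * w]
            views.append(F.interpolate(crop.unsqueeze(0), size=(base, base),
                                       mode="bilinear", align_corners=False).squeeze(0))
    return torch.stack(views)  # (1 + grid * grid, C, base, base)
```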

Circularity Check

0 steps flagged

No circularity: empirical mixing of existing components with no derivations or self-referential claims

Full rationale

The paper describes SPHINX as an empirical construction: unfreezing the LLM and directly integrating weights from real-world and synthetic LLMs, mixing diverse tasks with task-specific instructions, and extracting visual embeddings from multiple architectures. No equations, derivations, predictions, or uniqueness theorems appear in the provided text. Claims of superior performance rest on experimental benchmarks rather than any reduction of outputs to fitted inputs or self-citations by construction. The approach is self-contained as a practical combination of prior components without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the empirical effectiveness of weight interpolation between differently trained LLMs and the assumption that task-specific instructions prevent interference; no explicit free parameters or invented entities are named in the abstract.

axioms (2)
  • domain assumption: Weight mixing between LLMs trained on real-world and synthetic data produces a model with diverse semantics and robustness.
    Invoked in the first paragraph of the abstract as the basis for stronger vision-language alignment.
  • domain assumption: Task-specific instructions prevent inter-task conflict during joint visual instruction tuning.
    Stated when describing the mixing of tasks including region-level understanding and pose estimation.

pith-pipeline@v0.9.0 · 5647 in / 1384 out tokens · 41557 ms · 2026-05-17T02:58:20.956133+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI

    cs.CL 2023-11 unverdicted novelty 8.0

    MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.

  2. Aligned Multi-View Scripts for Universal Chart-to-Code Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Introduces an aligned multi-language dataset and a language-conditioned low-rank adapter for generating executable plotting code in Python, R, and LaTeX from chart images.

  3. AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

  4. VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

    cs.IR 2024-10 conditional novelty 7.0

    VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

  5. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    cs.CV 2024-07 unverdicted novelty 7.0

    LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving ...

  6. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

    cs.CV 2024-03 conditional novelty 7.0

    MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.

  7. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  8. Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Foveated Reasoner integrates foveation as stateful actions inside the autoregressive decoding loop of vision-language models, trained via cold-start supervision then reinforcement learning to achieve higher accuracy a...

  9. SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    cs.LG 2025-06 unverdicted novelty 6.0

    SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

  10. SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.

  11. MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

    cs.CV 2024-03 unverdicted novelty 6.0

    MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

  12. TempCompass: Do Video LLMs Really Understand Videos?

    cs.CV 2024-03 unverdicted novelty 6.0

    TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

  13. CogVLM: Visual Expert for Pretrained Language Models

    cs.CV 2023-11 conditional novelty 6.0

    CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.

  14. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    cs.CV 2023-06 unverdicted novelty 6.0

    MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.

  15. Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...

  16. Hallucination of Multimodal Large Language Models: A Survey

    cs.CV 2024-04 accept novelty 5.0

    The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.

  17. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

    cs.CV 2024-04 unverdicted novelty 4.0

    InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.

  18. A Survey on Multimodal Large Language Models

    cs.CV 2023-06 accept novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
