InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Pith reviewed 2026-05-17 05:24 UTC · model grok-4.3
The pith
InternLM-XComposer2 generates custom interleaved text-image content by applying LoRA parameters only to image tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
InternLM-XComposer2 demonstrates that applying additional LoRA parameters exclusively to image tokens yields a model capable of high-quality free-form text-image composition and comprehension. The model outperforms existing multimodal systems, matches or surpasses GPT-4V and Gemini Pro on certain benchmarks, and preserves the pre-trained language knowledge of its backbone.
What carries the argument
Partial LoRA (PLoRA), which applies LoRA parameters exclusively to image tokens, balancing precise vision understanding with literary-quality text composition.
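A minimal numerical sketch of this mechanism, assuming a single linear projection; the function name, shapes, rank, and scaling below are illustrative, not taken from the paper:

```python
import numpy as np

def plora_forward(x, W, A, B, image_mask, alpha=2.0, rank=2):
    """Hypothetical PLoRA sketch: the low-rank update B @ A is applied
    only at image-token positions, so text tokens see exactly the
    frozen base projection W."""
    base = x @ W.T                           # frozen base weights, every token
    lora = (x @ A.T) @ B.T * (alpha / rank)  # rank-r LoRA update
    return np.where(image_mask[:, None], base + lora, base)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, hidden size 8
W = rng.normal(size=(8, 8))
A = rng.normal(size=(2, 8))                  # rank r = 2
B = np.zeros((8, 2))                         # B starts at zero (standard LoRA init)
mask = np.array([True, True, False, False])  # first two tokens are image tokens

y_before = plora_forward(x, W, A, B, mask)
B = rng.normal(size=(8, 2))                  # stand-in for a trained B
y_after = plora_forward(x, W, A, B, mask)

# text-token rows stay bit-identical to the base model; image rows shift
assert np.allclose(y_after[2:], y_before[2:])
assert not np.allclose(y_after[:2], y_before[:2])
```

The asserts make the design intent concrete: because the adapter is gated by the image mask, pure-text positions pass through the frozen weights untouched, which is the architectural basis for the preservation claim.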
If this is right
- The model can produce long, interleaved multimodal documents from outlines or reference images.
- Vision-language understanding reaches or exceeds GPT-4V and Gemini Pro levels on selected evaluations.
- High-quality content creation becomes possible without full fine-tuning of the language backbone.
- The same PLoRA pattern may extend to other base language models of similar size.
Where Pith is reading between the lines
- Selective tuning of vision components could reduce the risk of language degradation seen in full multimodal fine-tuning.
- This separation of adaptation might allow smaller teams to build capable multimodal systems on top of existing open language models.
- Testing PLoRA on tasks that require very long context or creative writing would clarify how far the preserved language skill extends.
Load-bearing premise
Adding LoRA parameters only to image tokens preserves the original language model's knowledge while still enabling strong vision understanding and text-image generation.
What would settle it
A measurable drop on pure language-only benchmarks after PLoRA training would show that language knowledge was not preserved; parity with the base model would support the load-bearing premise.
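That settling test reduces to a per-benchmark score comparison between the base model and the PLoRA-adapted model. A sketch, with placeholder numbers that are not results from the paper:

```python
def preservation_check(base_scores, adapted_scores, tolerance=1.0):
    """Flag language-only benchmarks whose score dropped by more than
    `tolerance` points after PLoRA training -- evidence against the
    preservation premise. Tolerance is a placeholder threshold."""
    return {
        name: round(base_scores[name] - adapted_scores[name], 2)
        for name in base_scores
        if base_scores[name] - adapted_scores[name] > tolerance
    }

# hypothetical numbers, not reported by the paper
base = {"MMLU": 65.8, "GSM8K": 70.8, "TriviaQA": 58.1}
after_plora = {"MMLU": 65.5, "GSM8K": 69.2, "TriviaQA": 58.0}

print(preservation_check(base, after_plora))  # {'GSM8K': 1.6}
```

Any non-empty result would be the "measurable drop" described above; an empty dict across a broad benchmark suite would support preservation.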
read the original abstract
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents InternLM-XComposer2, a 7B-parameter vision-language model built on InternLM2 that introduces Partial LoRA (PLoRA) to apply additional LoRA parameters exclusively to image tokens. This design is claimed to preserve the base model's pre-trained language knowledge while enabling high-quality free-form interleaved text-image generation and comprehension from inputs such as outlines, textual specifications, and reference images. The manuscript reports that the model significantly outperforms prior multimodal systems and matches or exceeds GPT-4V and Gemini Pro on selected vision-language benchmarks, with the model weights publicly released.
Significance. If the central performance claims and the PLoRA preservation hypothesis are substantiated, the work would be significant for providing a lightweight, modular route to extend strong language models into multimodal composition tasks without full fine-tuning. The public release of the 7B model series would further enable reproducible research on controllable text-image generation.
major comments (1)
- [§3.2, Abstract] The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because cross-attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.
minor comments (1)
- [Abstract, experiments] The superiority claims reference various benchmarks but provide no details on data splits, evaluation protocols, statistical significance, or exact metric definitions, making it difficult to assess the strength of the reported gains over baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and describe the changes planned for the revised manuscript.
read point-by-point responses
- Referee: [§3.2, Abstract] The central design claim that PLoRA (LoRA applied only to image tokens) preserves InternLM2-7B's pre-trained language knowledge while adding vision capabilities is asserted without supporting ablation evidence. No results are shown for language-only benchmarks (e.g., MMLU, GSM8K) before versus after PLoRA, nor any direct comparison of PLoRA versus standard LoRA applied to all tokens. Because cross-attention layers still mix modalities, this isolation assumption is not guaranteed by architecture alone and is load-bearing for the claimed balance between vision understanding and literary text composition.
Authors: We appreciate the referee highlighting the need for stronger empirical support. The PLoRA design applies LoRA updates exclusively to image tokens while keeping base InternLM2 weights frozen for text tokens, which is intended to limit interference with pre-trained language abilities. We agree that direct ablations would strengthen the manuscript. In the revision we will add language-only benchmark results (MMLU, GSM8K) comparing the original InternLM2-7B to the PLoRA-adapted model to quantify preservation. We will also include a side-by-side comparison of PLoRA versus standard LoRA applied to all tokens, showing advantages for text composition quality. Regarding modality mixing through attention layers, although cross-modal interactions exist, the position-specific LoRA application ensures that core language parameters and the modeling head for pure text sequences remain unchanged, which is consistent with the observed high-quality long-text generation performance.
revision: yes
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper introduces InternLM-XComposer2 with a Partial LoRA (PLoRA) mechanism applied selectively to image tokens on top of InternLM2-7B. Central claims of superior free-form text-image composition and comprehension are supported by reported experimental results on various benchmarks, including direct comparisons to GPT-4V and Gemini Pro. No equations, self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the reported outcomes to the paper's own inputs by construction appear in the provided text. The approach is presented as an architectural proposal with empirical validation against independent external references.
Axiom & Free-Parameter Ledger
free parameters (1)
- Partial LoRA rank and scaling
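For scale, a back-of-the-envelope sketch of what that free parameter costs, using the standard LoRA parameter count (the hidden size and rank below are illustrative, not figures from the paper):

```python
def lora_extra_params(d_in, d_out, rank):
    """Parameters added by one rank-r LoRA adapter on a d_out x d_in
    weight matrix: A is (r, d_in) and B is (d_out, r)."""
    return rank * (d_in + d_out)

# hypothetical 7B-scale projection (hidden size 4096) at rank 256;
# the scaling factor alpha/r adds no parameters, only a rescale
print(lora_extra_params(4096, 4096, 256))  # 2097152 parameters per adapted matrix
```

The rank trades adapter capacity against parameter count linearly, while the scaling factor only rescales the update, which is why the ledger lists both as free parameters.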
Forward citations
Cited by 18 Pith papers
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
MMMU provides 11.5K heterogeneous college-level multimodal questions that current models solve at 56-59% accuracy, establishing a new standard for expert multimodal evaluation.
- MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models
MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.
- CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
- We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- Towards Design Compositing
GIST is a training-free identity-preserving image compositor that improves visual harmony when integrating disparate elements into design pipelines.
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Sa2VA unifies SAM-2 segmentation with MLLM reasoning into a single model for referring segmentation and conversation on images and videos, supported by a new 72k-expression Ref-SAV dataset.
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
- BLINK: Multimodal Large Language Models Can See but Not Perceive
BLINK benchmark shows multimodal LLMs reach only 45-51 percent accuracy on core visual perception tasks where humans achieve 95 percent, indicating these abilities have not emerged.
- Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
- MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
- Less Detail, Better Answers: Degradation-Driven Prompting for VQA
Degradation-Driven Prompting improves VQA by intentionally reducing image detail and using masks, lines, and examples to guide models toward essential structures.
- Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
DeepSeek-VL develops open-source 1.3B and 7B vision-language models that achieve competitive or state-of-the-art results on real-world visual-language benchmarks through diverse data curation, a hybrid vision encoder,...
Reference graph
Works this paper leans on
- [1] Nocaps: Novel object captioning at scale. Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8948–8957.
- [2] Flamingo: a visual language model for few-shot learning. Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bink...
- [3] MathQA: Towards interpretable math word problem solving with operation-based formalisms. Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. arXiv preprint arXiv:1905.13319, 2019.
- [4] VQA: Visual question answering. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. In International Conference on Computer Vision (ICCV), 2015.
- [5] OpenFlamingo: An open-source framework for training large autoregressive vision-language models. Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. arXiv, 2023.
- [6] Qwen-VL: A frontier large vision-language model with versatile abilities. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. arXiv, 2023.
- [7] Baichuan 2: Open large-scale language models. Baichuan. arXiv, 2023.
- [8] Language models are few-shot learners. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020.
- [9] MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. arXiv preprint arXiv:2310.09478, 2023.
- [10] Shikra: Unleashing multimodal LLM's referential dialogue magic. Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. arXiv, 2023.
- [11] ShareGPT4V: Improving large multi-modal models with better captions. Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. arXiv preprint arXiv:2311.12793, 2023.
- [12] PaLI-X: On scaling up a multilingual vision and language model. Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias ... 2023.
- [13] Microsoft COCO Captions: Data collection and evaluation server. Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015.
- [14] PaLI-3 vision language models: Smaller, faster, stronger. Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, and Radu Soricut. 2023.
- [15] PaLI: A jointly-scaled multilingual language-image model. Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, ... 2023.
- [16] Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. March 2023.
- [17] PaLM: Scaling language modeling with pathways. Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. arXiv, 2022.
- [18] OpenCompass: A universal evaluation platform for foundation models. OpenCompass Contributors. https://github.com/open-compass/opencompass.
- [19] Qwen-VL-Plus. Qwen Contributors. https://huggingface.co/spaces/Qwen/Qwen-VL-Plus, 2023.
- [20] XTuner: A toolkit for efficiently fine-tuning LLM. XTuner Contributors. https://github.com/InternLM/xtuner, 2023.
- [21] InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi.
- [22] BERT: Pre-training of deep bidirectional transformers for language understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. arXiv, 2018.
- [23] DreamLLM: Synergistic multimodal comprehension and creation. Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, Xiangwen Kong, Xiangyu Zhang, Kaisheng Ma, and Li Yi. arXiv preprint arXiv:2309.11499.
- [24] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence. Palm-e: An embodie... 2023.
- [25] GLM: General language model pretraining with autoregressive blank infilling. Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, 2022.
- [26] EVA: Exploring the limits of masked visual representation learning at scale. Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19358–19369, 2023.
- [27] MME: A comprehensive evaluation benchmark for multimodal large language models. Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, and Rongrong Ji. arXiv preprint arXiv:2306.13394, 2023.
- [28] A challenger to GPT-4V? Early explorations of Gemini in visual expertise. Chaoyou Fu, Renrui Zhang, Zihan Wang, Yubo Huang, Zhengye Zhang, Longtian Qiu, Gaoxiang Ye, Yunhang Shen, Mengdan Zhang, Peixian Chen, Sirui Zhao, Shaohui Lin, Deqiang Jiang, Di Yin, Peng Gao, Ke Li, Hongsheng Li, and Xing Sun. arXiv preprint arXiv:2312.12436, 2023.
- [29] LLaMA-Adapter V2: Parameter-efficient visual instruction model. Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, W. Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Jiao Qiao. arXiv, abs/2304.15010, 2023.
- [30] Planting a seed of vision in large language model. Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan.
- [31] HallusionBench: An advanced diagnostic suite for entangled language hallucination & visual illusion in large vision-language models. Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. 2023.
- [32] WanJuan: A comprehensive multimodal dataset for advancing English and Chinese large models. Conghui He, Zhenjiang Jin, Chaoxi Xu, Jiantao Qiu, Bin Wang, Wei Li, Hang Yan, Jiaqi Wang, and Da Lin. arXiv, abs/2308.10755.
- [33] LoRA: Low-rank adaptation of large language models. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. In International Conference on Learning Representations, 2022.
- [34] OPERA: Alleviating hallucination in multimodal large language models via over-trust penalty and retrospection-allocation. Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, and Nenghai Yu. arXiv preprint arXiv:2311.17911.
- [35] GQA: A new dataset for real-world visual reasoning and compositional question answering. Drew A Hudson and Christopher D Manning. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- [36] Scaling up visual and vision-language representation learning with noisy text supervision. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. In Proceedings of the International Conference on Machine Learning (ICML), pages 4904–4916. PMLR, 2021.
- [37] Mistral 7B. Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023.
- [38] Mixtral of experts. Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. arXiv preprint arXiv:2401.04088, 2024.
- [39] DVQA: Understanding data visualizations via question answering. Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656.
- [40] A diagram is worth a dozen images. Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. In Computer Vision – ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, pages 235–251. Springer, 2016.
- [41]
- [42] Seed-Bench: Benchmarking multimodal LLMs with generative comprehension. Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023.
- [43] Otter: A multi-modal model with in-context instruction tuning. Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. arXiv, 2023.
- [44] BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. In Proceedings of the International Conference on Machine Learning (ICML), pages 12888–12900. PMLR, 2022.
- [45] Grounded language-image pre-training. Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
- [46] Evaluating object hallucination in large vision-language models. Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023.
- [47] Monkey: Image resolution and text label are important things for large multi-modal models. Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. arXiv preprint arXiv:2311.06607, 2023.
- [48] MMC: Advancing multimodal chart understanding with large-scale instruction tuning. Fuxiao Liu, Xiaoyang Wang, Wenlin Yao, Jianshu Chen, Kaiqiang Song, Sangwoo Cho, Yaser Yacoob, and Dong Yu. arXiv preprint arXiv:2311.10774, 2023.
- [49] Improved baselines with visual instruction tuning. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. arXiv preprint arXiv:2310.03744, 2023.
- [50] Visual instruction tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. arXiv, 2023.
- [51] Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. arXiv, 2023.
- [52] MMBench: Is your multi-modal model an all-around player? Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. arXiv:2307.06281, 2023.
- [53] MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. In International Conference on Learning Representations (ICLR), 2024.
- [54] Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu. In The 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021.
- [55] Learn to explain: Multimodal reasoning via thought chains for science question answering. Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
- [56] OK-VQA: A visual question answering benchmark requiring external knowledge. Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3195–3204, 2019.
- [57] ChartQA: A benchmark for question answering about charts with visual and logical reasoning. Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. arXiv preprint arXiv:2203.10244, 2022.
- [58] ChatGPT. OpenAI. https://openai.com/blog/chatgpt, 2022.
- [59]
- [60] Im2Text: Describing images using 1 million captioned photographs. Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. In Neural Information Processing Systems (NIPS), 2011.
-
[61] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35:27730–27744, 2022. 2
[62] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023. 3
[63] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv.org, 2023.
[64] Zhangyang Qi, Ye Fang, Zeyi Sun, Xiaoyang Wu, Tong Wu, Jiaqi Wang, Dahua Lin, and Hengshuang Zhao. GPT4Point: A unified framework for point-language understanding and generation, 2023. 3
[65] Zhangyang Qi, Ye Fang, Mengchen Zhang, Zeyi Sun, Tong Wu, Ziwei Liu, Dahua Lin, Jiaqi Wang, and Hengshuang Zhao. Gemini vs GPT-4V: A preliminary comparison and combination of vision-language models through qualitative cases, 2023. 3
[66] Qwen. Introducing Qwen-7B: Open foundation and human-aligned models (of the state-of-the-arts), 2023. 2
[67] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021. 3
[68] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018. 2
[69] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research (JMLR), 21(1):5485–5551, 2020. 2
[70] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021. 4
[71] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022. 4
[72] Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. KVQA: Knowledge-aware visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019. 4
[73] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 4
[74] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: A dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020. 4
[75] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. 2023. 3
[76] Zeyi Sun, Ye Fang, Tong Wu, Pan Zhang, Yuhang Zang, Shu Kong, Yuanjun Xiong, Dahua Lin, and Jiaqi Wang. Alpha-CLIP: A CLIP model focusing on wherever you want. arXiv preprint arXiv:2312.03818, 2023. 3
[77] Gemini Team. Gemini: A family of highly capable multimodal models, 2023. 1, 2
[78] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023. 1, 2, 4
[79] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv.org, 2023. 2
[80] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models, 2023.