MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets
Pith reviewed 2026-05-24 08:09 UTC · model grok-4.3
The pith
Multimodal alignment works better with 200 selected examples than with thousands of unfiltered ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MM-LIMA is fine-tuned on only 200 examples, approximately 6% of the instruction-following data used to align MiniGPT-4, yet it outperforms the original model on various evaluations. The subset is chosen with several proposed quality metrics and a trainable data selector that filters out low-quality vision-language data.
What carries the argument
A trainable data selector that scores multimodal instruction examples on several proposed quality metrics for vision-language data and keeps only the high-quality ones.
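The selection idea can be illustrated with a minimal sketch. It is not the paper's selector: the linear scorer, the squared-error objective, the feature layout, and the function names `train_linear_selector` and `select_top_k` are all assumptions made here for illustration. Each example is assumed to come with a short vector of metric scores scaled to roughly [0, 1], and a small labeled set is assumed to provide a target quality value per example.

```python
def train_linear_selector(features, targets, lr=0.3, epochs=2000):
    """Fit a linear scorer (weights, bias) by gradient descent on squared error.

    Hypothetical stand-in for a trainable selector. Assumes every feature is
    scaled to roughly [0, 1]; larger feature scales need a smaller lr.
    """
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            # Prediction error of the current scorer on this example.
            err = sum(w * xi for w, xi in zip(weights, x)) + bias - y
            for j in range(dim):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def select_top_k(features, weights, bias, k):
    """Return the indices of the k examples with the highest predicted quality."""
    scores = [sum(w * xi for w, xi in zip(weights, x)) + bias for x in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

With a trained scorer, `select_top_k(features, weights, bias, 200)` would return the indices of a 200-example subset. A real selector would likely use richer features and a nonlinear model; the sketch only shows the fit-then-rank structure.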
If this is right
- High-quality subsets of instruction data can replace full datasets for effective alignment of multimodal models.
- Automatic filtering reduces reliance on large volumes of potentially noisy vision-language instructions.
- A smaller but higher-quality dataset can lead to better outputs from multimodal large language models.
- The method extends the less-is-more principle from text models to multimodal settings.
Where Pith is reading between the lines
- Similar data selection techniques could be tested on other multimodal models to see if they yield comparable gains.
- Future work might explore whether these metrics generalize across different types of vision-language tasks.
- Reducing data needs could lower the barrier for researchers without access to massive datasets.
- The selector might be adapted to other modalities or even text-only alignment tasks.
Load-bearing premise
The quality metrics accurately capture what makes multimodal instruction data effective for alignment, and the selector does not overfit or introduce selection bias.
What would settle it
Training MM-LIMA or a similar model on 200 randomly chosen examples without the selector and finding that it matches or exceeds the performance of the selected set would falsify the importance of the quality-based selection.
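A minimal sketch of how this control could be set up, assuming a pool addressed by integer index and one fine-tuning run per random seed. The helper names `random_subsets` and `compare_to_selected` are hypothetical, and the evaluation scores are inputs the experimenter would have to supply; nothing here reproduces the paper's results.

```python
import random
import statistics


def random_subsets(pool_size, k, seeds):
    """Draw one size-k random index subset per seed (the no-selector control)."""
    return {
        seed: sorted(random.Random(seed).sample(range(pool_size), k))
        for seed in seeds
    }


def compare_to_selected(selected_score, random_scores):
    """Summarize how the selected-subset score compares with the random runs."""
    mean = statistics.mean(random_scores)
    return {
        "random_mean": mean,
        "random_sd": statistics.stdev(random_scores),  # needs at least 2 runs
        "gap": selected_score - mean,
        "selected_beats_all": all(selected_score > s for s in random_scores),
    }
```

Using several seeds matters because a single random draw of 200 examples can be lucky or unlucky; the spread across seeds shows how much of any gap is attributable to selection rather than sampling noise.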
Original abstract
Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to assess the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. By employing this method, MM-LIMA outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient in enabling multimodal large language models to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multimodal LLMs can achieve strong alignment with far less instruction data than typically used. Specifically, MM-LIMA is fine-tuned on a 200-example subset (approximately 6% of the data used for MiniGPT-4) obtained by first defining several quality metrics for vision-language instruction pairs and then training a data selector to retain only high-scoring examples; the resulting model is reported to outperform the original MiniGPT-4 across multiple evaluations, supporting the thesis that less but higher-quality data suffices.
Significance. If the empirical claims are substantiated with transparent metric definitions, selector training details, and rigorous controls, the result would meaningfully extend the 'less-is-more' observation from text-only LIMA to the multimodal setting and supply a practical, trainable curation method for instruction data. The availability of code is a positive factor for reproducibility.
major comments (3)
- [Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.
- [§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.
- [§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.
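The validation requested in the last major comment could be approximated with a rank correlation between selector scores and human ratings on a held-out set. The sketch below is one possible check, not something the manuscript describes: Spearman's coefficient is chosen here as an assumption, and the two input lists (one selector score and one human rating per held-out example) are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)  # undefined if either input is constant
```

A coefficient near zero on held-out examples would support the referee's concern that the selector is not tracking human-perceived quality.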
minor comments (2)
- [Abstract] The abstract states 'our code is available' but the manuscript does not include a direct link or commit hash in the main text; this should be added for immediate reproducibility.
- [§3] Notation for the quality metrics and selector parameters is introduced without a consolidated table of symbols, making it harder to follow the pipeline across sections.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.
- Referee: [Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.
Authors: We agree that the original manuscript did not provide sufficient detail on these aspects. In the revised version, we will add explicit definitions of the quality metrics for vision-language pairs, describe the feature set and training objective for the selector, and include a clear statement confirming that selector training used no downstream evaluation prompts. revision: yes
- Referee: [§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.
Authors: We will revise the results section to specify the exact evaluation metrics, confirm identical re-evaluation conditions for the MiniGPT-4 baseline, report the number of runs, and include statistical significance tests to establish the reliability of the gains. revision: yes
- Referee: [§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.
Authors: The primary support for the selector's effectiveness is the observed downstream performance gains. The manuscript does not include a separate held-out correlation analysis with human preferences. We will expand the discussion of this point and note it as a direction for future validation while retaining the performance-based evidence. revision: partial
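For the significance tests promised in the second response, one simple option is an exact paired sign-flip permutation test on score differences (MM-LIMA minus MiniGPT-4), with one difference per benchmark or per matched run. This is a sketch of one possible test chosen here as an assumption, not the analysis the authors say they will use, and the function name `sign_flip_p_value` is hypothetical.

```python
from itertools import product


def sign_flip_p_value(differences):
    """Exact two-sided paired sign-flip test on the summed difference.

    Under the null hypothesis that neither model is better, each paired
    difference is equally likely to have either sign. All 2**n sign patterns
    are enumerated, so this is only practical for small n (roughly n <= 20).
    """
    observed = abs(sum(differences))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(differences)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, differences)))
        # Small tolerance so floating-point ties count as "at least as extreme".
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total
```

With only a handful of paired scores the smallest attainable p-value is 2 / 2**n, which is itself a reminder that a few benchmark wins in a small-data regime cannot establish reliability on their own.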
Circularity Check
No circularity; empirical filtering and comparison are independent of fitted inputs
The paper's chain consists of proposing quality metrics, training a selector on those metrics to curate 200 examples, fine-tuning MiniGPT-4 on the curated set, and reporting superior results on external evaluations. No equations, self-citations, or derivations are shown that reduce the performance claim to a tautology or to parameters fitted directly to the target metric. The result is benchmarked against the original model on held-out tasks, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The proposed metrics for assessing multimodal instruction data quality are reliable indicators of usefulness for model alignment.
invented entities (1)
- Trainable data selector: no independent evidence
Forward citations
Cited by 2 Pith papers
- SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training
  SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.
- A Survey on Multimodal Large Language Models
  This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.