MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

Lai Wei; Lichao Sun; Weiran Huang; Xiaozhe Li; Zihao Jiang

arxiv: 2308.12067 · v3 · submitted 2023-08-23 · 💻 cs.LG · cs.AI· cs.CL· cs.CV

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

Lai Wei , Xiaozhe Li , Zihao Jiang , Weiran Huang , Lichao Sun This is my paper

Pith reviewed 2026-05-24 08:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CLcs.CV

keywords multimodal large language modelsinstruction tuningdata selectionalignmentvision-language dataquality metricsMiniGPT-4fine-tuning

0 comments

The pith

Multimodal alignment works better with 200 selected examples than with thousands of unfiltered ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a multimodal large language model fine-tuned on only 200 high-quality vision-language instruction examples can outperform the original MiniGPT-4 model trained on about 3,300 examples. They introduce metrics to evaluate the quality of such data and a trainable selector that automatically chooses the best examples while discarding low-quality ones. This shows that for aligning these models, the quality of instruction data matters more than its quantity. Readers would care because it points to more efficient training processes that require less data collection and computation.

Core claim

MM-LIMA is fine-tuned on a small dataset of only 200 examples, which is approximately 6% of the instruction-following data used for MiniGPT-4, and it outperforms the original model on various evaluations by using proposed quality metrics and a trainable data selector to filter low-quality vision-language data.

What carries the argument

Trainable data selector that identifies high-quality multimodal instruction examples based on several proposed quality metrics for vision-language data.

If this is right

High-quality subsets of instruction data can replace full datasets for effective alignment of multimodal models.
Automatic filtering reduces reliance on large volumes of potentially noisy vision-language instructions.
Less data can lead to better output generation in multimodal large language models.
The method extends the less-is-more principle from text models to multimodal settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar data selection techniques could be tested on other multimodal models to see if they yield comparable gains.
Future work might explore if these metrics generalize across different types of vision-language tasks.
Reducing data needs could lower the barrier for researchers without access to massive datasets.
The selector might be adapted to other modalities or even text-only alignment tasks.

Load-bearing premise

The quality metrics accurately capture what makes multimodal instruction data effective for alignment, and the selector does not overfit or introduce selection bias.

What would settle it

Training MM-LIMA or a similar model on 200 randomly chosen examples without the selector and finding that it matches or exceeds the performance of the selected set would falsify the importance of the quality-based selection.

Figures

Figures reproduced from arXiv: 2308.12067 by Lai Wei, Lichao Sun, Weiran Huang, Xiaozhe Li, Zihao Jiang.

**Figure 1.** Figure 1: Comparison of MME evaluation (InstructionGPT-4 vs. MiniGPT-4). Different from LIMA [14] that requires manually constructed dataset, we aim to propose a robust and effective data selector that automatically identifies and filters low-quality visionlanguage data from existing datasets, ensuring that our model is trained on the most relevant and informative samples. The key focus of our study lies in explor… view at source ↗

**Figure 2.** Figure 2: Overall procedures of the data selector. We first split the vision-language dataset into [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of MMBench evaluation (InstructionGPT-4 vs. MiniGPT-4). [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: GPT-4 Evaluation Comparison (InstructionGPT-4 vs. MiniGPT-4). Given the presence of inherent position bias within LLMs as evaluators, wherein certain positions are favored over others [36], we have undertaken measures to address this concern. To mitigate such bias, we conduct evaluations using both response orders – placing InstructionGPT-4’s generated response before and after MiniGPT-4’s response. To e… view at source ↗

**Figure 5.** Figure 5: Ablation study to investigate the impact of clustering in the testing stage and different types of [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: The left part denotes different multimodal feature sizes. The right part denotes different curated data [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Data selector can filter out low-quality data (e.g., inappropriate grammar and incomplete expressions). [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

read the original abstract

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to access the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. By employing this method, MM-LIMA outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient in enabling multimodal large language models to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MM-LIMA filters to 200 examples via new metrics and a trainable selector and reports gains over MiniGPT-4, but the metrics lack shown correlation with human judgments or held-out checks.

read the letter

The main point is that the authors define quality metrics for vision-language instructions, train a selector on them, and end up with a 200-example subset that they say beats the full MiniGPT-4 alignment set on evaluations. This is roughly 6% of the original data volume. The work takes the LIMA observation from language models and applies it to the multimodal case with a concrete number and released code. That scale reduction is the usable takeaway for people who care about training cost. The trainable selector is a clear step past hand-written filters, and the github link lets others reproduce the pipeline directly. The empirical claim is specific enough to test. The soft spot is validation. The metrics are not shown to track human preference ratings on a separate set, and the selector training procedure is not described as isolated from the evaluation prompts. If the metrics lean on model likelihoods or features that already favor certain response patterns, the selected examples could simply be the ones the base model already handles well. That would make the reported improvement an artifact rather than evidence for the less-is-more claim. The abstract gives no statistical significance numbers or exact metric definitions, so the full paper has to supply those to make the result convincing. Readers working on data-efficient multimodal alignment will get practical value from the method and the 200-example result. The paper is coherent on its own terms and ships reproducible artifacts, so it deserves a serious referee even though the validation sections will need strengthening.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal LLMs can achieve strong alignment with far less instruction data than typically used. Specifically, MM-LIMA is fine-tuned on a 200-example subset (approximately 6% of the data used for MiniGPT-4) obtained by first defining several quality metrics for vision-language instruction pairs and then training a data selector to retain only high-scoring examples; the resulting model is reported to outperform the original MiniGPT-4 across multiple evaluations, supporting the thesis that less but higher-quality data suffices.

Significance. If the empirical claims are substantiated with transparent metric definitions, selector training details, and rigorous controls, the result would meaningfully extend the 'less-is-more' observation from text-only LIMA to the multimodal setting and supply a practical, trainable curation method for instruction data. The availability of code is a positive factor for reproducibility.

major comments (3)

[Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.
[§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.
[§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.

minor comments (2)

[Abstract] The abstract states 'our code is available' but the manuscript does not include a direct link or commit hash in the main text; this should be added for immediate reproducibility.
[§3] Notation for the quality metrics and selector parameters is introduced without a consolidated table of symbols, making it harder to follow the pipeline across sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.

Authors: We agree that the original manuscript did not provide sufficient detail on these aspects. In the revised version, we will add explicit definitions of the quality metrics for vision-language pairs, describe the feature set and training objective for the selector, and include a clear statement confirming that selector training used no downstream evaluation prompts. revision: yes
Referee: [§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.

Authors: We will revise the results section to specify the exact evaluation metrics, confirm identical re-evaluation conditions for the MiniGPT-4 baseline, report the number of runs, and include statistical significance tests to establish the reliability of the gains. revision: yes
Referee: [§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.

Authors: The primary support for the selector's effectiveness is the observed downstream performance gains. The manuscript does not include a separate held-out correlation analysis with human preferences. We will expand the discussion of this point and note it as a direction for future validation while retaining the performance-based evidence. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical filtering and comparison are independent of fitted inputs

full rationale

The paper's chain consists of proposing quality metrics, training a selector on those metrics to curate 200 examples, fine-tuning MiniGPT-4 on the curated set, and reporting superior results on external evaluations. No equations, self-citations, or derivations are shown that reduce the performance claim to a tautology or to parameters fitted directly to the target metric. The result is benchmarked against the original model on held-out tasks, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the validity of the newly proposed quality metrics and the effectiveness of the data selector; these are introduced by the paper rather than taken from prior literature.

axioms (1)

domain assumption The proposed metrics for assessing multimodal instruction data quality are reliable indicators of usefulness for model alignment.
The data selection process rests on these metrics being meaningful; the abstract does not provide independent validation.

invented entities (1)

Trainable data selector no independent evidence
purpose: Automatically identify and filter low-quality vision-language instruction data based on the proposed metrics.
This component is introduced in the paper to enable the 200-example selection.

pith-pipeline@v0.9.0 · 5730 in / 1244 out tokens · 40580 ms · 2026-05-24T08:09:58.574162+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training
cs.LG 2025-10 unverdicted novelty 6.0

SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.
A Survey on Multimodal Large Language Models
cs.CV 2023-06 accept novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 18 internal anchors

[1]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/ , 2023

work page 2023
[6]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021
[7]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Svit: Scaling up visual instruction tuning

Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023

work page arXiv 2023
[11]

Llavar: Enhanced visual instruction tuning for text-rich image understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023. 11

work page arXiv 2023
[12]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi- modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Alpagasus: Training a better alpaca with fewer data.arXiv preprint arXiv:2307.08701,

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023

work page arXiv 2023
[16]

Instruction mining: High-quality instruction data selection for large language models

Yihan Cao, Yanbin Kang, and Lichao Sun. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290, 2023

work page arXiv 2023
[17]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001
[18]

Reward model trained from human feedback

OpenAssistant. Reward model trained from human feedback. https://huggingface.co/ OpenAssistant/reward-model-deberta-v3-large-v2 , 2023

work page 2023
[19]

On spectral clustering: Analysis and an algorithm

Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. 2001

work page 2001
[20]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023

work page arXiv 2023
[23]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[24]

The truth of the f-measure

Yutaka Sasaki et al. The truth of the f-measure. Teach tutor mater, 2007

work page 2007
[25]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021

work page 2021
[26]

Deep learning on a data diet: Finding important examples early in training

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. NeurIPS, 2021

work page 2021
[27]

K-means++ the advantages of careful seeding

David Arthur and Sergei Vassilvitskii. K-means++ the advantages of careful seeding. In SODA, 2007

work page 2007
[28]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 12

work page 2019
[29]

Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

work page arXiv 2021
[30]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022

work page arXiv 2022
[31]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

work page 2019
[32]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021

work page 2021
[33]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019

work page 2019
[34]

Icdar 2019 competition on scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2019 competition on scene text visual question answering. In ICDAR. IEEE, 2019

work page 2019
[35]

Vizwiz: nearly real-time answers to visual questions

Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In UIST, 2010

work page 2010
[36]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[38]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback, 2022. arXiv preprint arXiv:2203.02155, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[39]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 13 Appendix A Implementation Details of InstructionGPT-4 In this section, we first present the design of our GPT-4 prompt for r...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

Wash and slice the fruit. 2. In a large bowl, mix the yogurt, honey, and lime juice. 3. Add the fruit to the bowl and mix everything together until the fruit is evenly coated in the yogurt mixture. 4. Add the raisins to the bowl and stir. 5. Serve the fruit salad chilled. You can adjust the amount of fruit and yogurt to your liking. You can also add other...

work page
[41]

In a large bowl, mix together the sliced bananas, apples, grapes, strawberries, blueberries, and rice. 2. In a separate bowl, mix together the sugar and lemon juice until the sugar is fully dissolved. 3. Pour the sugar mixture over the fruit mixture and toss to coat everything evenly. 4. Serve the fruit salad cold or chilled, garnished with lemon wedges o...

work page

[1] [1]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

GPT-4 Technical Report

OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chat- bot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/ , 2023

work page 2023

[6] [6]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

work page 2021

[7] [7]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 , 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[8] [8]

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] [10]

Svit: Scaling up visual instruction tuning

Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023

work page arXiv 2023

[11] [11]

Llavar: Enhanced visual instruction tuning for text-rich image understanding

Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023. 11

work page arXiv 2023

[12] [12]

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Otter: A Multi-Modal Model with In-Context Instruction Tuning

Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi- modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Alpagasus: Training a better alpaca with fewer data.arXiv preprint arXiv:2307.08701,

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023

work page arXiv 2023

[16] [16]

Instruction mining: High-quality instruction data selection for large language models

Yihan Cao, Yanbin Kang, and Lichao Sun. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290, 2023

work page arXiv 2023

[17] [17]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[18] [18]

Reward model trained from human feedback

OpenAssistant. Reward model trained from human feedback. https://huggingface.co/ OpenAssistant/reward-model-deberta-v3-large-v2 , 2023

work page 2023

[19] [19]

On spectral clustering: Analysis and an algorithm

Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. 2001

work page 2001

[20] [20]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

MMBench: Is Your Multi-modal Model an All-around Player?

Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023

work page arXiv 2023

[23] [23]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[24] [24]

The truth of the f-measure

Yutaka Sasaki et al. The truth of the f-measure. Teach tutor mater, 2007

work page 2007

[25] [25]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021

work page 2021

[26] [26]

Deep learning on a data diet: Finding important examples early in training

Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. NeurIPS, 2021

work page 2021

[27] [27]

K-means++ the advantages of careful seeding

David Arthur and Sergei Vassilvitskii. K-means++ the advantages of careful seeding. In SODA, 2007

work page 2007

[28] [28]

Gqa: A new dataset for real-world visual reasoning and compositional question answering

Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019. 12

work page 2019

[29] [29]

Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

work page arXiv 2021

[30] [30]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022

work page arXiv 2022

[31] [31]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

work page 2019

[32] [32]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021

work page 2021

[33] [33]

Towards vqa models that can read

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019

work page 2019

[34] [34]

Icdar 2019 competition on scene text visual question answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2019 competition on scene text visual question answering. In ICDAR. IEEE, 2019

work page 2019

[35] [35]

Vizwiz: nearly real-time answers to visual questions

Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In UIST, 2010

work page 2010

[36] [36]

Large Language Models are not Fair Evaluators

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [37]

Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[38] [38]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback, 2022. arXiv preprint arXiv:2203.02155, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[39] [39]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 13 Appendix A Implementation Details of InstructionGPT-4 In this section, we first present the design of our GPT-4 prompt for r...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

Wash and slice the fruit. 2. In a large bowl, mix the yogurt, honey, and lime juice. 3. Add the fruit to the bowl and mix everything together until the fruit is evenly coated in the yogurt mixture. 4. Add the raisins to the bowl and stir. 5. Serve the fruit salad chilled. You can adjust the amount of fruit and yogurt to your liking. You can also add other...

work page

[41] [41]

In a large bowl, mix together the sliced bananas, apples, grapes, strawberries, blueberries, and rice. 2. In a separate bowl, mix together the sugar and lemon juice until the sugar is fully dissolved. 3. Pour the sugar mixture over the fruit mixture and toss to coat everything evenly. 4. Serve the fruit salad cold or chilled, garnished with lemon wedges o...

work page