
arxiv: 2308.12067 · v3 · submitted 2023-08-23 · 💻 cs.LG · cs.AI · cs.CL · cs.CV

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

Pith reviewed 2026-05-24 08:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.CV
keywords multimodal large language models · instruction tuning · data selection · alignment · vision-language data · quality metrics · MiniGPT-4 · fine-tuning

The pith

Multimodal alignment works better with 200 selected examples than with thousands of unfiltered ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that a multimodal large language model fine-tuned on only 200 high-quality vision-language instruction examples can outperform the original MiniGPT-4 model trained on about 3,300 examples. The authors introduce metrics to evaluate the quality of such data and a trainable selector that automatically chooses the best examples while discarding low-quality ones. This shows that, for aligning these models, the quality of instruction data matters more than its quantity. Readers would care because it points to more efficient training processes that require less data collection and computation.

Core claim

MM-LIMA is fine-tuned on a small dataset of only 200 examples, which is approximately 6% of the instruction-following data used for MiniGPT-4, and it outperforms the original model on various evaluations by using proposed quality metrics and a trainable data selector to filter low-quality vision-language data.

What carries the argument

Trainable data selector that identifies high-quality multimodal instruction examples based on several proposed quality metrics for vision-language data.
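
A minimal sketch of what a metric-based selector could look like, assuming each example already carries a few numeric quality scores. The metric names, the linear scoring rule, and the weights are illustrative assumptions; this summary does not specify the paper's actual metrics or selector architecture. In a trainable version the weights would be fitted rather than set by hand.

```python
from dataclasses import dataclass


@dataclass
class Example:
    example_id: str
    metrics: dict  # hypothetical quality scores; higher is assumed better


def quality_score(example, weights):
    """Weighted sum of metric values; a stand-in for a learned selector."""
    return sum(weights[name] * example.metrics[name] for name in weights)


def select_top_k(examples, weights, k):
    """Keep the k examples with the highest predicted quality."""
    ranked = sorted(
        examples, key=lambda ex: quality_score(ex, weights), reverse=True
    )
    return ranked[:k]


# Hypothetical metric names and weights, for illustration only.
weights = {"image_text_match": 0.6, "response_quality": 0.4}
pool = [
    Example("a", {"image_text_match": 0.9, "response_quality": 0.8}),
    Example("b", {"image_text_match": 0.2, "response_quality": 0.9}),
    Example("c", {"image_text_match": 0.7, "response_quality": 0.1}),
]
selected = select_top_k(pool, weights, k=2)
print([ex.example_id for ex in selected])
```

Ranking by a single scalar keeps the selection step simple, but it also means any bias in the metrics passes straight into the curated set.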

If this is right

  • High-quality subsets of instruction data can replace full datasets for effective alignment of multimodal models.
  • Automatic filtering reduces reliance on large volumes of potentially noisy vision-language instructions.
  • Less data can lead to better output generation in multimodal large language models.
  • The method extends the less-is-more principle from text models to multimodal settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar data selection techniques could be tested on other multimodal models to see if they yield comparable gains.
  • Future work might explore if these metrics generalize across different types of vision-language tasks.
  • Reducing data needs could lower the barrier for researchers without access to massive datasets.
  • The selector might be adapted to other modalities or even text-only alignment tasks.

Load-bearing premise

The quality metrics accurately capture what makes multimodal instruction data effective for alignment, and the selector does not overfit or introduce selection bias.

What would settle it

Training MM-LIMA or a similar model on 200 randomly chosen examples without the selector and finding that it matches or exceeds the performance of the selected set would falsify the importance of the quality-based selection.
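
The random-subset control described above can be set up in a few lines. This is a sketch under stated assumptions: `pool_ids` stands in for the full instruction pool, the pool size is a placeholder close to the figure quoted earlier, and `selection_gap` assumes some external evaluation has already produced one score per fine-tuned model.

```python
import random
from statistics import mean


def random_subsets(pool_ids, k, seeds):
    """Draw one size-k random subset per seed, without replacement."""
    return {seed: random.Random(seed).sample(pool_ids, k) for seed in seeds}


def selection_gap(selected_score, random_scores):
    """Selected-set score minus the mean score over the random subsets."""
    return selected_score - mean(random_scores)


# Placeholder pool of example IDs; the size is an assumption.
pool_ids = list(range(3300))
subsets = random_subsets(pool_ids, k=200, seeds=[0, 1, 2])
for seed, ids in subsets.items():
    print(seed, len(ids), len(set(ids)))
```

A gap near zero, or a negative gap, across several seeds would point toward the falsifying outcome; a single random draw would not be enough to tell.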

Figures

Figures reproduced from arXiv: 2308.12067 by Lai Wei, Lichao Sun, Weiran Huang, Xiaozhe Li, Zihao Jiang.

Figure 1: Comparison of MME evaluation (InstructionGPT-4 vs. MiniGPT-4). Different from LIMA [14] that requires manually constructed dataset, we aim to propose a robust and effective data selector that automatically identifies and filters low-quality vision-language data from existing datasets, ensuring that our model is trained on the most relevant and informative samples. The key focus of our study lies in explor…
Figure 2: Overall procedures of the data selector. We first split the vision-language dataset into…
Figure 3: Comparison of MMBench evaluation (InstructionGPT-4 vs. MiniGPT-4).
Figure 4: GPT-4 Evaluation Comparison (InstructionGPT-4 vs. MiniGPT-4). Given the presence of inherent position bias within LLMs as evaluators, wherein certain positions are favored over others [36], we have undertaken measures to address this concern. To mitigate such bias, we conduct evaluations using both response orders – placing InstructionGPT-4’s generated response before and after MiniGPT-4’s response. To e…
Figure 5: Ablation study to investigate the impact of clustering in the testing stage and different types of…
Figure 6: The left part denotes different multimodal feature sizes. The right part denotes different curated data…
Figure 7: Data selector can filter out low-quality data (e.g., inappropriate grammar and incomplete expressions).
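
The Figure 4 caption describes judging each pair of responses in both orders to counter position bias. A minimal sketch of that idea follows. The `judge` callable and its "first" / "second" / "tie" return values are hypothetical, and the rule for combining the two verdicts is an assumption, since the caption is cut off before it states the paper's rule.

```python
def compare_both_orders(judge, question, response_a, response_b):
    """Query the judge twice with the responses swapped and combine verdicts.

    `judge(question, first, second)` is a hypothetical callable that
    returns "first", "second", or "tie".
    """
    forward = judge(question, response_a, response_b)
    backward = judge(question, response_b, response_a)
    # Translate both verdicts into wins for response_a or response_b.
    a_wins = (forward == "first") + (backward == "second")
    b_wins = (forward == "second") + (backward == "first")
    if a_wins > b_wins:
        return "a"
    if b_wins > a_wins:
        return "b"
    return "tie"


def position_biased_judge(question, first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def length_judge(question, first, second):
    """Toy judge that prefers the longer response, regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(compare_both_orders(position_biased_judge, "q", "long answer", "short"))
print(compare_both_orders(length_judge, "q", "long answer", "short"))
```

With this combination rule, a judge driven purely by position cancels out to a tie, while a judge with a consistent content preference yields the same winner in both orders.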
Original abstract

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can achieve satisfactory results even with a limited amount of high-quality instruction-following data. In this paper, we introduce MM-LIMA, which is fine-tuned on a small dataset comprising only 200 examples, amounting to approximately 6% of the instruction-following data used in the alignment dataset for MiniGPT-4. To achieve this, we first propose several metrics to assess the quality of multimodal instruction data. Based on these metrics, we present an effective and trainable data selector to automatically identify and filter low-quality vision-language data. By employing this method, MM-LIMA outperforms the original MiniGPT-4 on various evaluations. Overall, our findings demonstrate that less but high-quality instruction tuning data is efficient in enabling multimodal large language models to generate better output. Our code is available at https://github.com/waltonfuture/InstructionGPT-4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multimodal LLMs can achieve strong alignment with far less instruction data than typically used. Specifically, MM-LIMA is fine-tuned on a 200-example subset (approximately 6% of the data used for MiniGPT-4) obtained by first defining several quality metrics for vision-language instruction pairs and then training a data selector to retain only high-scoring examples; the resulting model is reported to outperform the original MiniGPT-4 across multiple evaluations, supporting the thesis that less but higher-quality data suffices.

Significance. If the empirical claims are substantiated with transparent metric definitions, selector training details, and rigorous controls, the result would meaningfully extend the 'less-is-more' observation from text-only LIMA to the multimodal setting and supply a practical, trainable curation method for instruction data. The availability of code is a positive factor for reproducibility.

major comments (3)
  1. [Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.
  2. [§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.
  3. [§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.
minor comments (2)
  1. [Abstract] The abstract states 'our code is available' but the manuscript does not include a direct link or commit hash in the main text; this should be added for immediate reproducibility.
  2. [§3] Notation for the quality metrics and selector parameters is introduced without a consolidated table of symbols, making it harder to follow the pipeline across sections.
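
The validation asked for in major comment 3 amounts to a rank correlation between selector scores and human ratings on held-out examples. A standard-library sketch of Spearman's rank correlation is given here as one possible way to run that check; the inputs `selector_scores` and `human_ratings` are hypothetical, and nothing in the manuscript summary indicates the authors computed this.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Synthetic placeholder values, not data from the paper.
selector_scores = [0.1, 0.4, 0.35, 0.8]
human_ratings = [1, 3, 2, 4]
print(spearman(selector_scores, human_ratings))
```

A rank-based measure fits here because the selector is only used to order examples, so agreement in ordering matters more than agreement in scale.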

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3: The central claim that the 200-example subset produces superior performance rests on the proposed quality metrics and trainable selector, yet the manuscript provides no definition of the metrics (e.g., whether they are likelihood-based, feature-based, or human-derived), no training objective or feature set for the selector, and no statement that selector training was performed without access to the downstream evaluation prompts. This absence prevents assessment of whether the reported gains are artifacts of selection bias rather than genuine quality improvement.

    Authors: We agree that the original manuscript did not provide sufficient detail on these aspects. In the revised version, we will add explicit definitions of the quality metrics for vision-language pairs, describe the feature set and training objective for the selector, and include a clear statement confirming that selector training used no downstream evaluation prompts. revision: yes

  2. Referee: [§4] §4 and Table 2 (or equivalent results section): The outperformance statement is given without the exact evaluation metrics, baseline configurations (including whether MiniGPT-4 was re-evaluated under identical conditions), number of runs, or statistical significance tests. Without these, it is impossible to determine whether the gains are reliable or merely reflect variance in a small-data regime.

    Authors: We will revise the results section to specify the exact evaluation metrics, confirm identical re-evaluation conditions for the MiniGPT-4 baseline, report the number of runs, and include statistical significance tests to establish the reliability of the gains. revision: yes

  3. Referee: [§3.2] §3.2: The claim that the selector 'automatically identify and filter low-quality vision-language data' requires evidence that the metrics correlate with human preference judgments on a held-out set independent of the selector's training distribution; no such validation is described, leaving open the possibility that the selector simply retains examples the base model already handles well.

    Authors: The primary support for the selector's effectiveness is the observed downstream performance gains. The manuscript does not include a separate held-out correlation analysis with human preferences. We will expand the discussion of this point and note it as a direction for future validation while retaining the performance-based evidence. revision: partial
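
The significance testing promised in the second response could take several forms. One option, sketched here as an assumption rather than as the authors' plan, is an exact sign-flip permutation test on per-seed score differences between the two models. Pairing runs by seed is itself an assumption, and the scores are synthetic.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Under the null, each paired difference is equally likely to flip sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total


# Synthetic numbers, not results from the paper.
print(paired_sign_flip_p_value([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```

With n paired runs the smallest attainable two-sided p-value is 2 / 2^n, so three runs cannot go below 0.25 and at least six are needed to reach 0.05. That bound is one concrete reason the number of runs has to be reported.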

Circularity Check

0 steps flagged

No circularity; empirical filtering and comparison are independent of fitted inputs

full rationale

The paper's chain consists of proposing quality metrics, training a selector on those metrics to curate 200 examples, fine-tuning MiniGPT-4 on the curated set, and reporting superior results on external evaluations. No equations, self-citations, or derivations are shown that reduce the performance claim to a tautology or to parameters fitted directly to the target metric. The result is benchmarked against the original model on held-out tasks, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the validity of the newly proposed quality metrics and the effectiveness of the data selector; these are introduced by the paper rather than taken from prior literature.

axioms (1)
  • domain assumption: The proposed metrics for assessing multimodal instruction data quality are reliable indicators of usefulness for model alignment.
    The data selection process rests on these metrics being meaningful; the abstract does not provide independent validation.
invented entities (1)
  • Trainable data selector (no independent evidence)
    purpose: Automatically identify and filter low-quality vision-language instruction data based on the proposed metrics.
    This component is introduced in the paper to enable the 200-example selection.

pith-pipeline@v0.9.0 · 5730 in / 1244 out tokens · 40580 ms · 2026-05-24T08:09:58.574162+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SCOPE-RL: Stable and Quantitative Control of Policy Entropy in RL Post-Training

    cs.LG · 2025-10 · unverdicted · novelty 6.0

    SCOPE-RL adds a regularization term built from high-temperature positive samples to quantitatively control entropy dynamics and maintain exploration in RL post-training of reasoning LLMs.

  2. A Survey on Multimodal Large Language Models

    cs.CV · 2023-06 · accept · novelty 3.0

    This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 2 Pith papers · 18 internal anchors

  1. [1]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  2. [2]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  4. [4]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  5. [5]

    Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023

  6. [6]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  7. [7]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023

  8. [8]

    LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

    Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023

  9. [9]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023

  10. [10]

    Svit: Scaling up visual instruction tuning

    Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023

  11. [11]

    Llavar: Enhanced visual instruction tuning for text-rich image understanding

    Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023

  12. [12]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023

  13. [13]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning

    Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Jingkang Yang, and Ziwei Liu. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023

  14. [14]

    LIMA: Less Is More for Alignment

    Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023

  15. [15]

    Alpagasus: Training a better alpaca with fewer data

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701, 2023

  16. [16]

    Instruction mining: High-quality instruction data selection for large language models

    Yihan Cao, Yanbin Kang, and Lichao Sun. Instruction mining: High-quality instruction data selection for large language models. arXiv preprint arXiv:2307.06290, 2023

  17. [17]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  18. [18]

    Reward model trained from human feedback

    OpenAssistant. Reward model trained from human feedback. https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2, 2023

  19. [19]

    On spectral clustering: Analysis and an algorithm

    Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. 2001

  20. [20]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023

  21. [21]

    MMBench: Is Your Multi-modal Model an All-around Player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  22. [22]

    Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models

    Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, and Ping Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023

  23. [23]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  24. [24]

    The truth of the f-measure

    Yutaka Sasaki et al. The truth of the f-measure. Teach tutor mater, 2007

  25. [25]

    Openclip

    Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. 2021

  26. [26]

    Deep learning on a data diet: Finding important examples early in training

    Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. NeurIPS, 2021

  27. [27]

    K-means++ the advantages of careful seeding

    David Arthur and Sergei Vassilvitskii. K-means++ the advantages of careful seeding. In SODA, 2007

  28. [28]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019

  29. [29]

    Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning

    Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214, 2021

  30. [30]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. arXiv preprint arXiv:2209.09513, 2022

  31. [31]

    Ok-vqa: A visual question answering benchmark requiring external knowledge

    Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019

  32. [32]

    Docvqa: A dataset for vqa on document images

    Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021

  33. [33]

    Towards vqa models that can read

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019

  34. [34]

    Icdar 2019 competition on scene text visual question answering

    Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Minesh Mathew, CV Jawahar, Ernest Valveny, and Dimosthenis Karatzas. Icdar 2019 competition on scene text visual question answering. In ICDAR. IEEE, 2019

  35. [35]

    Vizwiz: nearly real-time answers to visual questions

    Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In UIST, 2010

  36. [36]

    Large Language Models are not Fair Evaluators

    Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926, 2023

  37. [37]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  38. [38]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022

  39. [39]

    LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023
