pith. machine review for the scientific record.

arxiv: 2605.10641 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link · Lean Theorem

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

Nikolaos Gkalelis, Vasileios Mezaris

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:07 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: knowledge distillation · vision-language models · visual question answering · model compression · cascaded distillation

The pith

Cascaded distillation with intermediate teachers narrows the capacity gap between large vision-language models and their smaller counterparts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a bottom-up cascaded knowledge distillation framework that inserts one or more intermediate-capacity teacher models between a high-capacity teacher and a compact student. This gradual progression counters the degradation in knowledge transfer that occurs when the capacity difference becomes too large. The authors supply a theoretical analysis of how the cascade affects the student's generalization and then apply the method to LLaVA-style vision-language models. On seven public visual-question-answering benchmarks the resulting students reach state-of-the-art accuracy while remaining small enough for practical deployment.

Core claim

Instead of distilling directly from one large teacher to a much smaller student, the bottom-up cascaded framework introduces intermediate teachers that successively raise the student's capacity level until the final high-capacity teacher can transfer its knowledge effectively, yielding models that outperform prior distillation baselines on standard VQA tasks.

What carries the argument

The bottom-up cascaded knowledge distillation (CKD) process, which chains teachers of increasing capacity to bridge the gap to the student model.
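
A minimal sketch of one way such a cascade could be run, assuming plain logit distillation (softened-KL plus task cross-entropy) over classification-style outputs; the real LLaVA-CKD setup operates on token-level logits over image-text instruction data, and the teacher ordering, temperature, loss weight, and stage lengths here are illustrative rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Logit distillation: KL to the softened teacher distribution plus task cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def cascaded_distillation(student, teachers, loader, epochs_per_stage=1, lr=1e-4):
    """Bottom-up cascade: distil from the smallest teacher first, then hand over to larger ones."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    student.train()
    for teacher in teachers:  # assumed ordered from lowest to highest capacity
        teacher.eval()
        for _ in range(epochs_per_stage):
            for inputs, labels in loader:
                with torch.no_grad():
                    t_logits = teacher(inputs)
                s_logits = student(inputs)
                loss = kd_loss(s_logits, t_logits, labels)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student
```

The structural point the sketch isolates is the outer loop: the student never faces the largest teacher until the intermediate teachers have raised it to a level where that final transfer is effective.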

Load-bearing premise

Adding intermediate teachers improves knowledge transfer without creating new optimization difficulties or harming generalization.
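
To make the premise concrete, here is a schematic of the kind of decomposition such an analysis typically targets, in the spirit of VC-style distillation bounds (Lopez-Paz et al., 2016) and teacher-assistant KD (Mirzadeh et al., 2020); this is not the paper's theorem, only the shape of the trade-off it has to settle.

```latex
% Schematic only. R(\cdot) is expected risk, n the sample size,
% |\mathcal{F}_i|_C a capacity measure of the learner at stage i,
% \varepsilon_i the approximation error of stage i, and
% 1/2 \le \alpha_i \le 1 a learning-rate exponent assumed closer to 1
% when the teacher-student capacity gap at that stage is small.

% Direct teacher-to-student distillation:
R(f_s) - R(f_t) \;\le\; O\!\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{st}}}\right) + \varepsilon_{st}

% Cascade through intermediate teachers f_1, \dots, f_k (telescoping the gap):
R(f_s) - R(f_t) \;\le\; \sum_{i=0}^{k} \left[\, O\!\left(\frac{|\mathcal{F}_i|_C}{n^{\alpha_i}}\right) + \varepsilon_i \,\right]
```

The premise amounts to claiming that the faster per-stage rates \alpha_i outweigh the extra \varepsilon_i terms the cascade introduces; the referee's first major comment below asks for exactly this trade-off to be instantiated under the LLaVA objective.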

What would settle it

A head-to-head comparison on the same seven VQA benchmarks in which direct single-teacher distillation from the largest model to the smallest student matches or exceeds the cascaded version.

Figures

Figures reproduced from arXiv: 2605.10641 by Nikolaos Gkalelis, Vasileios Mezaris.

Figure 1: Illustration of different KD strategies. Baseline KD suffers from the large capacity gap. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
read the original abstract

Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework on models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LLaVA-CKD, a bottom-up cascaded knowledge distillation (CKD) framework for compressing large vision-language models based on the LLaVA methodology. It introduces one or more intermediate-capacity teachers to gradually bridge the capacity gap between a high-capacity teacher and a smaller student, provides a theoretical analysis of the effect of cascaded distillation on student generalization performance, and reports state-of-the-art results on seven standard publicly available VQA benchmarks.

Significance. If the central claims hold, the work could meaningfully advance practical deployment of VLMs by improving knowledge transfer efficiency in distillation without requiring ensembles of large teachers. The attempt at a theoretical analysis of generalization effects and evaluation on public benchmarks are strengths that would support broader adoption if the gains are shown to stem specifically from the cascade structure rather than ancillary factors.

major comments (3)
  1. [Theoretical Analysis] The theoretical analysis is invoked to study cascaded distillation's impact on student generalization, yet the manuscript provides no derivation or tightness argument showing that the bound holds under LLaVA's specific training regime (contrastive loss plus language modeling on image-text pairs). This leaves the load-bearing claim that intermediate teachers improve generalization without new pathologies unverified.
  2. [Experiments] The empirical SOTA claims on the seven VQA benchmarks rest on comparisons that do not isolate the cascade structure; no ablation holds total teacher-student training FLOPs fixed while varying only the number of distillation stages. Without this control, observed gains could arise from differences in compute budget, data ordering, or hyper-parameters rather than the proposed bottom-up framework.
  3. [Method] The core assumption that inserting intermediate-capacity teachers reliably bridges the capacity gap without introducing optimization issues or error propagation is stated but not subjected to controlled tests (e.g., direct high-to-low vs. cascaded under matched total compute). This assumption is load-bearing for both the method and the generalization claims.
minor comments (2)
  1. [Method] Notation for the number and capacities of intermediate teachers should be formalized with explicit variables rather than left as 'one (potentially, more) additional Teacher(s)'.
  2. [Introduction] The abstract and introduction would benefit from a concise statement of the exact LLaVA variants used as teacher, intermediate, and student models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where the concerns identify gaps in the presented evidence.

read point-by-point responses
  1. Referee: [Theoretical Analysis] The theoretical analysis is invoked to study cascaded distillation's impact on student generalization, yet the manuscript provides no derivation or tightness argument showing that the bound holds under LLaVA's specific training regime (contrastive loss plus language modeling on image-text pairs). This leaves the load-bearing claim that intermediate teachers improve generalization without new pathologies unverified.

    Authors: We appreciate the referee's emphasis on rigor. Section 3 of the manuscript presents a generalization bound for cascaded distillation that demonstrates how intermediate teachers reduce the effective capacity gap under standard assumptions on loss smoothness and data distribution. The analysis is formulated generally to apply across distillation settings. However, we acknowledge that an explicit derivation instantiating the bound for the precise combination of contrastive loss and autoregressive language modeling used in LLaVA is not provided, nor is a tightness argument given for this regime. We will revise the paper to include this derivation in Section 3 (or an expanded appendix) together with a discussion of tightness under the LLaVA objective. revision: yes

  2. Referee: [Experiments] The empirical SOTA claims on the seven VQA benchmarks rest on comparisons that do not isolate the cascade structure; no ablation holds total teacher-student training FLOPs fixed while varying only the number of distillation stages. Without this control, observed gains could arise from differences in compute budget, data ordering, or hyper-parameters rather than the proposed bottom-up framework.

    Authors: We agree that isolating the contribution of the cascade structure requires tighter controls. The reported experiments compare LLaVA-CKD against direct-distillation and other baselines while attempting to keep overall training resources comparable, but we did not include an ablation that explicitly fixes total teacher-student FLOPs and varies only the number of stages. We will add this controlled ablation to the experimental section of the revised manuscript to demonstrate that performance differences arise from the cascaded structure itself. revision: yes

  3. Referee: [Method] The core assumption that inserting intermediate-capacity teachers reliably bridges the capacity gap without introducing optimization issues or error propagation is stated but not subjected to controlled tests (e.g., direct high-to-low vs. cascaded under matched total compute). This assumption is load-bearing for both the method and the generalization claims.

    Authors: This concern is valid. The manuscript includes direct comparisons of cascaded versus single-stage distillation, yet these comparisons were not performed under strictly matched total compute budgets. We will add new experiments in the revised version that directly contrast high-to-low distillation against the cascaded approach while holding total training FLOPs constant, thereby testing for optimization difficulties or error propagation. revision: yes
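
As a concrete shape for the control promised in responses 2 and 3, here is a minimal sketch of a compute-matched comparison, assuming a fixed total step budget that is either spent on a single direct stage or split across the cascade; `step_fn` and `evaluate` are hypothetical hooks standing in for the training step sketched earlier and the VQA benchmark harness.

```python
import itertools

def split_budget(total_steps, num_stages):
    """Split a fixed optimisation-step budget evenly across cascade stages."""
    base, rem = divmod(total_steps, num_stages)
    return [base + (1 if i < rem else 0) for i in range(num_stages)]

def distill_with_budget(student, teachers, loader, steps_per_stage, step_fn):
    """Run one distillation stage per teacher, each capped at its allotted step count.

    step_fn(student, teacher, batch) performs a single optimisation step
    (hypothetical hook wrapping a KD loss)."""
    for teacher, budget in zip(teachers, steps_per_stage):
        for batch in itertools.islice(itertools.cycle(loader), budget):
            step_fn(student, teacher, batch)
    return student

def compute_matched_ablation(make_student, teachers, loader, total_steps, step_fn, evaluate):
    """Direct vs. cascaded distillation under the same total step budget."""
    direct = distill_with_budget(
        make_student(), teachers[-1:], loader, [total_steps], step_fn)
    cascaded = distill_with_budget(
        make_student(), teachers, loader,
        split_budget(total_steps, len(teachers)), step_fn)
    return {"direct": evaluate(direct), "cascaded": evaluate(cascaded)}
```

Matching steps is only a proxy for matching FLOPs, since forward passes through teachers of different sizes cost different amounts; a stricter version of the control would weight each stage's budget by the per-step cost of its teacher-student pair.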

Circularity Check

0 steps flagged

No circularity: CKD framework and generalization analysis are independent proposals with external empirical validation.

full rationale

The paper introduces a bottom-up cascaded knowledge distillation framework as a novel proposal, accompanied by a theoretical analysis of its impact on student generalization performance. No equations, predictions, or central claims reduce by construction to fitted parameters, self-defined quantities, or load-bearing self-citations from the authors' prior work. The LLaVA-based implementation and SotA results on seven VQA benchmarks are presented as downstream empirical outcomes rather than tautological restatements of inputs. This satisfies the default expectation of self-contained derivation without circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the domain assumption that capacity gaps degrade distillation and that intermediate models can be chosen to mitigate this; no free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)
  • Capacity and number of intermediate teachers
    Must be selected to create a gradual progression; values are not stated in the abstract but are central to the method.
axioms (1)
  • domain assumption: Beyond a point, a larger capacity gap between teacher and student leads to worse knowledge transfer.
    Explicitly stated as motivation in the abstract.

pith-pipeline@v0.9.0 · 5555 in / 1164 out tokens · 33162 ms · 2026-05-12T05:07:47.895712+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 3 internal anchors
