Recognition: 1 theorem link · Lean Theorem
LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
Pith reviewed 2026-05-12 05:07 UTC · model grok-4.3
The pith
Cascaded distillation with intermediate teachers narrows the capacity gap between large vision-language models and their smaller counterparts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Instead of distilling directly from one large teacher to a much smaller student, the bottom-up cascaded framework introduces intermediate teachers that successively raise the student's capacity level until the final high-capacity teacher can transfer its knowledge effectively, yielding models that outperform prior distillation baselines on standard VQA tasks.
What carries the argument
The bottom-up cascaded knowledge distillation (CKD) process, which chains teachers of increasing capacity to bridge the gap to the student model; a minimal training-loop sketch follows this summary.
Load-bearing premise
Adding intermediate teachers improves knowledge transfer without creating new optimization difficulties or harming generalization.
What would settle it
A head-to-head comparison on the same seven VQA benchmarks in which direct single-teacher distillation from the largest model to the smallest student matches or exceeds the cascaded version.
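To make the cascaded process concrete, below is a minimal PyTorch-style sketch of bottom-up cascaded distillation with one intermediate teacher. It assumes plain temperature-scaled KL distillation on output logits; the model call signature, loss weighting, temperature, and stage schedule are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0, alpha=0.5):
    """One step: cross-entropy on ground-truth tokens plus KL toward the teacher."""
    images, input_ids, labels = batch
    with torch.no_grad():
        teacher_logits = teacher(images, input_ids)      # teacher stays frozen
    student_logits = student(images, input_ids)
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1), ignore_index=-100)
    kl = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    loss = alpha * ce + (1.0 - alpha) * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def cascaded_distillation(student, teachers, loader, optimizer, steps_per_stage):
    """Bottom-up cascade: distill from teachers ordered by increasing capacity,
    e.g. teachers = [intermediate_teacher, large_teacher]."""
    for teacher, num_steps in zip(teachers, steps_per_stage):
        teacher.eval()
        for _, batch in zip(range(num_steps), loader):
            distill_step(student, teacher, batch, optimizer)
    return student
```

Direct single-teacher distillation, the baseline named in the settling experiment above, is then simply the special case teachers = [large_teacher].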
Original abstract
Large Vision-Language Models (VLMs) are successful in addressing a multitude of vision-language understanding tasks, such as Visual Question Answering (VQA), but their memory and compute requirements remain a concern for practical deployment. A promising class of techniques for mitigating this concern is Knowledge Distillation, where knowledge from a high-capacity Teacher network is transferred to a considerably smaller Student network. However, the capacity gap between the two networks is both a blessing and a curse: the smaller the Student network, the better its efficiency, and the larger the Teacher, the more knowledge it carries; yet, beyond a point, the larger capacity gap between the two leads to worse knowledge transfer. To counter this effect, we propose a bottom-up cascaded knowledge distillation (CKD) framework. Instead of treating knowledge transfer as an activity involving one high-capacity Teacher (or an ensemble of such), inspired by human formal education systems, we introduce one (potentially, more) additional Teacher(s) of intermediate capacity that gradually bring the Student network to the next level, where the next (higher-capacity) Teacher can take over. We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student. We apply the proposed framework to models built upon the LLaVA methodology and evaluate the derived models on seven standard, publicly available VQA benchmarks, demonstrating their SotA performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LLaVA-CKD, a bottom-up cascaded knowledge distillation (CKD) framework for compressing large vision-language models based on the LLaVA methodology. It introduces one or more intermediate-capacity teachers to gradually bridge the capacity gap between a high-capacity teacher and a smaller student, provides a theoretical analysis of the effect of cascaded distillation on student generalization performance, and reports state-of-the-art results on seven standard publicly available VQA benchmarks.
Significance. If the central claims hold, the work could meaningfully advance practical deployment of VLMs by improving knowledge transfer efficiency in distillation without requiring ensembles of large teachers. The attempt at a theoretical analysis of generalization effects and evaluation on public benchmarks are strengths that would support broader adoption if the gains are shown to stem specifically from the cascade structure rather than ancillary factors.
major comments (3)
- [Theoretical Analysis] The theoretical analysis is invoked to study cascaded distillation's impact on student generalization, yet the manuscript provides no derivation or tightness argument showing that the bound holds under LLaVA's specific training regime (contrastive loss plus language modeling on image-text pairs). This leaves the load-bearing claim that intermediate teachers improve generalization without new pathologies unverified.
- [Experiments] The empirical SOTA claims on the seven VQA benchmarks rest on comparisons that do not isolate the cascade structure; no ablation holds total teacher-student training FLOPs fixed while varying only the number of distillation stages. Without this control, observed gains could arise from differences in compute budget, data ordering, or hyper-parameters rather than the proposed bottom-up framework.
- [Method] The core assumption that inserting intermediate-capacity teachers reliably bridges the capacity gap without introducing optimization issues or error propagation is stated but not subjected to controlled tests (e.g., direct high-to-low vs. cascaded under matched total compute). This assumption is load-bearing for both the method and the generalization claims.
minor comments (2)
- [Method] Notation for the number and capacities of intermediate teachers should be formalized with explicit variables rather than left as 'one (potentially, more) additional Teacher(s)'.
- [Introduction] The abstract and introduction would benefit from a concise statement of the exact LLaVA variants used as teacher, intermediate, and student models.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where the concerns identify gaps in the presented evidence.
Point-by-point responses
-
Referee: [Theoretical Analysis] The theoretical analysis is invoked to study cascaded distillation's impact on student generalization, yet the manuscript provides no derivation or tightness argument showing that the bound holds under LLaVA's specific training regime (contrastive loss plus language modeling on image-text pairs). This leaves the load-bearing claim that intermediate teachers improve generalization without new pathologies unverified.
Authors: We appreciate the referee's emphasis on rigor. Section 3 of the manuscript presents a generalization bound for cascaded distillation that demonstrates how intermediate teachers reduce the effective capacity gap under standard assumptions on loss smoothness and data distribution. The analysis is formulated generally to apply across distillation settings. However, we acknowledge that an explicit derivation instantiating the bound for the precise combination of contrastive loss and autoregressive language modeling used in LLaVA is not provided, nor is a tightness argument given for this regime. We will revise the paper to include this derivation in Section 3 (or an expanded appendix) together with a discussion of tightness under the LLaVA objective. revision: yes
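For orientation, here is a hedged sketch of the two-stage chaining that a bound of this shape suggests, in the style of Lopez-Paz et al. [40]; the capacity measures $|\mathcal{F}|_C$, transfer rates $\alpha$, and approximation errors $\varepsilon$ are generic placeholders, and the paper's exact statement under the LLaVA objective may differ.

```latex
% Teacher f_t learned from n examples; student f_s distilled from the teacher.
\begin{align*}
R(f_t) - R(f)   &\le O\!\left(\frac{|\mathcal{F}_t|_C}{n^{\alpha_t}}\right) + \varepsilon_t
  && \text{(teacher vs.\ target } f\text{)} \\
R(f_s) - R(f_t) &\le O\!\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{st}}}\right) + \varepsilon_{st}
  && \text{(student vs.\ teacher)} \\
R(f_s) - R(f)   &\le O\!\left(\frac{|\mathcal{F}_s|_C}{n^{\alpha_{st}}}
                 + \frac{|\mathcal{F}_t|_C}{n^{\alpha_t}}\right)
                 + \varepsilon_{st} + \varepsilon_t
  && \text{(sum of the two stages)}
\end{align*}
% A cascade with intermediate teachers chains one such inequality per stage; the
% point to verify for LLaVA-CKD is that faster per-stage rates \alpha and smaller
% approximation errors \varepsilon outweigh the extra additive terms.
```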
-
Referee: [Experiments] The empirical SOTA claims on the seven VQA benchmarks rest on comparisons that do not isolate the cascade structure; no ablation holds total teacher-student training FLOPs fixed while varying only the number of distillation stages. Without this control, observed gains could arise from differences in compute budget, data ordering, or hyper-parameters rather than the proposed bottom-up framework.
Authors: We agree that isolating the contribution of the cascade structure requires tighter controls. The reported experiments compare LLaVA-CKD against direct-distillation and other baselines while attempting to keep overall training resources comparable, but we did not include an ablation that explicitly fixes total teacher-student FLOPs and varies only the number of stages. We will add this controlled ablation to the experimental section of the revised manuscript to demonstrate that performance differences arise from the cascaded structure itself. revision: yes
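As a rough illustration of what a compute-matched ablation could look like, the sketch below allocates one fixed distillation FLOP budget across cascade stages whose per-step costs differ because larger teachers are more expensive to run; the per-step costs and the equal-share split are assumptions made here for illustration, not the paper's protocol.

```python
def steps_per_stage(total_flops, per_step_flops):
    """Split one fixed FLOP budget evenly across stages with different per-step costs."""
    share = total_flops / len(per_step_flops)        # equal FLOPs per stage (an assumption)
    return [int(share // cost) for cost in per_step_flops]

# Baseline: single-stage distillation from the largest teacher only.
baseline_step_cost = 9.0e12                          # illustrative FLOPs per training step
baseline_steps = 100_000
budget = baseline_step_cost * baseline_steps

# Cascade: a cheaper intermediate-teacher stage followed by the largest teacher.
cascade_step_costs = [6.0e12, 9.0e12]                # illustrative FLOPs per step
print(steps_per_stage(budget, cascade_step_costs))   # -> [75000, 50000] under the same budget
```

Holding the budget fixed this way is what would let any accuracy difference be attributed to the number of stages rather than to extra compute.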
-
Referee: [Method] The core assumption that inserting intermediate-capacity teachers reliably bridges the capacity gap without introducing optimization issues or error propagation is stated but not subjected to controlled tests (e.g., direct high-to-low vs. cascaded under matched total compute). This assumption is load-bearing for both the method and the generalization claims.
Authors: This concern is valid. The manuscript includes direct comparisons of cascaded versus single-stage distillation, yet these comparisons were not performed under strictly matched total compute budgets. We will add new experiments in the revised version that directly contrast high-to-low distillation against the cascaded approach while holding total training FLOPs constant, thereby testing for optimization difficulties or error propagation. revision: yes
Circularity Check
No circularity: CKD framework and generalization analysis are independent proposals with external empirical validation.
full rationale
The paper introduces a bottom-up cascaded knowledge distillation framework as a novel proposal, accompanied by a theoretical analysis of its impact on student generalization performance. No equations, predictions, or central claims reduce by construction to fitted parameters, self-defined quantities, or load-bearing self-citations from the authors' prior work. The LLaVA-based implementation and SotA results on seven VQA benchmarks are presented as downstream empirical outcomes rather than tautological restatements of inputs. This satisfies the default expectation of self-contained derivation without circular reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Capacity and number of intermediate teachers
axioms (1)
- [domain assumption] Beyond a point, a larger capacity gap between teacher and student leads to worse knowledge transfer
Lean theorems connected to this paper
-
File: IndisputableMonolith/Cost/FunctionalEquation.lean · Theorem: washburn_uniqueness_aczel · Match: unclear
Paper excerpt: "We provide a theoretical analysis in order to study the effect of cascaded distillation on the generalization performance of the Student..."
Cited bound: $R(f_s) - R(f) \le O\big(|\mathcal{F}_s|_C + |\mathcal{F}_t|_C / n^{\alpha_{st}}\big) + \varepsilon_{st} + \varepsilon_t$
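The excerpted bound composes per-stage excess-risk terms; as a minimal Lean sketch (not the linked theorem, and not taken from the paper or from IndisputableMonolith), the composition step itself is just additive chaining of two inequalities:

```lean
import Mathlib

-- Illustrative only: if the student's risk is within `a` of the intermediate
-- teacher's risk, and the intermediate teacher's risk is within `b` of the
-- target risk, then the student's risk is within `a + b` of the target.
example (R_s R_m R_f a b : ℝ)
    (h₁ : R_s - R_m ≤ a)   -- student vs. intermediate teacher
    (h₂ : R_m - R_f ≤ b)   -- intermediate teacher vs. target
    : R_s - R_f ≤ a + b := by
  linarith
```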
Reference graph
Works this paper leans on
- [1] C. X. Liang, P. Tian, C. H. Yin, Y. Yua, W. An-Hou, L. Ming, T. Wang, Z. Bi, and M. Liu, "A comprehensive survey and guide to multimodal large language models in vision-language tasks," CoRR, vol. abs/2411.06284, 2024.
- [2] J. Xie, Z. Chen, R. Zhang, and G. Li, "Large multimodal agents: a survey," Vis. Intell., vol. 3, Nov. 2025.
- [3] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," CoRR, vol. abs/2001.08361, 2020.
- [4] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, O. Vinyals, J. W. Rae, and L. Sifre, "Training compute-optimal large language models," in NeurIPS, 2022.
- [5] Y. Jin, J. Li, T. Gu, Y. Liu, B. Zhao, J. Lai, Z. Gan, Y. Wang, C. Wang, X. Tan, and L. Ma, "Efficient multimodal large language models: a survey," Vis. Intell., vol. 3, Dec. 2025.
- [6] A. Sharshar, L. U. Khan, W. Ullah, and M. Guizani, "Vision-language models for edge networks: A comprehensive survey," IEEE IoT-J, vol. 12, no. 16, pp. 32701–32724, 2025.
- [7] Y. Cai, J. Zhang, H. He, X. He, A. Tong, Z. Gan, C. Wang, Z. Xue, Y. Liu, and X. Bai, "Llava-kd: A framework of distilling multimodal large language models," in ICCV, pp. 239–249, Oct. 2025.
- [8] D. Busbridge, A. Shidani, F. Weers, J. Ramapuram, E. Littwin, and R. Webb, "Distillation scaling laws," in ICML, July 2025.
- [9] S.-I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, "Improved knowledge distillation via teacher assistant," in AAAI, 2020.
- [10] W. Son, J. Na, J. Choi, and W. Hwang, "Densely guided knowledge distillation using multiple teacher assistants," in ICCV, pp. 9375–9384, Oct. 2021.
- [11] X. Dong, O. Huang, P. Thulasiraman, and A. Mahanti, "Improved knowledge distillation via teacher assistants for sentiment analysis," in IEEE SSCI, pp. 300–305, Dec. 2023.
- [12] Y. Jin, J. Li, Y. Liu, T. Gu, K. Wu, Z. Jiang, M. He, B. Zhao, X. Tan, Z. Gan, Y. Wang, C. Wang, and L. Ma, "Efficient multimodal large language models: A survey," Vis. Intell., vol. 3, Nov. 2025.
- [13] X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, and C. Shen, "MobileVLM: A fast, strong and open vision language assistant for mobile devices," CoRR, vol. abs/2312.16886, 2023.
- [14] X. Chu, L. Qiao, X. Zhang, S. Xu, F. Wei, Y. Yang, X. Sun, Y. Hu, X. Lin, B. Zhang, and C. Shen, "MobileVLM V2: Faster and stronger baseline for vision language model," CoRR, vol. abs/2402.03766, 2024.
- [15] H. Liu, C. Li, Q. Wu, and Y. J. Lee, "Visual instruction tuning," in NeurIPS, Dec. 2023.
- [16] B. Zhou, Y. Hu, X. Weng, J. Jia, J. Luo, X. Liu, J. Wu, and L. Huang, "Tinyllava: A framework of small-scale large multimodal models," CoRR, vol. abs/2402.14289, 2024.
- [17] H. Liu, C. Li, Y. Li, and Y. J. Lee, "Improved baselines with visual instruction tuning," in CVPR, pp. 26286–26296, June 2024.
- [18] G. H. Chen, S. Chen, R. Zhang, J. Chen, X. Wu, Z. Zhang, Z. Chen, J. Li, X. Wan, and B. Wang, "Allava: Harnessing GPT4V-synthesized data for a lite vision-language model," CoRR, vol. abs/2402.11684, 2024.
- [19] F. Cocchi, N. Moratelli, D. Caffagni, S. Sarto, L. Baraldi, M. Cornia, and R. Cucchiara, "LLaVA-MORE: A comparative study of LLMs and visual backbones for enhanced visual instruction tuning," in ICCVW, pp. 4337–4347, Oct. 2025.
- [20] Z. Shao, Z. Yu, J. Yu, X. Ouyang, L. Zheng, Z. Gai, M. Wang, and J. Ding, "Imp: Highly capable large multimodal models for mobile devices," CoRR, vol. abs/2405.12107, 2024.
- [21] Z. Shao, Z. Yu, J. Yu, X. Ouyang, L. Zheng, Z. Gai, M. Wang, Z. Kuang, and J. Ding, "Imp: Highly capable large multimodal models for mobile devices," IEEE TMM, vol. 27, pp. 2961–2974, 2025.
- [22] M. He, Y. Liu, B. Wu, J. Yuan, Y. Wang, T. Huang, and B. Zhao, "Efficient multimodal learning from data-centric perspective," CoRR, vol. abs/2402.11530, 2024.
- [23] D. Liu, R. Zhang, L. Qiu, S. Huang, W. Lin, S. Zhao, S. Geng, Z. Lin, P. Jin, K. Zhang, W. Shao, C. Xu, C. He, J. He, H. Shao, P. Lu, Y. Qiao, H. Li, and P. Gao, "SPHINX-X: Scaling data and parameters for a family of multi-modal large language models," in ICML, vol. 235, pp. 32400–32420, July 2024.
- [24] X. Zhang, D. Li, B. Liu, Z. Bao, Y. Zhou, B. Yang, Z. Liu, Y. Zhong, and T. Yuan, "Layer-wise vision injection with disentangled attention for efficient LVLMs," in ICCV, pp. 239–249, Oct. 2025.
- [25] B. Lin, Z. Tang, Y. Ye, J. Huang, J. Zhang, Y. Pang, P. Jin, M. Ning, J. Luo, and L. Yuan, "MoE-LLaVA: Mixture of experts for large vision-language models," IEEE TMM, pp. 1–14, 2026.
- [26] Y. Li, Y. Zhang, C. Wang, Z. Zhong, Y. Chen, R. Chu, S. Liu, and J. Jia, "Mini-gemini: Mining the potential of multi-modality vision language models," IEEE TPAMI, vol. 48, no. 3, pp. 3530–3543, 2026.
- [27] C. Wang, Z. Wang, X. Xu, Y. Tang, J. Zhou, and J. Lu, "Q-VLM: Post-training quantization for large vision-language models," in NeurIPS, Dec. 2024.
- [28] S. Li, Y. Hu, X. Ning, X. Liu, K. Hong, X. Jia, X. Li, Y. Yan, P. Ran, G. Dai, S. Yan, H. Yang, and Y. Wang, "MBQ: Modality-balanced quantization for large vision-language models," in CVPR, pp. 4167–4177, June 2025.
- [29] Y. Sung, J. Yoon, and M. Bansal, "Ecoflap: Efficient coarse-to-fine layer-wise pruning for vision-language models," in ICLR, 2024.
- [30] M. Farina, M. Mancini, E. Cunegatti, G. Liu, G. Iacca, and E. Ricci, "MULTIFLOW: Shifting towards task-agnostic vision-language pruning," in CVPR, pp. 16185–16195, June 2024.
- [31] K. T. Chitty-Venkata, M. Emani, and V. Vishwanath, "Langvision-lora-nas: Neural architecture search for variable lora rank in vision language models," in ICIP, pp. 1330–1335, Sept. 2025.
- [32] S. Xu, X. Li, H. Yuan, L. Qi, Y. Tong, and M. Yang, "LLAVADI: What matters for multimodal large language models distillation," CoRR, vol. abs/2407.19409, 2024.
- [33] J. Cao, Y. Zhang, T. Huang, M. Lu, Q. Zhang, R. An, N. Ma, and S. Zhang, "Move-kd: Knowledge distillation for vlms with mixture of visual encoders," in CVPR, pp. 19846–19856, June 2025.
- [34] F. Shu, Y. Liao, L. Zhang, L. Zhuo, C. Xu, G. Zhang, H. Shi, L. Chan, T. Zhong, Z. Yu, W. He, S. Fu, H. Li, S. Liu, H. Li, and H. Jiang, "Llava-mod: Making llava tiny via MoE-knowledge distillation," in ICLR, Singapore, Apr. 2025.
- [35] J. Kim, K. Kim, S. Seo, and C. Park, "Compodistill: Attention distillation for compositional reasoning in multimodal llms," in ICLR, Rio de Janeiro, Brazil, 2026.
- [36] Q. Feng, W. Li, T. Lin, and X. Chen, "Align-kd: Distilling cross-modal alignment knowledge for mobile vision-language large model enhancement," in CVPR, Nashville, TN, USA, pp. 4178–4188, June 2025.
- [37] Z. Feng, S. Yang, B. Duan, W. Yang, and J. Wang, "EM-KD: Distilling efficient multimodal large language model with unbalanced vision tokens," in AAAI, pp. 21111–21119, Jan. 2026.
- [38] B. Lv, X. Liu, K. Wei, P. Luo, and Y. Yu, "Taekd: Teacher assistant enhanced knowledge distillation for closed-source multilingual neural machine translation," in LREC/COLING, pp. 15530–15541, ELRA and ICCL, May 2024.
- [39] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, September 1998.
- [40] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, "Unifying distillation and privileged information," in ICLR, May 2016.
- [41] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid loss for language image pre-training," in CVPR, pp. 11941–11952, Oct. 2023.
- [42] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu, "Qwen2...
- [43] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, "Making the v in vqa matter: Elevating the role of image understanding in visual question answering," in CVPR, pp. 6325–6334, 2017.
- [44] D. A. Hudson and C. D. Manning, "Gqa: A new dataset for real-world visual reasoning and compositional question answering," in CVPR, pp. 6693–6702, June 2019.
- [45] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, "Towards vqa models that can read," in CVPR, pp. 8309–8318, June 2019.
- [46] P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu, O. Tafjord, P. Clark, and A. Kalyan, "Learn to explain: Multimodal reasoning via thought chains for science question answering," in NeurIPS, Nov./Dec. 2022.
- [47] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and R. Ji, "MME: A comprehensive evaluation benchmark for multimodal large language models," CoRR, vol. abs/2306.13394, 2023.
- [48] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen, "Evaluating object hallucination in large vision-language models," in EMNLP, pp. 292–305, Dec. 2023.
- [49] X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen, "Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi," in CVPR, June 2024.