Recognition: unknown
Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency
Pith reviewed 2026-05-08 03:43 UTC · model grok-4.3
The pith
Structured pruning of vision-language models allows recovery of over 95 percent of the original performance with only 5 percent of the original training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that, after layerwise or widthwise structural pruning of the language-model backbone of an LVLM, recovery training that combines supervised finetuning with hidden-state distillation restores most of the original performance, and that this recovery succeeds with only 5% of the original training data while retaining over 95% of performance. Widthwise pruning holds up better in low-resource scenarios, and at small compression levels, finetuning only the multimodal projector is sufficient.
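The projector-only recipe for mild compression is simple to express in code. A minimal PyTorch sketch, assuming a LLaVA-style model that exposes its vision-to-language projector under an attribute such as multi_modal_projector (a hypothetical name; the actual attribute depends on the implementation):

```python
import torch

def freeze_all_but_projector(model: torch.nn.Module,
                             projector_attr: str = "multi_modal_projector"):
    """Freeze every parameter except the multimodal projector.

    `projector_attr` is a hypothetical attribute name; the real one depends
    on the model family being compressed.
    """
    for param in model.parameters():
        param.requires_grad = False
    for param in getattr(model, projector_attr).parameters():
        param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch (model loading and data omitted):
# trainable = freeze_all_but_projector(pruned_model)
# optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```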
What carries the argument
Structured pruning of the language model backbone (layerwise or widthwise) paired with recovery training via supervised finetuning and knowledge distillation on logits and hidden states.
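A hedged sketch of what such a combined recovery objective can look like in PyTorch; the loss weights, temperature, and choice of which hidden states to match are illustrative assumptions rather than the paper's reported settings, and widthwise pruning may additionally require a learned projection to align student and teacher hidden dimensions:

```python
import torch
import torch.nn.functional as F

def recovery_loss(student_logits, teacher_logits,
                  student_hidden, teacher_hidden,
                  labels, temperature=2.0,
                  w_sft=1.0, w_logit=1.0, w_hidden=1.0):
    """Supervised finetuning plus distillation on logits and hidden states."""
    # Supervised finetuning: next-token cross-entropy against ground-truth labels.
    sft = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Logit distillation: KL divergence to the unpruned teacher's softened distribution.
    kd_logits = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hidden-state distillation: match selected layer activations of the teacher.
    kd_hidden = F.mse_loss(student_hidden, teacher_hidden)
    return w_sft * sft + w_logit * kd_logits + w_hidden * kd_hidden
```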
If this is right
- Widthwise pruning is more robust than layerwise pruning when finetuning data or compute is scarce.
- For mild pruning ratios, updating only the multimodal projector during recovery is enough to restore performance.
- The optimal recovery strategy combines supervised finetuning with distillation of hidden states.
- High performance retention is achievable with recovery training on as little as 5% of the original dataset.
Where Pith is reading between the lines
- This approach could reduce the need for training smaller LVLMs from scratch by instead compressing larger pretrained ones.
- Similar pruning and recovery techniques might apply to other large multimodal architectures beyond the tested 3B-7B range.
- Developers could use this to iteratively prune and recover models to find optimal compression levels without full retraining.
Load-bearing premise
The load-bearing premise is that the pruning dynamics, optimal recovery methods, and data efficiency observed on the three tested LVLM families and benchmarks will hold for other model sizes, tasks, and deployment environments.
What would settle it
Observing that recovery training with 5% data on a new LVLM family or a different multimodal benchmark results in performance retention below 90% of the original would falsify the data-efficiency claim.
read the original abstract
While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small language models, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the language model backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines structural pruning of LVLMs (3B-7B scale) via layerwise and widthwise methods on the language backbone, followed by recovery via supervised finetuning and logit/hidden-state distillation. It reports that widthwise pruning is more robust under low compute/data, that projector-only finetuning suffices for mild compression, and that recovery training on only 5% of data can retain >95% of original performance across three LVLM families, with code released.
Significance. If the empirical findings hold with proper controls, the work supplies actionable, data-efficient recipes for compressing existing LVLMs without full retraining, which is directly relevant to edge deployment. The multi-family evaluation and open code are strengths that increase the potential utility for practitioners.
major comments (2)
- Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.
- Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.
minor comments (3)
- Clarify in the methods section how the 5% data subset is sampled (random, stratified, or fixed) and whether the same subset is used across all pruning ratios and models.
- Add a table or figure caption that explicitly lists the exact benchmarks, metrics, and original (unpruned) scores for each LVLM family so that the '95% retention' figures can be directly cross-checked.
- The distinction between 'layerwise' and 'widthwise' pruning should be illustrated with a small diagram or pseudocode in §3 to avoid ambiguity for readers unfamiliar with the exact structural cuts (a minimal sketch of the two cuts follows below).
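To make the requested distinction concrete, here is a minimal sketch of the two structural cuts, assuming a decoder-only backbone whose blocks live in a model.layers ModuleList and precomputed importance scores; the paper's actual importance criteria and handling of coupled weights are not reproduced here:

```python
import torch

def prune_layerwise(model, layer_importance, keep_ratio=0.75):
    """Layerwise pruning: drop whole transformer blocks with the lowest importance."""
    n_keep = max(1, int(len(model.layers) * keep_ratio))
    keep = sorted(range(len(model.layers)),
                  key=lambda i: layer_importance[i], reverse=True)[:n_keep]
    model.layers = torch.nn.ModuleList(model.layers[i] for i in sorted(keep))
    return model

def prune_widthwise(linear, channel_importance, keep_ratio=0.75):
    """Widthwise pruning: shrink a linear layer's output width by removing the
    least important channels (the matching input channels of the following
    layer must be removed as well, which is omitted here)."""
    n_keep = max(1, int(linear.out_features * keep_ratio))
    keep = torch.topk(channel_importance, n_keep).indices.sort().values
    pruned = torch.nn.Linear(linear.in_features, n_keep, bias=linear.bias is not None)
    pruned.weight.data = linear.weight.data[keep].clone()
    if linear.bias is not None:
        pruned.bias.data = linear.bias.data[keep].clone()
    return pruned
```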
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us identify areas for improvement in our manuscript. We address each major comment below and outline the revisions we plan to make.
read point-by-point responses
- Referee: Abstract and results on data efficiency: the central claim that 'effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance' is presented without reported standard deviation across multiple random 5% subsets, different seeds, or repeated subsampling. In low-data regimes this omission is load-bearing, as subset choice can materially affect retention; the manuscript must either add such variance statistics or qualify the claim.
Authors: We agree that the absence of variance statistics for the 5% data subset experiments is a limitation, particularly in low-data regimes where subset selection can influence results. Our current experiments used a single random 5% subset for each model family. In the revised manuscript, we will perform additional experiments with multiple random subsets and different seeds, reporting mean performance and standard deviations. This will either support or qualify the central claim regarding data efficiency. revision: yes
- Referee: Experimental setup (throughout results sections): the abstract and summary report concrete recovery percentages and method rankings, yet the provided text lacks explicit statements on number of runs, variance across seeds, full baseline comparisons (e.g., unstructured pruning, other distillation variants), and hyperparameter search details for the recovery schedules. These omissions prevent full verification of the performance claims and rankings.
Authors: We acknowledge that more explicit details on the experimental setup would enhance reproducibility and verifiability. The manuscript describes the pruning methods, recovery strategies (supervised finetuning, logit and hidden-state distillation), and evaluations across three LVLM families. However, we did not report the number of runs explicitly (experiments were typically run once per configuration due to computational constraints), nor did we include unstructured pruning baselines or exhaustive hyperparameter sweeps. We will revise the experimental setup section to include these details where available, specify the number of runs, add variance if multiple seeds were tested, and provide hyperparameter information. Regarding additional baselines, we will consider including unstructured pruning comparisons if space permits, but our primary focus was on structured pruning as it is more suitable for deployment. revision: partial
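The variance experiment the authors commit to above can be sketched compactly; run_recovery_and_eval is a placeholder for the full prune-recover-evaluate loop, and the 5% fraction, seeds, and retention definition are illustrative assumptions:

```python
import random
import statistics

def sample_subset(dataset, fraction=0.05, seed=0):
    """Draw one random recovery-training subset (e.g. 5% of the data)."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(list(dataset), k)

def retention_stats(dataset, original_score, run_recovery_and_eval, seeds=(0, 1, 2)):
    """Repeat recovery on several random subsets and report mean/std of retention."""
    retentions = []
    for seed in seeds:
        subset = sample_subset(dataset, seed=seed)
        score = run_recovery_and_eval(subset, seed=seed)  # placeholder callable
        retentions.append(score / original_score)
    return statistics.mean(retentions), statistics.stdev(retentions)
```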
Circularity Check
No circularity: purely empirical study with no derivations
full rationale
The manuscript is an empirical study of pruning and recovery on three LVLM families. It reports experimental outcomes from pruning (layerwise/widthwise), recovery via finetuning/distillation, and data-efficiency tests, with all numbers obtained from held-out evaluations after the interventions. No equations, first-principles derivations, or load-bearing claims reduce to fitted parameters or self-citations by construction; the 5% data result is a direct experimental measurement rather than a renamed fit or imported uniqueness theorem. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Structured pruning of the language-model backbone preserves enough multimodal capability that lightweight recovery training can restore most performance.