ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Pith reviewed 2026-05-23 22:17 UTC · model grok-4.3
The pith
A pipeline using GPT-4V to create 1.3 million synthetic samples lets lite vision-language models reach competitive results on 17 benchmarks and match larger models on several tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that high-quality synthetic data generated by a strong proprietary model can substitute for larger model scale: a 1.3M-sample dataset of fine-grained image annotations and complex reasoning VQA pairs produced by GPT-4V enables lite VLMs to achieve competitive performance on 17 benchmarks among 4B-scale models and to perform on par with 7B/13B-scale models on various benchmarks.
What carries the argument
A two-stage synthetic data pipeline: the first stage generates fine-grained image annotations for vision-language alignment, and the second produces complex reasoning visual question-answering pairs for visual instruction fine-tuning.
If this is right
- Lite VLMs trained on the synthetic dataset reach competitive results among 4B-scale models across 17 benchmarks.
- The same models perform on par with 7B- and 13B-scale models on various benchmarks.
- High-quality synthetic data reduces the need for massive human-labeled corpora when building resource-efficient LVLMs.
- The approach demonstrates that data synthesis can make vision-language capabilities available with lower training and deployment costs.
Where Pith is reading between the lines
- Open-sourcing the ALLaVA dataset lets independent groups replicate or extend the efficiency gains without repeating the GPT-4V generation step.
- The same synthesis method could be applied to other multimodal tasks such as image captioning or visual reasoning to test whether data quality continues to substitute for scale.
- Combining the synthetic pre-training with later fine-tuning on real user data might further close remaining gaps on tasks where synthetic artifacts remain visible.
Load-bearing premise
The GPT-4V generated annotations and VQA pairs are of high enough quality and free of systematic artifacts that training on them produces genuine capability gains rather than benchmark-specific overfitting.
What would settle it
A large performance drop on a fresh collection of vision-language benchmarks that were never used during data generation or model selection would indicate the gains come from overfitting to the synthetic data distribution.
Original abstract
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset ALLaVA, and open-source it to the research community for developing better resource-efficient LVLMs for wider usage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ALLaVA, a 1.3M-sample synthetic dataset generated via GPT-4V to produce fine-grained image annotations for vision-language alignment and complex reasoning VQA pairs for instruction tuning. Lite VLMs (around 4B scale) are trained on this data and reported to achieve competitive results on 17 benchmarks relative to other 4B LVLMs, with parity to some 7B/13B models on selected tasks, demonstrating that high-quality synthetic data can close the performance gap for resource-efficient models.
Significance. If the gains are attributable to genuine capability rather than artifacts, the work would show that carefully synthesized data from strong proprietary models can enable smaller VLMs to approach the performance of much larger ones, with direct implications for lowering training and inference costs in multimodal systems.
major comments (2)
- [§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.
- [Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.
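The variance reporting asked for in the second major comment can be illustrated with a minimal sketch. It assumes each model has been evaluated over several random seeds on one benchmark; the function names, the one-standard-deviation overlap rule, and the numbers are all hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of one model's per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return mean(scores), stdev(scores)


def intervals_overlap(scores_a, scores_b, k=1.0):
    """True if the mean +/- k*std intervals of the two models overlap."""
    mean_a, std_a = summarize_runs(scores_a)
    mean_b, std_b = summarize_runs(scores_b)
    return abs(mean_a - mean_b) <= k * (std_a + std_b)


# Placeholder scores for illustration only; these are NOT results from the paper.
lite_model_runs = [60.0, 61.0, 62.0]
larger_model_runs = [61.0, 62.0, 63.0]

lite_mean, lite_std = summarize_runs(lite_model_runs)
parity_plausible = intervals_overlap(lite_model_runs, larger_model_runs)
```

Overlapping intervals are only a rough screen. A formal significance test would be needed before claiming parity or a real difference between two models.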
minor comments (2)
- [§4] Notation for model scales (e.g., '4B LVLMs') should be defined consistently with parameter counts of the trained models and baselines.
- [§3] The pipeline description would benefit from an explicit diagram or table enumerating the exact prompts and filtering steps used to produce the 1.3M samples.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The two major comments identify important gaps in validating the empirical claims. We address each below and commit to revisions that strengthen the manuscript without overstating what was originally done.
-
Referee: [§3 (data generation pipeline) and §4 (experiments)] The central empirical claim (competitive performance on 17 benchmarks) rests on the assumption that the GPT-4V-generated VQA pairs and annotations contain no systematic overlap with the evaluation sets. No decontamination analysis, overlap statistics, or membership inference checks are described for the 1.3M samples relative to the 17 benchmarks; given GPT-4V's training corpus, this leaves open the possibility that reported gains partly reflect implicit test-set supervision rather than improved generalization.
Authors: We agree this is a substantive concern. The synthesis pipeline generates new annotations and VQA pairs via GPT-4V, but the original manuscript contains no decontamination analysis or overlap statistics against the 17 evaluation sets. We will add this analysis in revision, including n-gram overlap counts, semantic similarity filtering, and checks for near-duplicate VQA pairs where computationally feasible. This will be reported in a new subsection under data generation. revision: yes
-
Referee: [Abstract and §4] The abstract and experimental summary state performance outcomes without reporting error bars, number of runs, or statistical significance tests for the benchmark comparisons; this makes it impossible to assess whether the parity with 7B/13B models is robust or within the variance of the evaluation protocol.
Authors: We acknowledge the lack of statistical reporting. All main results were obtained from single training runs owing to the scale of 4B-model training. In the revision we will state the number of runs explicitly, add error bars for the primary benchmarks by re-running a subset of models with different random seeds, and include variance estimates from the existing ablation studies to contextualize the reported parity with larger models. revision: yes
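The n-gram overlap count mentioned in the first response could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' procedure: the word-level tokenization, the default window of 8 tokens, and the names `ngrams` and `contamination_rate` are all hypothetical choices.

```python
import re


def ngrams(text, n=8):
    """Return the set of lowercase word n-grams found in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(train_texts, benchmark_texts, n=8):
    """Fraction of training samples sharing at least one n-gram with the benchmark."""
    benchmark_grams = set()
    for text in benchmark_texts:
        benchmark_grams |= ngrams(text, n)
    flagged = sum(1 for text in train_texts if ngrams(text, n) & benchmark_grams)
    return flagged / len(train_texts) if train_texts else 0.0


# Toy strings for illustration only; a short window (n=3) keeps the example small.
train = [
    "What color is the red car parked outside?",
    "Describe the weather in this photo.",
]
benchmark = ["what color is the red car in the image"]
rate = contamination_rate(train, benchmark, n=3)
```

Exact n-gram matching misses paraphrased questions, which is presumably why the response also lists semantic similarity filtering and near-duplicate checks.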
Circularity Check
No circularity in empirical pipeline or benchmark results
The paper presents an empirical pipeline: GPT-4V is used to synthesize 1.3M image annotations and VQA pairs, lite VLMs are trained on them, and performance is measured directly on 17 external benchmarks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations are present in the abstract or described claims. Results are falsifiable experimental outcomes rather than reductions to inputs by construction, so the work is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPT-4V can reliably produce fine-grained image annotations and complex reasoning VQA pairs suitable for VLM training
Forward citations
Cited by 23 Pith papers
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Mask prior drift and positional attention collapse cause failures in LDVLMs for long generations, fixed by training-free Mask Prior Suppression and Monotonic RoPE Scaling.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
-
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
-
Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Diagnoses mask prior drift and positional attention collapse in LDVLMs and introduces two plug-and-play decoding interventions that raise long-form generation quality without retraining.
-
LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
-
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification
A Nash equilibrium framework for training-free multimodal step verification that uses cross-modal agreement and disagreement signals for filtering and ranking reasoning steps.
-
Qwen2.5-VL Technical Report
Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...
-
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling
InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding be...
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
NVILA: Efficient Frontier Visual Language Models
NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5....
-
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.
-
LLaVA-OneVision: Easy Visual Task Transfer
LLaVA-OneVision is the first single open LMM to simultaneously achieve strong performance in single-image, multi-image, and video scenarios with cross-scenario transfer capabilities.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
-
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.