DeepSeek-VL: Towards Real-World Vision-Language Understanding
Pith reviewed 2026-05-11 17:51 UTC · model grok-4.3
The pith
DeepSeek-VL models achieve competitive or state-of-the-art results on vision-language benchmarks while delivering strong practical user experiences as chatbots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks.
What carries the argument
A hybrid vision encoder that handles 1024x1024 images at low compute cost, paired with a VL pretraining strategy that integrates LLM training from the start to balance vision and language modalities.
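To make this concrete, below is a minimal sketch of what a two-branch ("hybrid") encoder of this kind could look like: a low-resolution branch for global semantics and a high-resolution branch for fine detail, fused and projected into the LLM's embedding space under a fixed visual-token budget. The backbone choices, dimensions, and token count are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    """Illustrative two-branch encoder: coarse semantics from a low-res view,
    fine detail from a high-res (e.g. 1024x1024) view, fused into LLM tokens."""

    def __init__(self, semantic_backbone: nn.Module, detail_backbone: nn.Module,
                 sem_dim: int = 1024, det_dim: int = 768,
                 llm_dim: int = 4096, num_tokens: int = 576):
        super().__init__()
        self.semantic_backbone = semantic_backbone  # any ViT-like patch encoder (low-res input)
        self.detail_backbone = detail_backbone      # any ViT-like patch encoder (high-res input)
        self.num_tokens = num_tokens
        # Two-layer MLP projecting fused visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(sem_dim + det_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def _to_budget(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, n_patches, dim) -> resample the patch axis to a fixed token count
        # so the visual sequence length stays constant regardless of input resolution.
        return F.adaptive_avg_pool1d(feats.transpose(1, 2), self.num_tokens).transpose(1, 2)

    def forward(self, image_lowres: torch.Tensor, image_highres: torch.Tensor) -> torch.Tensor:
        sem = self._to_budget(self.semantic_backbone(image_lowres))
        det = self._to_budget(self.detail_backbone(image_highres))
        fused = torch.cat([sem, det], dim=-1)        # (batch, num_tokens, sem_dim + det_dim)
        return self.projector(fused)                 # visual tokens fed to the language model
```

A fixed token budget is one way such a design can keep LLM-side compute roughly flat even as image resolution grows: the extra cost of the 1024x1024 view stays inside the vision backbone rather than lengthening the language model's sequence.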
If this is right
- Smaller open-source models become practical choices for real-world visual tasks such as document and chart understanding.
- Instruction tuning drawn from actual use-case taxonomies measurably improves everyday user experience.
- Early integration of language-model training during pretraining preserves performance on text-only benchmarks.
- High-resolution image processing becomes feasible in VL models without large increases in compute overhead.
Where Pith is reading between the lines
- The same data-and-training balance could be tested on other multimodal tasks where one modality tends to dominate training.
- Public release of these base models may speed up development of specialized tools for web and document analysis.
- Emphasis on use-case-derived data might reduce the gap between lab benchmark scores and real deployment performance.
Load-bearing premise
The combination of diverse real-world data, the hybrid encoder, and the balanced pretraining strategy will produce clear gains in user experience and benchmark scores without hidden losses in capability or efficiency.
What would settle it
Direct side-by-side comparisons on the same visual-language benchmarks, or user studies, showing the models falling behind other models of equal size in either accuracy or perceived chatbot quality.
Original abstract
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
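The abstract's point about "carefully managing the competitive dynamics observed between vision and language modalities" is, at its core, a data-mixing and scheduling question. The sketch below shows one hypothetical way to implement such a balance: a sampler that begins pretraining with mostly text-only batches and gradually raises the share of image-text batches. The ratios, the linear schedule, and the function names are assumptions for illustration rather than the paper's reported recipe.

```python
import random

def vl_fraction(step: int, total_steps: int,
                start: float = 0.1, end: float = 0.7) -> float:
    """Fraction of vision-language batches at a given training step.
    Linearly warms up from a mostly-language diet to a mixed one (illustrative values)."""
    progress = min(step / max(total_steps, 1), 1.0)
    return start + progress * (end - start)

def sample_batch(step: int, total_steps: int, text_batches, vl_batches, rng=random):
    """Draw a text-only or an image-text batch according to the schedule, so the
    language-only gradient signal is never starved out early in pretraining."""
    if rng.random() < vl_fraction(step, total_steps):
        return next(vl_batches), "vision-language"
    return next(text_batches), "text-only"

# Usage sketch (text_batches / vl_batches are iterators over pre-tokenized batches):
# for step in range(total_steps):
#     batch, kind = sample_batch(step, total_steps, text_batches, vl_batches)
#     model(**batch).loss.backward(); optimizer.step(); optimizer.zero_grad()
```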
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepSeek-VL, an open-source family of vision-language models (1.3B and 7B) for real-world applications. It details a data pipeline emphasizing diversity and coverage of practical scenarios (web screenshots, PDFs, OCR, charts, knowledge content), construction of an instruction-tuning dataset via a use-case taxonomy derived from real user scenarios, a hybrid vision encoder supporting 1024x1024 images at modest compute cost, and an early-integration VL pretraining strategy that interleaves LLM training to preserve language capabilities while addressing modality competition. The central claims are that the resulting models provide superior chatbot user experience in practical settings, achieve SOTA or competitive scores on a range of VL benchmarks at comparable sizes, and retain robust performance on language-centric benchmarks, with both model sizes released publicly.
Significance. If the empirical results hold, the work offers a publicly available VL model explicitly tuned for real-world utility and efficiency, with a design that prioritizes retention of base LLM strengths. The hybrid encoder and taxonomy-driven instruction data represent concrete engineering choices that could inform subsequent multimodal systems. Public model release enables direct verification and extension.
major comments (2)
- [§3] Abstract and §3 (VL pretraining strategy): The claim that the early-integration pretraining 'ensures the preservation of LLM capabilities' and yields 'robust performance on language-centric benchmarks' is load-bearing for the no-trade-off assertion. The manuscript reports only final VL-model scores; without side-by-side tables comparing the 1.3B/7B DeepSeek-VL variants to the unmodified DeepSeek-LLM baselines on identical language tasks (e.g., MMLU, GSM8K), the effectiveness of the strategy in managing vision-language competition cannot be verified.
- [§4] §4 (Experiments and benchmarks): The abstract asserts 'state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size.' The reported numbers must be accompanied by explicit model-size-matched baselines (e.g., LLaVA-1.5-7B, Qwen-VL-7B) and ablation results isolating the contribution of the hybrid encoder and the use-case instruction dataset; otherwise the superiority claim rests on incomplete controls.
minor comments (2)
- [§2.2] The hybrid vision encoder architecture (described in §2.2) would benefit from a diagram showing the integration points with the LLM and the exact tokenization of high-resolution patches; a worked patch-count example is sketched after this list.
- [§2.1] Dataset statistics (total tokens, image-text pair counts, taxonomy coverage percentages) are referenced but not tabulated; adding a summary table would strengthen the 'diverse, scalable' claim.
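To show why the tokenization detail matters, here is a quick worked example of the patch-count arithmetic under assumed values (the paper's actual patch size, pooling, and token count may differ): a 1024x1024 image at a 16-pixel patch size produces 64 x 64 = 4096 raw patches, so some pooling or merging step is needed to bring the visual sequence down to the few hundred tokens an LLM can afford per image.

```python
def visual_token_budget(image_size: int = 1024, patch_size: int = 16,
                        pool_factor: int = 4) -> tuple[int, int]:
    """Raw patch count for a square image, and the count after an assumed
    pool_factor x pool_factor spatial downsampling (all numbers illustrative)."""
    per_side = image_size // patch_size        # 1024 / 16 = 64 patches per side
    raw = per_side ** 2                        # 64 * 64 = 4096 raw patches
    pooled = (per_side // pool_factor) ** 2    # (64 / 4)^2 = 256 tokens after pooling
    return raw, pooled

print(visual_token_budget())  # -> (4096, 256)
```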
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to strengthen the empirical support for our claims.
Point-by-point responses
-
Referee: [§3] Abstract and §3 (VL pretraining strategy): The claim that the early-integration pretraining 'ensures the preservation of LLM capabilities' and yields 'robust performance on language-centric benchmarks' is load-bearing for the no-trade-off assertion. The manuscript reports only final VL-model scores; without side-by-side tables comparing the 1.3B/7B DeepSeek-VL variants to the unmodified DeepSeek-LLM baselines on identical language tasks (e.g., MMLU, GSM8K), the effectiveness of the strategy in managing vision-language competition cannot be verified.
Authors: We agree that direct side-by-side comparisons to the base DeepSeek-LLM models on language-only benchmarks would provide clearer evidence for the effectiveness of our early-integration pretraining strategy. We will add a dedicated table in the revised manuscript reporting results on MMLU, GSM8K, and similar tasks for both the 1.3B and 7B DeepSeek-VL models alongside the unmodified DeepSeek-LLM baselines. revision: yes
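A minimal sketch of the kind of side-by-side check being promised, assuming both the base LLM and the VL checkpoint can be scored as causal language models on text-only multiple-choice items; the checkpoint names are placeholders, the prompt-prefix trick is an approximation, and a real evaluation would go through each benchmark's official harness and few-shot format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def choice_logprob(model, tokenizer, question: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the question.
    Uses the question tokenization as an approximate prefix of the full sequence."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Logits at position t predict token t+1, so score only the answer span.
    return sum(log_probs[pos, full_ids[0, pos + 1]].item()
               for pos in range(prompt_len - 1, full_ids.shape[1] - 1))

def accuracy(model, tokenizer, items) -> float:
    """items: iterable of (question, choices, correct_index)."""
    items = list(items)
    hits = 0
    for question, choices, answer_idx in items:
        scores = [choice_logprob(model, tokenizer, question, c) for c in choices]
        hits += int(max(range(len(scores)), key=scores.__getitem__) == answer_idx)
    return hits / len(items)

# Usage sketch (placeholder checkpoint names; the VL checkpoint may need its own loading code):
# for name in ["deepseek-ai/deepseek-llm-7b-base", "deepseek-ai/deepseek-vl-7b-base"]:
#     tok = AutoTokenizer.from_pretrained(name)
#     lm = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
#     print(name, round(accuracy(lm, tok, language_benchmark_items), 3))
```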
-
Referee: [§4] §4 (Experiments and benchmarks): The abstract asserts 'state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size.' The reported numbers must be accompanied by explicit model-size-matched baselines (e.g., LLaVA-1.5-7B, Qwen-VL-7B) and ablation results isolating the contribution of the hybrid encoder and the use-case instruction dataset; otherwise the superiority claim rests on incomplete controls.
Authors: The experiments section already includes comparisons against size-matched models such as LLaVA-1.5-7B and Qwen-VL-7B on the reported VL benchmarks. To address the request for stronger controls, we will explicitly annotate all tables with model sizes and add ablation studies that isolate the impact of the hybrid vision encoder and the use-case taxonomy instruction dataset. revision: yes
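And a small sketch of how the promised ablation grid might be organized, with both axes (encoder variant and instruction-data source) evaluated on the same benchmark suite; the configuration names and the training hook are hypothetical placeholders.

```python
from itertools import product

# Hypothetical ablation axes: hybrid vs. single-branch encoder, and
# taxonomy-derived vs. generic instruction-tuning data.
encoders = ["hybrid_highres", "single_lowres"]
instruction_sets = ["use_case_taxonomy", "generic_instructions"]

runs = []
for enc, sft in product(encoders, instruction_sets):
    run_name = f"deepseek-vl-7b__{enc}__{sft}"
    runs.append(run_name)
    # results[run_name] = train_and_evaluate(encoder=enc, instruction_set=sft)  # placeholder hook

print(runs)  # four configurations; deltas along each axis isolate each component's contribution
```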
Circularity Check
No significant circularity: empirical model development with external benchmark validation
Full rationale
The paper presents an empirical vision-language model (DeepSeek-VL) whose central claims rest on reported benchmark scores and user-experience improvements from instruction tuning on a constructed dataset. No mathematical derivations, equations, or predictions are present that could reduce to fitted parameters or self-definitions by construction. The described pretraining strategy, hybrid encoder, and data curation are design choices justified by practical considerations and external evaluations rather than internal circular logic. Self-citations, if any, are not load-bearing for the performance claims, which are falsifiable against public benchmarks. This is a standard model-release paper whose validity hinges on reproducibility of the reported numbers, not on any derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing · alexander_duality_circle_linking (unclear): "DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead."
- IndisputableMonolith.Foundation.PhiForcing · phi_equation (unclear): "The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks."
Forward citations
Cited by 33 Pith papers
-
SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models
SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
-
The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
Recorruption arises from visual attention suppression and positional bias in multimodal RAG; BAIR mitigates it via bottleneck attention intervention at inference time.
-
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models
Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregr...
-
Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding
PND reduces object hallucination in VLMs via a dual-path contrast during decoding that amplifies visual features and penalizes linguistic priors, achieving reported SOTA results on POPE, MME, and CHAIR without retraining.
-
PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation
PBS-VL trained on the new PBSInstr dataset outperforms general and pathology MLLMs on the PBSBench VQA tasks for hematopathology.
-
ROSE: Retrieval-Oriented Segmentation Enhancement
ROSE is a retrieval-augmented plug-in that improves MLLM segmentation on novel and emerging entities by fetching web text and images and deciding when to use them.
-
DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized metho...
-
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
-
AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.
-
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
Online Self-Calibration Against Hallucination in Vision-Language Models
OSCAR exploits the generative-discriminative gap in LVLMs to build online preference data with MCTS and dual-granularity rewards for DPO-based calibration, claiming SOTA hallucination reduction and improved multimodal...
-
SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct activates spatial awareness in VLMs using action retrospection, future frame prediction, and progressive curriculum learning to reach SOTA on VLN-CE benchmarks.
-
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
-
R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs
R-CoV is a six-step region-aware chain-of-verification technique that elicits coordinate and description outputs from LVLMs themselves to detect and reduce object hallucinations without external models or retraining.
-
If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems
LVLM-based agents exhibit trust boundary confusion with visual injections and a multi-agent defense separating perception from decision-making reduces misleading responses while preserving correct ones.
-
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
-
Boosting Visual Instruction Tuning with Self-Supervised Guidance
Mixing 3-10% of visually grounded self-supervised instructions into visual instruction tuning consistently boosts MLLM performance on vision-centric benchmarks.
-
Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Symbiotic-MoE introduces modality-aware expert disentanglement and progressive training in a multimodal MoE to achieve synergistic generation and understanding without task interference or extra parameters.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
-
DBLP: Phase-Aware Bounded-Loss Transport for Burst-Resilient Distributed ML Training
DBLP is a training-phase-aware bounded-loss transport protocol that reduces end-to-end distributed ML training time by 24.4% on average (up to 33.9%) and achieves up to 5.88x communication speedup during microbursts w...
-
Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models
A self-captioning method using a Multimodal Interaction Gate amplifies redundant interactions to reduce visual-induced errors by 38.3% and improve consistency by 16.8% in vision-language models.
-
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
-
Make Your LVLM KV Cache More Lightweight
LightKV compresses vision-token KV cache in LVLMs to 55% size via prompt-guided cross-modality aggregation, halving cache memory, cutting compute 40%, and maintaining performance on benchmarks.
-
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
-
UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training
UniEP fuses MoE communication and computation into unified MegaKernels with deterministic token ordering, delivering 1.03x-1.38x speedups over prior work while preserving training accuracy.
-
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.