Recognition: 2 theorem links · Lean Theorem
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Pith reviewed 2026-05-17 07:39 UTC · model grok-4.3
The pith
Mini-Gemini narrows the gap with top vision-language models by adding high-resolution image handling, better data, and self-guided generation without raising token counts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mini-Gemini mines the potential of multi-modality vision-language models through three additions: high-resolution visual tokens refined by an additional encoder without increasing token count, a high-quality dataset that supports precise comprehension and reasoning-based generation, and VLM-guided generation. Together these enable simultaneous image understanding, reasoning, and generation, support dense and MoE large language models from 2B to 34B, and yield leading results on multiple zero-shot benchmarks.
What carries the argument
The three-aspect enhancement framework that pairs an extra high-resolution visual encoder with a curated dataset and VLM-guided generation to expand capabilities while holding visual token count steady.
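A worked form of the refinement step makes the token-count claim concrete. The equation below is the patch-info-mining expression quoted later on this page; reading Q as the low-resolution visual queries, K and V as high-resolution features from the auxiliary encoder, and φ as a flattening/projection map is an interpretive gloss drawn from the paper's description rather than stated in this summary.

```latex
% Patch info mining, as quoted from the paper (symbol meanings assumed):
%   Q     : N low-resolution visual tokens from the base encoder (queries)
%   K, V  : high-resolution candidate features from the auxiliary encoder
%   \phi  : flattening/projection of the corresponding feature maps
% T_V has exactly as many rows as Q, so the visual token count passed to the LLM is unchanged.
T_V = \operatorname{MLP}\!\left( Q + \operatorname{Softmax}\!\left( \phi(Q)\,\phi(K)^{\top} \right)\, \phi(V) \right)
```

Because the attention output is indexed by the query rows, the auxiliary encoder changes what each visual token encodes, not how many tokens there are.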
If this is right
- VLMs gain the ability to perform any-to-any workflows that include both understanding and generation in one session.
- The same base models now reach or exceed some private models on several zero-shot benchmarks.
- The method applies uniformly to both dense and mixture-of-experts language models across the 2B-to-34B size range.
- Image reasoning and generation become native operations inside the same forward pass rather than separate stages.
Where Pith is reading between the lines
- Resolution limits in current visual tokenizers may be a larger bottleneck than previously assumed, since extra detail can be added without token inflation.
- High-quality paired data focused on reasoning chains could transfer to other multimodal tasks beyond the ones tested here.
- Self-guided generation opens a route for iterative refinement loops that stay inside the model rather than relying on external tools.
- The approach may generalize to video or other sequential visual inputs by extending the same high-resolution refinement idea.
Load-bearing premise
The reported gains come mainly from the added high-resolution encoder, the new dataset, and the guided generation step rather than from undisclosed training choices or benchmark selection.
What would settle it
A controlled experiment that removes the high-resolution encoder or the new dataset and still matches the original benchmark scores on the same zero-shot tasks would falsify the claim that these three components are the primary drivers.
read the original abstract
In this work, we introduce Mini-Gemini, a simple and effective framework enhancing multi-modality Vision Language Models (VLMs). Despite the advancements in VLMs facilitating basic visual dialog and reasoning, a performance gap persists compared to advanced models like GPT-4 and Gemini. We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance visual tokens, we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers current frameworks with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B. It is demonstrated to achieve leading performance in several zero-shot benchmarks and even surpasses the developed private models. Code and models are available at https://github.com/dvlab-research/MiniGemini.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mini-Gemini, a framework for enhancing multi-modal vision-language models via three components: an additional high-resolution visual encoder that refines tokens without increasing their count, a constructed high-quality dataset promoting precise comprehension and reasoning-based generation, and VLM-guided generation. The method supports dense and MoE LLMs ranging from 2B to 34B parameters and reports leading results on multiple zero-shot benchmarks, sometimes surpassing private models.
Significance. If the reported gains can be shown to arise primarily from the three proposed components rather than from uncontrolled differences in training regime or evaluation protocol, the work would offer a practical route to improve VLM efficiency and capability in understanding, reasoning, and generation tasks while keeping visual token budgets fixed. Open-sourcing of code and models supports reproducibility.
major comments (2)
- [§4] §4 (Experiments): the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.
- [Tables 1–3] Main results tables (e.g., Tables 1–3): reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.
minor comments (2)
- [§3] The integration of the high-resolution encoder with the base visual encoder is described at a high level; a precise statement of how token count is preserved (e.g., via pooling or projection) would aid clarity. An illustrative sketch of one token-preserving fusion follows these comments.
- [Abstract] The abstract states that Mini-Gemini 'surpasses the developed private models' without naming those models or providing the corresponding scores in the main text.
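The pooling-versus-projection question can be made concrete with a minimal, hypothetical sketch; the module names, dimensions, and grid sizes below are placeholders, not the paper's implementation. It shows one standard way an auxiliary high-resolution encoder's features can be folded in by cross-attention so the number of visual tokens handed to the LLM never changes.

```python
import torch
import torch.nn as nn

class HiResRefiner(nn.Module):
    """Hypothetical sketch: refine N low-resolution visual tokens with M
    high-resolution features via cross-attention, keeping the original N tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, low_res_tokens: torch.Tensor, high_res_feats: torch.Tensor) -> torch.Tensor:
        # low_res_tokens: (B, N, dim) from the base visual encoder -> queries
        # high_res_feats: (B, M, dim) from the auxiliary encoder  -> keys/values, M >> N
        refined, _ = self.attn(low_res_tokens, high_res_feats, high_res_feats)
        return self.mlp(low_res_tokens + refined)  # (B, N, dim): still N tokens

# Shape check with made-up sizes: 576 low-res tokens stay 576 after refinement.
refiner = HiResRefiner(dim=1024)
low = torch.randn(2, 576, 1024)    # e.g. a 24x24 low-resolution grid
high = torch.randn(2, 2304, 1024)  # e.g. a 48x48 high-resolution grid
assert refiner(low, high).shape == low.shape
```

Whether Mini-Gemini uses exactly this layout is precisely what the minor comment asks the authors to spell out.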
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications on our experimental design and indicate where revisions will be made to strengthen the presentation.
read point-by-point responses
-
Referee: [§4] §4 (Experiments): the central claim that performance improvements stem from the high-resolution encoder, high-quality dataset, and VLM-guided generation is not supported by controlled ablations. No factorial experiments hold training details, optimizer settings, data mixture ratios, and epoch counts fixed while toggling each component independently; comparisons to baselines therefore leave open the possibility that gains arise from undisclosed hyperparameter tuning or benchmark selection.
Authors: We appreciate the referee's point regarding the need for more rigorously controlled ablations. The manuscript reports incremental results when adding each component (high-resolution refinement, curated data, and guided generation) while keeping the underlying LLM and training framework consistent across scales from 2B to 34B. However, we acknowledge that a complete factorial design holding every hyperparameter, data ratio, and epoch count fixed would provide stronger isolation of effects. Full factorial experiments at the 34B scale are computationally prohibitive; therefore, in the revision we will add targeted controlled ablations on the 2B and 7B models, fixing all other variables and toggling one component at a time, to better substantiate the contribution of each element; a schematic of such a toggle grid is sketched after these responses. revision: yes
-
Referee: [Tables 1–3] Main results tables (e.g., Tables 1–3): reported scores lack error bars, standard deviations, or the number of independent runs, so it is impossible to assess whether the claimed leadership on zero-shot benchmarks is statistically robust.
Authors: We agree that reporting variability would allow better evaluation of statistical robustness. Given the substantial compute required to train and evaluate models up to 34B parameters, multiple independent runs per configuration were not feasible, which is a common practical constraint in large-scale VLM literature. In the revised manuscript we will add an explicit discussion of this limitation in Section 4, note that all compared models followed identical training protocols, and emphasize that performance gains remain consistent across diverse benchmarks and model sizes as supporting evidence of reliability. revision: partial
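As a concrete reading of the ablation and robustness discussion above, the sketch below enumerates the run configurations such a design implies. The component names and the idea of pairing each configuration with a few seeds are illustrative placeholders, not the paper's actual protocol.

```python
from itertools import product

# Placeholder names for the three components under discussion.
COMPONENTS = ("hires_encoder", "curated_data", "vlm_guided_gen")

def leave_one_out():
    """One-factor-at-a-time ablation: the full system minus exactly one component."""
    full = dict.fromkeys(COMPONENTS, True)
    for dropped in COMPONENTS:
        yield {**full, dropped: False}

def full_factorial():
    """Complete 2^3 grid, as a referee-style factorial design would require."""
    for flags in product((False, True), repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, flags))

# Every configuration would hold optimizer, data mixture, and epoch count fixed,
# varying only these flags (and, compute permitting, a handful of seeds for error bars).
for cfg in leave_one_out():
    print(cfg)
```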
Circularity Check
Empirical framework evaluated on external benchmarks exhibits no circularity
full rationale
The manuscript introduces Mini-Gemini as an engineering framework that augments existing VLMs via a high-resolution encoder, a constructed high-quality dataset, and VLM-guided generation. Performance is reported on independent zero-shot benchmarks external to the paper. No equations, fitted parameters, or predictions are defined in terms of themselves; no self-citation chain is invoked to justify uniqueness or load-bearing premises; and no renaming of known results occurs. The derivation chain consists of component proposals followed by empirical measurement against outside references, satisfying the criteria for a self-contained, non-circular contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- high-resolution encoder hyperparameters
axioms (1)
- domain assumption: An auxiliary high-resolution visual encoder can improve feature quality while keeping visual token count unchanged.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
we propose to utilize an additional visual encoder for high-resolution refinement without increasing the visual token count... patch info mining... T_V = MLP(Q + Softmax(φ(Q) × φ(K)^T) × φ(V))
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B... leading performance in several zero-shot benchmarks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 19 Pith papers
-
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
-
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
-
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance...
-
GRIP-VLM: Group-Relative Importance Pruning for Efficient Vision-Language Models
GRIP-VLM applies group-relative policy optimization via reinforcement learning to prune visual tokens in VLMs, yielding up to 15% inference speedup at matched accuracy over prior methods.
-
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
-
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
-
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
-
SpatialStack: Layered Geometry-Language Fusion for 3D VLM Spatial Reasoning
SpatialStack improves 3D spatial reasoning in vision-language models by stacking and synchronizing multi-level geometric features with the language backbone.
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.
-
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.
-
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
-
Analogical Reasoning as a Doctor: A Foundation Model for Gastrointestinal Endoscopy Diagnosis
RATNet applies analogical reasoning via a cyclic pre-training strategy to outperform prior foundation models in GI endoscopy diagnosis across diagnosis, few-shot, zero-shot, robustness, adaptation, and federated scenarios.
-
CogVLM2: Visual Language Models for Image and Video Understanding
CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
-
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
-
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
-
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open 3B VLM based on SigLIP and Gemma that achieves strong performance on nearly 40 diverse open-world tasks including benchmarks, remote-sensing, and segmentation.
-
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Reference graph
Works this paper leans on
-
[1]
OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, 2023. 2
work page 2023
-
[2]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv:2205.01068, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[3]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023. 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv:2312.11805, 2023. 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[6]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[7]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023. 2, 3, 4
work page 2023
-
[8]
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv:2304.10592, 2023. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Llama-vid: An image is worth 2 tokens in large language models
Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv:2311.17043, 2023. 2, 6
-
[11]
Llava-next: Improved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, 2024. URL https://llava-vl.github.io/blog/2024-01-30-llava-next/. 2, 5, 6, 8
work page 2024
-
[12]
Otterhd: A high- resolution multi-modality model
Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, and Ziwei Liu. Otterhd: A high- resolution multi-modality model. arXiv:2311.04219, 2023. 2, 6
-
[13]
Introducing our multimodal models, 2023
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b. 2
work page 2023
-
[14]
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv:2311.12793, 2023. 2, 5, 8
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
ALLaVA: Harnessing GPT4V-Synthesized Data for a Lite Vision-Language Model
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv:2402.11684, 2024. 2, 5, 8
-
[16]
Making the v in vqa matter: Elevating the role of image understanding in visual question answering
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017. 2, 3
work page 2017
-
[17]
Document collection visual question answering
Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In ICDAR 2021, 2021. 5, 11
work page 2021
-
[18]
ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv:2203.10244, 2022. 5, 11
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In ECCV, 2016. 2, 5, 11
work page 2016
-
[20]
Lima: Less is more for alignment
Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024. 2, 5
work page 2024
-
[21]
Openassistant conversations- democratizing large language model alignment
Andreas Köpf, Yannic Kilcher, Dimitri von Rütte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Nguyen, Oliver Stanley, Richárd Nagyfi, et al. Openassistant conversations- democratizing large language model alignment. Advances in Neural Information Processing Systems , 36,
-
[22]
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023. 2, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv:2308.12966,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv:2307.06281, 2023. 2, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for exp...
work page 2024
-
[26]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 2
work page 2017
-
[27]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirec- tional transformers for language understanding. arXiv:1810.04805, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[28]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020. 2
work page 2020
-
[29]
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv:2401.04088, 2024. 2, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[30]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv:2109.01652, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[31]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022. 3
work page 2022
- [32]
-
[33]
Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://lmsys.org/blog/2023-03-30-vicuna/,
work page 2023
-
[34]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv:2303.04671, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[35]
Gpt4tools: Teaching large language model to use tools via self-instruction
Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction. arXiv:2305.18752, 2023. 3
-
[36]
Gemma: Introducing new state-of-the-art open models
Google. Gemma: Introducing new state-of-the-art open models. https://blog.google/technology/developers/gemma-open-models/, 2024. 3, 6
work page 2024
-
[37]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325,
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In NeurIPS, 2022. 3
work page 2022
-
[39]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv:2308.00692, 2023. 3
-
[40]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 3, 4, 6
work page 2021
-
[41]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022. 3
work page 2022
-
[42]
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv:2305.06500, 2023. 3, 4, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[43]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv:2310.03744, 2023. 3, 5, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and com...
work page internal anchor Pith review arXiv 2023
-
[45]
Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and comprehe...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Emu: Generative Pretraining in Multimodality
Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[47]
Generative multimodal models are in-context learners
Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, et al. Generative multimodal models are in-context learners. arXiv preprint arXiv:2312.13286, 2023. 3, 5
-
[48]
Planting a seed of vision in large language model
Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023. 3
-
[49]
Making llama see and draw with seed tokenizer
Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making llama see and draw with seed tokenizer. arXiv preprint arXiv:2310.01218, 2023. 3
-
[50]
Llmga: Multimodal large language model based generation assistant
Bin Xia, Shiyin Wang, Yingfan Tao, Yitong Wang, and Jiaya Jia. Llmga: Multimodal large language model based generation assistant. arXiv preprint arXiv:2311.16500, 2023. 3, 5, 15
-
[51]
Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model
Xiaowei Chi, Yijiang Liu, Zhengkai Jiang, Rongyu Zhang, Ziyi Lin, Renrui Zhang, Peng Gao, Chaoyou Fu, Shanghang Zhang, Qifeng Liu, et al. Chatillusion: Efficient-aligning interleaved generation ability with visual instruction model. arXiv preprint arXiv:2311.17963, 2023. 8, 10, 15
-
[52]
Anygpt: Unified multimodal llm with discrete sequence modeling
Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024. 3, 5, 8, 10
-
[53]
Improving image generation with better captions
James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf , 2(3):8, 2023. 3, 5
work page 2023
-
[54]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade W Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa R Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5b: An open large-scale dataset for training next generation image-text mod...
work page 2022
-
[55]
Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022. 4
work page 2022
-
[56]
Pablo Pernias, Dominic Rampas, Mats L. Richter, Christopher J. Pal, and Marc Aubreville. Wuerstchen: An efficient architecture for large-scale text-to-image diffusion models, 2023. 5
work page 2023
-
[57]
Video generation models as world simulators
OpenAI. Video generation models as world simulators. URL https://openai.com/research/video-generation-models-as-world-simulators. 5
-
[58]
Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018. 5
work page 2018
-
[59]
Textcaps: a dataset for image captioning with reading comprehension
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In ECCV, 2020. 5, 8
work page 2020
-
[60]
Laion/gpt4v-dataset · datasets at hugging face
LAION e.V. Laion/gpt4v-dataset · datasets at hugging face. URL https://huggingface.co/datasets/laion/gpt4v-dataset. 5, 8
-
[61]
Dvqa: Understanding data visualizations via question answering
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. Dvqa: Understanding data visualizations via question answering. In CVPR, 2018. 5
work page 2018
-
[62]
stable-diffusion-prompts
stable-diffusion-prompts. URL https://www.gigasheet.com/sample-data/stable-diffusion-prompts. 5
-
[63]
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv:2312.16886, 2023. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[64]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv:2306.15195, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
Introducing idefics: An open reproduction of state-of-the-art visual language model
IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023. 6
work page 2023
-
[66]
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv:2311.03079, 2023. 6
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[67]
Towards vqa models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In CVPR, 2019. 6, 7, 8, 11
work page 2019
-
[68]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv:2306.13394, 2023. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[69]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv:2308.02490,
work page internal anchor Pith review Pith/arXiv arXiv
-
[70]
Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. In ICLR, 2024. 6, 7
work page 2024
-
[71]
Awesome multilingual ocr toolkits based on paddlepaddle
PaddleOCR. Awesome multilingual ocr toolkits based on paddlepaddle. URL https://github.com/PaddlePaddle/PaddleOCR. 11