Recognition: 3 theorem links
· Lean TheoremVisual Instruction Tuning
Pith reviewed 2026-05-11 08:16 UTC · model grok-4.3
The pith
Using GPT-4 to generate visual instruction data trains an end-to-end model that connects vision and language for general-purpose multimodal understanding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using language-only GPT-4 to generate multimodal language-image instruction-following data, the authors instruction-tune LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and LLM. This produces strong zero-shot multimodal chat abilities that sometimes match multimodal GPT-4 behaviors on unseen images and instructions, an 85.1 percent relative score against GPT-4 on a synthetic instruction-following dataset, and a new state-of-the-art 92.53 percent accuracy on Science QA when the model is fine-tuned in synergy with GPT-4.
What carries the argument
LLaVA, the end-to-end trained large multimodal model that connects a vision encoder and LLM via instruction tuning on GPT-4 generated visual instruction data.
If this is right
- LLaVA exhibits impressive multimodal chat abilities on unseen images and instructions, sometimes matching multimodal GPT-4 behaviors.
- The model reaches an 85.1 percent relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
- Fine-tuning LLaVA in synergy with GPT-4 produces a new state-of-the-art accuracy of 92.53 percent on Science QA.
- The approach extends the benefits of instruction tuning from text-only LLMs into the multimodal setting using only language-based data generation.
Where Pith is reading between the lines
- The data-generation technique could be iterated to create larger or more specialized training sets for other visual reasoning tasks without additional human labeling.
- Public release of the generated data, model weights, and code base allows direct measurement of how well the same pipeline transfers to new vision encoders or language models.
- The observed synergy on Science QA suggests that hybrid systems pairing a tuned multimodal model with a stronger language model may outperform either component alone on grounded reasoning benchmarks.
Load-bearing premise
GPT-4 can generate sufficiently diverse, accurate, and representative multimodal instruction-following data from language-only inputs to support effective training and generalization to real images and instructions.
What would settle it
Testing LLaVA on a broad collection of real-world photographs paired with instructions that were never described in the GPT-4 data generation process and measuring whether accuracy falls substantially below the reported levels on the synthetic benchmark or Science QA.
read the original abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding.Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce LLaVA, the first large multimodal model trained by instruction tuning on data generated using language-only GPT-4. It reports that LLaVA exhibits impressive multimodal chat abilities, achieving an 85.1% relative score to GPT-4 on a synthetic multimodal instruction-following dataset, and reaches a new SOTA accuracy of 92.53% on Science QA through synergy with GPT-4. The GPT-4 generated data, model, and codebase are released publicly.
Significance. This is an early and influential attempt at visual instruction tuning, potentially opening a new direction for training general-purpose vision-language models. The public availability of the generated dataset and model is a notable strength that facilitates reproducibility and further research. If the generalization claims hold beyond synthetic data, it could significantly impact the development of multimodal assistants.
major comments (3)
- Abstract: The reported 85.1% relative score compared with GPT-4 is evaluated on a synthetic multimodal instruction-following dataset generated using GPT-4. This setup introduces circularity, as the model is trained and tested on data from the same source, which does not provide strong evidence for generalization to real-world unseen images and instructions as claimed.
- Abstract: The new state-of-the-art accuracy of 92.53% on Science QA is achieved via 'synergy of LLaVA and GPT-4', but the manuscript does not specify the exact integration method (e.g., whether GPT-4 is used for answer generation or verification), nor does it report LLaVA's standalone performance or comparisons to other fine-tuned models without GPT-4 assistance.
- Data generation section (likely §3): The process of using language-only GPT-4 to generate multimodal instruction data is not described in sufficient detail, including the specific prompts, how image content is conveyed (e.g., via captions or other textual proxies), and measures to ensure diversity and accuracy. This is critical because any limitations in the generated data would directly affect the trained model's capabilities on real images.
minor comments (2)
- Abstract: Typo: 'multimodel chat abilities' should be 'multimodal chat abilities'.
- Abstract: The phrase 'early experiments' indicates preliminary results; consider adding a limitations section discussing potential issues with synthetic data and generalization.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.
read point-by-point responses
-
Referee: Abstract: The reported 85.1% relative score compared with GPT-4 is evaluated on a synthetic multimodal instruction-following dataset generated using GPT-4. This setup introduces circularity, as the model is trained and tested on data from the same source, which does not provide strong evidence for generalization to real-world unseen images and instructions as claimed.
Authors: We agree that evaluating on GPT-4-generated data introduces a degree of circularity. The 85.1% figure measures performance on a held-out test set of synthetic instructions, distinct from the training data. This serves as a proxy to assess how well LLaVA can follow complex multimodal instructions in a manner similar to GPT-4. To support claims of generalization, the manuscript includes qualitative demonstrations on real, unseen images and instructions, as well as results on the ScienceQA benchmark featuring authentic images and questions. We will revise the abstract to more precisely describe the evaluation setup and avoid overstating generalization based solely on this metric. revision: partial
-
Referee: Abstract: The new state-of-the-art accuracy of 92.53% on Science QA is achieved via 'synergy of LLaVA and GPT-4', but the manuscript does not specify the exact integration method (e.g., whether GPT-4 is used for answer generation or verification), nor does it report LLaVA's standalone performance or comparisons to other fine-tuned models without GPT-4 assistance.
Authors: Thank you for this observation. The synergy involves LLaVA generating candidate answers which are then refined or verified using GPT-4, but we recognize that the description lacks specificity. In the revision, we will provide a detailed explanation of the integration method in the relevant section, report the standalone accuracy of LLaVA on Science QA, and add comparisons to other fine-tuned vision-language models to better contextualize the results. revision: yes
-
Referee: Data generation section (likely §3): The process of using language-only GPT-4 to generate multimodal instruction data is not described in sufficient detail, including the specific prompts, how image content is conveyed (e.g., via captions or other textual proxies), and measures to ensure diversity and accuracy. This is critical because any limitations in the generated data would directly affect the trained model's capabilities on real images.
Authors: We concur that additional details on data generation are important for reproducibility and understanding potential limitations. The current manuscript outlines the high-level approach, but we will expand this section to include example prompts provided to GPT-4, clarify that image content is represented through rich textual captions and detected objects, and describe our efforts to ensure diversity (such as sampling varied image sources and instruction types) and accuracy (including manual inspection and filtering of generated data). These additions will help readers assess the quality of the training data. revision: yes
Circularity Check
No significant circularity in claimed results or method
full rationale
The paper presents an empirical construction of LLaVA via GPT-4-generated instruction data followed by end-to-end training and reports results on both a held-out synthetic split and the external ScienceQA benchmark. No mathematical derivation, equations, or uniqueness theorems are invoked; the 85.1% relative score is an explicit comparison against the data generator on data from the same distribution, which is a standard (if limited) evaluation choice rather than a reduction by construction. No self-citations appear as load-bearing premises, and the central claim—that the resulting model exhibits GPT-4-like chat behavior on unseen images—rests on the training procedure and qualitative examples rather than any self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption A pre-trained vision encoder can be effectively integrated with an LLM through instruction tuning for multimodal understanding.
Forward citations
Cited by 60 Pith papers
-
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
-
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
-
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
-
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
-
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
-
CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
-
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
-
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
-
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
-
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
-
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.
-
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
-
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
-
3D-VLA: A 3D Vision-Language-Action Generative World Model
3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
-
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
-
Evaluating Object Hallucination in Large Vision-Language Models
Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
-
WizardLM: Empowering large pre-trained language models to follow complex instructions
WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
-
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Personal Visual Context Learning in Large Multimodal Models
Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
-
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
-
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
-
ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction
ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...
-
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
-
Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
-
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
-
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
-
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...
-
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
-
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...
-
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
-
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
-
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Qwen3-Omni Technical Report
Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
-
Are We on the Right Way for Evaluating Large Vision-Language Models?
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
-
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
-
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
-
MMBench: Is Your Multi-modal Model an All-around Player?
MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
-
Kosmos-2: Grounding Multimodal Large Language Models to the World
Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
-
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
-
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition
A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.
-
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision
SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
-
Bolek: A Multimodal Language Model for Molecular Reasoning
Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary class...
-
Qwen3.5-Omni Technical Report
Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
-
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
-
ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.
-
Hallucination of Multimodal Large Language Models: A Survey
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
-
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
-
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
-
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
-
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
-
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.
Reference graph
Works this paper leans on
-
[1]
https://github.com/hwchase17/langchain, 2022
Langchain. https://github.com/hwchase17/langchain, 2022. 2
work page 2022
-
[2]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 2, 4
work page internal anchor Pith review arXiv 2022
-
[3]
Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition , 2018. 2
work page 2018
-
[4]
A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. 1
work page internal anchor Pith review arXiv 2021
-
[5]
Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023. 2, 6, 7
work page 2023
-
[6]
In- structpix2pix: Learning to follow image editing instructions
Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instruct pix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022. 2
-
[7]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020. 2
work page 1901
-
[8]
Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 2
work page 2021
-
[9]
Gonzalez, Ion Stoica, and Eric P
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 1, 2, 4, 5, 6
work page 2023
-
[10]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[11]
Scaling Instruction-Finetuned Language Models
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[12]
CVinW. Computer vision in the wild. https://github.com/ Computer-Vision-in-the-Wild/CVinW_Readings , 2022. 1
work page 2022
-
[13]
PaLM-E: An Embodied Multimodal Language Model
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 2
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[14]
Reinforce data, multiply impact: Improved model accuracy and robustness with dataset reinforcement
Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, and Oncel Tuzel. Reinforce data, multiply impact: Improved model accuracy and robustness with dataset reinforcement. arXiv preprint arXiv:2303.08983, 2023. 2
-
[15]
Make-a-scene: Scene-based text-to-image generation with human priors
Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors.ArXiv, abs/2203.13131,
-
[16]
Vision- language pre-training: Basics, recent advances, and future trends
Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision- language pre-training: Basics, recent advances, and future trends. F oundations and Trends® in Computer Graphics and Vision, 2022. 1
work page 2022
-
[17]
Chatgpt outperforms crowd-workers for text-annotation tasks
Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023. 3
-
[18]
Visual Programming: Compositional visual reasoning without training
Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022. 2
-
[19]
Towards learning a generic agent for vision-and-language navigation via pre-training
Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In CVPR, 2020. 2
work page 2020
-
[20]
Language is not all you need: Aligning perception with language models
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023. 2
-
[21]
Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. July 2021. If you use this software, please cite it as below. 1
work page 2021
-
[22]
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Dániel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017,
-
[23]
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022. 2
work page 2022
-
[24]
Grounding language models to images for multimodal generation
Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023. 2
-
[25]
Language- driven semantic segmentation
Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language- driven semantic segmentation. ICLR, 2022. 1
work page 2022
-
[26]
Multimodal founda- tion models: From specialists to general-purpose assistants
Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023. 1
-
[27]
ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models
Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEV ATER: A bench- mark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks, 2022. 1
work page 2022
-
[28]
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 4, 6, 7
work page internal anchor Pith review arXiv 2023
-
[29]
Grounded language-image pre-training
Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022. 1
work page 2022
-
[30]
Gligen: Open-set grounded text-to-image generation
Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023. 1
-
[31]
Microsoft COCO: Common objects in context
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV,
-
[32]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 9, 14 11
work page 2023
-
[33]
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1
work page Pith review arXiv 2023
-
[34]
Learn to explain: Multimodal reasoning via thought chains for science question answering
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems ,
-
[35]
OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023. 1, 2
work page 2023
- [36]
-
[37]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022. 2
work page 2022
-
[38]
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 1, 4
work page internal anchor Pith review arXiv 2023
-
[39]
Combined scaling for zero-shot transfer learning.arXiv preprint arXiv:2111.10050, 2021
Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scaling for open-vocabulary image classification. arXiv preprint arXiv: 2111.10050 , 2021. 1
-
[40]
Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 1, 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[41]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research , 2020. 2
work page 2020
-
[42]
Hierarchical Text-Conditional Image Generation with CLIP Latents
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv, abs/2204.06125, 2022. 1
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[43]
Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer
Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. CVPR, pages 10674–10685, 2022. 1
work page 2022
-
[44]
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487,
work page internal anchor Pith review arXiv
-
[45]
LAION-5B: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion- 5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[46]
Vipergpt: Visual inference via python execution for reasoning
Dídac Surís, Sachit Menon, and Carl V ondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 2
-
[47]
Habitat 2.0: Training home assistants to rearrange their habitat
Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir V ondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...
work page 2021
- [48]
-
[49]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
arXiv preprint arXiv:2205.14100 , year=
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 1
-
[51]
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instruc- tions. arXiv preprint arXiv:2212.10560, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[52]
arXiv preprint arXiv:2204.07705 , year=
Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022. 2
-
[53]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 2
work page internal anchor Pith review arXiv 2023
-
[54]
Unified contrastive learning in image-text-label space
Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Lu Yuan, Ce Liu, and Jianfeng Gao. Unified contrastive learning in image-text-label space. CVPR, 2022. 1
work page 2022
-
[55]
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 2
work page internal anchor Pith review arXiv 2023
-
[56]
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. ArXiv, abs/2206.10789, 2022. 1
work page internal anchor Pith review arXiv 2022
-
[57]
Florence: A new foundation model for computer vision
Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 1
-
[58]
A simple framework for open-vocabulary segmentation and detection
Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. arXiv preprint arXiv:2303.08131, 2023. 1
-
[59]
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 2, 8, 9
work page Pith review arXiv 2023
-
[60]
OPT: Open Pre-trained Transformer Language Models
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[61]
Multimodal Chain-of-Thought Reasoning in Language Models
Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multi- modal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023. 8, 9
work page internal anchor Pith review arXiv 2023
-
[62]
Regionclip: Region-based language-image pretraining
Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022. 1
work page 2022
-
[63]
Sometimes I just look at pictures of the Earth from space and I marvel at how beautiful it all is
Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. arXiv preprint arXiv:2212.11270, 2022. 1, 2 13 A Broader Impact The broader impact of LLaV A, a general-purpose visual assistant, has potential benefits and risks associated wi...
-
[64]
Gently toss them together to mix
In a large bowl, combine the sliced strawberries, blueberries, and carrot. Gently toss them together to mix
-
[65]
Drizzle the juice of half a lemon over the fruit and vegetables to prevent them from oxidizing and add a refreshing citrusflavor
-
[66]
Sprinkle the chopped fresh parsley or mint over the fruit and gently toss once more
-
[67]
Adjust the amount according to your taste preferences
If desired, add a drizzle of honey or maple syrup for a touch of sweetness. Adjust the amount according to your taste preferences
-
[68]
This allows the flavors to meld and creates a more refreshing salad
Cover the bowl and refrigerate the fruit salad for at least 30 minutes or up to a few hours. This allows the flavors to meld and creates a more refreshing salad
-
[69]
Serve the fruit salad in cups or bowls, garnished with a sprinkle of fresh parsley or mint, if desired. This fruit salad recipe is perfect for a healthy snack or side dish, and with the variety of ingredients available in your refrigerator, it should be easy to put together quickly. I would like to do the fruit salad. Show me the recipe. LLaVA LLaVA User ...
-
[70]
Instructions to launch the demo: link
-
[71]
All prompts and few shot examples for querying GPT-4: link
-
[72]
LLaV A-Instruct-158K: link
-
[73]
LLaV A-Bench: COCO, In-The-Wild
-
[74]
Model checkpoints. The size of the model checkpoints after compression is 25GB, which exceeds the 5GB limit of GitHub LFS (Large File Storage). We’ll release the checkpoint to the public, or upon request with reviewers for this submission. E Data Instructions for brief image description. The list of instructions used to briefly describe the image content ...
-
[75]
Cross-Country Skiing: Also known as Nordic skiing, cross-country skiing involves traversing relatively flat terrain or gentle slopes using skis and poles. It is a popular recreational and competitive sport that focuses on endurance, balance, and aerobic fitness. Cross-country skiers use their own locomotion to move across snow-covered terrain, rather than...
-
[76]
Downhill Skiing: Also known as alpine skiing, downhill skiing involves descending slopes at high speeds using skis and poles for balance and control. This sport is more focused on speed, technique, and navigating challenging terrain, including steep slopes, moguls, and even jumps. Downhill skiing can be further categorized into several disciplines, such a...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.