arXiv: 2304.08485 · v2 · submitted 2023-04-17 · cs.CV · cs.AI · cs.CL · cs.LG

Visual Instruction Tuning

Chunyuan Li, Haotian Liu, Qingyang Wu, Yong Jae Lee

Pith reviewed 2026-05-11 08:16 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.CL · cs.LG

keywords: visual instruction tuning · LLaVA · multimodal model · GPT-4 · vision-language understanding · instruction following · multimodal chat · Science QA

The pith

Using GPT-4 to generate visual instruction data trains an end-to-end model that connects vision and language for general-purpose multimodal understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that language-only GPT-4 can produce multimodal instruction-following examples by turning image descriptions into question-answer pairs. Training LLaVA on this synthetic data creates a model that links a vision encoder directly to a large language model for joint visual and textual reasoning. This matters because it shows how to build capable vision-language systems without collecting large human-annotated image-text datasets. The resulting model handles open-ended chat on new images and instructions, reaching 85.1 percent of GPT-4's score on a held-out synthetic benchmark. When LLaVA is fine-tuned on Science QA and combined with GPT-4, the pair reaches 92.53 percent accuracy, which the authors report as a new state of the art.
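The data-generation idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes the image is conveyed to the language-only model as captions plus labeled bounding boxes, and every name here (`DetectedObject`, `build_text_only_context`, `build_generation_prompt`), the box format, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    label: str
    # Normalized (x1, y1, x2, y2); this box format is an assumption.
    box: Tuple[float, float, float, float]


def build_text_only_context(captions: List[str], objects: List[DetectedObject]) -> str:
    """Render an image as plain text so a language-only model can reason about it."""
    caption_block = "\n".join(captions)
    object_lines = []
    for obj in objects:
        coords = ", ".join(f"{v:.3f}" for v in obj.box)
        object_lines.append(f"{obj.label}: [{coords}]")
    object_block = "\n".join(object_lines)
    return f"Captions:\n{caption_block}\n\nObjects:\n{object_block}"


def build_generation_prompt(context: str, n_pairs: int = 3) -> str:
    # Hypothetical instruction wording; not the prompt used in the paper.
    return (
        "You are looking at an image described below.\n"
        f"{context}\n\n"
        f"Write {n_pairs} question-answer pairs about the image as if you could see it."
    )


context = build_text_only_context(
    captions=["A man rides a bicycle down a wet street."],
    objects=[
        DetectedObject("person", (0.41, 0.22, 0.63, 0.88)),
        DetectedObject("bicycle", (0.38, 0.50, 0.70, 0.95)),
    ],
)
prompt = build_generation_prompt(context, n_pairs=2)
print(prompt)
```

The resulting string would be sent to a text-only model; the image itself is needed again only at training time, when the generated question-answer pairs are paired with the original picture.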

Core claim

By using language-only GPT-4 to generate multimodal language-image instruction-following data, the authors instruction-tune LLaVA, an end-to-end trained large multimodal model that connects a vision encoder and LLM. This produces strong zero-shot multimodal chat abilities that sometimes match multimodal GPT-4 behaviors on unseen images and instructions, an 85.1 percent relative score against GPT-4 on a synthetic instruction-following dataset, and a new state-of-the-art 92.53 percent accuracy on Science QA when the model is fine-tuned in synergy with GPT-4.

What carries the argument

LLaVA, the end-to-end trained large multimodal model that connects a vision encoder and LLM via instruction tuning on GPT-4 generated visual instruction data.

If this is right

LLaVA exhibits impressive multimodal chat abilities on unseen images and instructions, sometimes matching multimodal GPT-4 behaviors.
The model reaches an 85.1 percent relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset.
Fine-tuning LLaVA in synergy with GPT-4 produces a new state-of-the-art accuracy of 92.53 percent on Science QA.
The approach extends the benefits of instruction tuning from text-only LLMs into the multimodal setting using only language-based data generation.
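To make the "relative score" in the list above concrete, here is a minimal sketch of one way such a number could be computed. It assumes a judge has assigned a numeric quality rating to each answer from the candidate model and from the reference model, and that the relative score is the ratio of the two totals. The function name, the aggregation rule, and the ratings are illustrative assumptions, not values or code from the paper.

```python
from typing import Sequence


def relative_score(candidate: Sequence[float], reference: Sequence[float]) -> float:
    """Return the candidate's total rating as a percentage of the reference's total."""
    if len(candidate) != len(reference) or not candidate:
        raise ValueError("need two non-empty rating lists of equal length")
    reference_total = sum(reference)
    if reference_total == 0:
        raise ValueError("reference ratings sum to zero")
    return 100.0 * sum(candidate) / reference_total


# Made-up judge ratings for three questions (not data from the paper).
candidate_ratings = [7.0, 6.0, 8.0]
reference_ratings = [9.0, 8.0, 9.0]
score = relative_score(candidate_ratings, reference_ratings)
print(f"{score:.1f}")
```

A ratio of totals weights every question by the reference's rating; averaging per-question ratios instead would give a different number, so the aggregation rule matters when comparing reported scores.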

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The data-generation technique could be iterated to create larger or more specialized training sets for other visual reasoning tasks without additional human labeling.
Public release of the generated data, model weights, and code base allows direct measurement of how well the same pipeline transfers to new vision encoders or language models.
The observed synergy on Science QA suggests that hybrid systems pairing a tuned multimodal model with a stronger language model may outperform either component alone on grounded reasoning benchmarks.

Load-bearing premise

GPT-4 can generate sufficiently diverse, accurate, and representative multimodal instruction-following data from language-only inputs to support effective training and generalization to real images and instructions.

What would settle it

Testing LLaVA on a broad collection of real-world photographs paired with instructions that were never described in the GPT-4 data generation process and measuring whether accuracy falls substantially below the reported levels on the synthetic benchmark or Science QA.

Original abstract

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLaVA shows a workable pipeline for creating visual instruction data via text-only GPT-4, with public releases that help others, though the synthetic evaluation setup raises questions about generalization.

Hi,

The main takeaway here is that the authors show you can bootstrap a visual assistant by using language-only GPT-4 to create instruction data from image captions or descriptions, then fine-tune a vision encoder plus LLM model called LLaVA. This leads to decent chat performance and a high score on Science QA when combined with GPT-4.

What they do well is release the generated dataset, the model weights, and the code publicly. That lowers the barrier for follow-up work. The method is simple and extends the instruction tuning idea from text LLMs to the multimodal case without requiring access to a full multimodal GPT-4 for data labeling. The reported 92.53% on Science QA is a concrete number that shows some synergy.

The softer parts are around the evaluation. The 85.1% relative score is against GPT-4 on a held-out synthetic instruction dataset made the same way, which introduces circularity. Since the data generation uses only text prompts, it likely skips some visual specifics like precise spatial relations or subtle attributes that aren't in the captions. The abstract doesn't give full details on the prompt engineering, the image sources, or comparisons to other baselines, so it's tough to see exactly how much the visual component drives the results versus the underlying LLM. The concern about incomplete supervision from language-only generation looks real from the description.

That said, the work is an early step and the public resources make it worth looking at. This is aimed at people building or studying multimodal LLMs. It has enough new elements and reproducibility to merit peer review, though it would benefit from more tests on out-of-distribution real images. I'd recommend sending it for review.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce LLaVA, the first large multimodal model trained by instruction tuning on data generated using language-only GPT-4. It reports that LLaVA exhibits impressive multimodal chat abilities, achieving an 85.1% relative score to GPT-4 on a synthetic multimodal instruction-following dataset, and reaches a new SOTA accuracy of 92.53% on Science QA through synergy with GPT-4. The GPT-4 generated data, model, and codebase are released publicly.

Significance. This is an early and influential attempt at visual instruction tuning, potentially opening a new direction for training general-purpose vision-language models. The public availability of the generated dataset and model is a notable strength that facilitates reproducibility and further research. If the generalization claims hold beyond synthetic data, it could significantly impact the development of multimodal assistants.

major comments (3)

Abstract: The reported 85.1% relative score compared with GPT-4 is evaluated on a synthetic multimodal instruction-following dataset generated using GPT-4. This setup introduces circularity, as the model is trained and tested on data from the same source, which does not provide strong evidence for generalization to real-world unseen images and instructions as claimed.
Abstract: The new state-of-the-art accuracy of 92.53% on Science QA is achieved via 'synergy of LLaVA and GPT-4', but the manuscript does not specify the exact integration method (e.g., whether GPT-4 is used for answer generation or verification), nor does it report LLaVA's standalone performance or comparisons to other fine-tuned models without GPT-4 assistance.
Data generation section (likely §3): The process of using language-only GPT-4 to generate multimodal instruction data is not described in sufficient detail, including the specific prompts, how image content is conveyed (e.g., via captions or other textual proxies), and measures to ensure diversity and accuracy. This is critical because any limitations in the generated data would directly affect the trained model's capabilities on real images.

minor comments (2)

Abstract: Typo: 'multimodel chat abilities' should be 'multimodal chat abilities'.
Abstract: The phrase 'early experiments' indicates preliminary results; consider adding a limitations section discussing potential issues with synthetic data and generalization.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each of the major comments below and indicate the revisions we plan to make to strengthen the manuscript.

Referee: Abstract: The reported 85.1% relative score compared with GPT-4 is evaluated on a synthetic multimodal instruction-following dataset generated using GPT-4. This setup introduces circularity, as the model is trained and tested on data from the same source, which does not provide strong evidence for generalization to real-world unseen images and instructions as claimed.

Authors: We agree that evaluating on GPT-4-generated data introduces a degree of circularity. The 85.1% figure measures performance on a held-out test set of synthetic instructions, distinct from the training data. This serves as a proxy to assess how well LLaVA can follow complex multimodal instructions in a manner similar to GPT-4. To support claims of generalization, the manuscript includes qualitative demonstrations on real, unseen images and instructions, as well as results on the ScienceQA benchmark featuring authentic images and questions. We will revise the abstract to more precisely describe the evaluation setup and avoid overstating generalization based solely on this metric.
revision: partial
Referee: Abstract: The new state-of-the-art accuracy of 92.53% on Science QA is achieved via 'synergy of LLaVA and GPT-4', but the manuscript does not specify the exact integration method (e.g., whether GPT-4 is used for answer generation or verification), nor does it report LLaVA's standalone performance or comparisons to other fine-tuned models without GPT-4 assistance.

Authors: Thank you for this observation. The synergy involves LLaVA generating candidate answers which are then refined or verified using GPT-4, but we recognize that the description lacks specificity. In the revision, we will provide a detailed explanation of the integration method in the relevant section, report the standalone accuracy of LLaVA on Science QA, and add comparisons to other fine-tuned vision-language models to better contextualize the results.
revision: yes
Referee: Data generation section (likely §3): The process of using language-only GPT-4 to generate multimodal instruction data is not described in sufficient detail, including the specific prompts, how image content is conveyed (e.g., via captions or other textual proxies), and measures to ensure diversity and accuracy. This is critical because any limitations in the generated data would directly affect the trained model's capabilities on real images.

Authors: We concur that additional details on data generation are important for reproducibility and understanding potential limitations. The current manuscript outlines the high-level approach, but we will expand this section to include example prompts provided to GPT-4, clarify that image content is represented through rich textual captions and detected objects, and describe our efforts to ensure diversity (such as sampling varied image sources and instruction types) and accuracy (including manual inspection and filtering of generated data). These additions will help readers assess the quality of the training data.
revision: yes
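The second response above describes the Science QA synergy only loosely, as a multimodal model proposing an answer that a language model then verifies or refines. One possible reading of that description is sketched below. The control flow, the function name `answer_with_verification`, and both stub models are hypothetical; the source does not specify the actual integration method.

```python
from typing import Callable, Optional

# A proposer maps a question to a candidate answer.
Proposer = Callable[[str], str]
# A verifier returns None to accept the candidate, or a replacement answer.
Verifier = Callable[[str, str], Optional[str]]


def answer_with_verification(question: str, propose: Proposer, verify: Verifier) -> str:
    """Take the proposer's candidate unless the verifier supplies a replacement."""
    candidate = propose(question)
    replacement = verify(question, candidate)
    return candidate if replacement is None else replacement


# Stub models standing in for a multimodal proposer and a text-only verifier.
def stub_proposer(question: str) -> str:
    return "B"


def stub_verifier(question: str, candidate: str) -> Optional[str]:
    # Toy rule: override the candidate only for questions marked as text-only.
    return "A" if question.startswith("[text-only]") else None


kept = answer_with_verification("Which magnet pair attracts?", stub_proposer, stub_verifier)
overridden = answer_with_verification("[text-only] Which is a solid?", stub_proposer, stub_verifier)
print(kept, overridden)
```

Reporting accuracy for the proposer alone next to the combined system, as the referee requests, would show how much of the final number comes from the verification step.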

Circularity Check

0 steps flagged

No significant circularity in claimed results or method

The paper presents an empirical construction of LLaVA via GPT-4-generated instruction data followed by end-to-end training and reports results on both a held-out synthetic split and the external ScienceQA benchmark. No mathematical derivation, equations, or uniqueness theorems are invoked; the 85.1% relative score is an explicit comparison against the data generator on data from the same distribution, which is a standard (if limited) evaluation choice rather than a reduction by construction. No self-citations appear as load-bearing premises, and the central claim—that the resulting model exhibits GPT-4-like chat behavior on unseen images—rests on the training procedure and qualitative examples rather than any self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the unverified assumption in the abstract that GPT-4 generated data is of sufficient quality for training a general-purpose multimodal model.

axioms (1)

domain assumption: A pre-trained vision encoder can be effectively integrated with an LLM through instruction tuning for multimodal understanding.
The paper builds on the compatibility of vision encoders like CLIP and LLMs like Vicuna for end-to-end training.
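A toy sketch of what "integrating" a vision encoder with an LLM can mean in practice: visual features are mapped into the language model's embedding space and then placed alongside text token embeddings. The single linear projection, the ordering of visual before text tokens, and all dimensions and values below are illustrative assumptions rather than details taken from the text above.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply y = W x + b to each feature vector; weight has shape (d_out, d_in)."""
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


def build_input_sequence(visual_tokens: List[Vector], text_tokens: List[Vector]) -> List[Vector]:
    # Assumption: visual tokens are placed in front of the text tokens.
    return visual_tokens + text_tokens


# Two image patches with 3-dimensional features (toy stand-in for encoder output).
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
# Projection into a 2-dimensional "LLM embedding space" (toy sizes).
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, -0.5]

visual_tokens = project(patch_features, weight, bias)
text_tokens = [[0.1, 0.2]]  # one embedded text token
sequence = build_input_sequence(visual_tokens, text_tokens)
print(visual_tokens, len(sequence))
```

In a real system the projection weights would be learned during training, and the encoder and language model would be large pre-trained networks rather than hand-written lists.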

pith-pipeline@v0.9.0 · 5483 in / 1494 out tokens · 84318 ms · 2026-05-11T08:16:23.763353+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
cs.AI 2026-04 unverdicted novelty 8.0

FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
cs.AI 2024-04 accept novelty 8.0

OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
cs.CV 2023-10 accept novelty 8.0

MathVista benchmark shows GPT-4V achieves 49.9% accuracy on visual mathematical reasoning tasks, outperforming other models but trailing humans by 10.4%.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
cs.CV 2026-05 unverdicted novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Allegory of the Cave: Measurement-Grounded Vision-Language Learning
cs.AI 2026-05 unverdicted novelty 7.0

PRISM-VL improves VLM performance by grounding on RAW-derived Meas.-XYZ inputs and exposure-bracketed supervision, gaining +0.1074 BLEU and +4.46% LLM-Judge accuracy over an RGB baseline on a held-out benchmark.
CATS: Curvature Aware Temporal Selection for efficient long video understanding
cs.CV 2026-05 unverdicted novelty 7.0

CATS uses temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
cs.AI 2026-04 unverdicted novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
cs.CV 2026-04 unverdicted novelty 7.0

UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
Vision-Language Foundation Models for Comprehensive Automated Pavement Condition Assessment
cs.CV 2026-04 unverdicted novelty 7.0

Instruction-tuned vision-language model PaveGPT, trained on a large unified pavement dataset, achieves substantial gains over general models in comprehensive, standard-compliant pavement condition assessment.
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
cs.RO 2026-04 conditional novelty 7.0

StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.
MixAtlas: Uncertainty-aware Data Mixture Optimization for Multimodal LLM Midtraining
cs.LG 2026-04 unverdicted novelty 7.0

MixAtlas uses CLIP-based decomposition and Gaussian process optimization on small proxies to discover data mixtures that improve multimodal benchmark performance by up to 17.6% and transfer to larger models with faste...
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
cs.CV 2026-03 conditional novelty 7.0

OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
3D-VLA: A 3D Vision-Language-Action Generative World Model
cs.CV 2024-03 unverdicted novelty 7.0

3D-VLA is a new embodied foundation model that uses a 3D LLM plus aligned diffusion models to generate future images and point clouds for improved reasoning and action planning in 3D environments.
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
cs.CV 2024-01 conditional novelty 7.0

Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V
cs.CV 2023-10 accept novelty 7.0

Set-of-Mark prompting marks segmented image regions with alphanumerics and masks to let GPT-4V achieve state-of-the-art zero-shot results on referring expression comprehension and segmentation benchmarks like RefCOCOg.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
cs.CL 2023-07 unverdicted novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
WizardLM: Empowering large pre-trained language models to follow complex instructions
cs.CL 2023-04 conditional novelty 7.0

WizardLM uses LLM-driven iterative rewriting to generate complex instruction data and fine-tunes LLaMA to reach over 90% of ChatGPT capacity on 17 of 29 evaluated skills.
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
cs.CV 2023-03 conditional novelty 7.0

LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Personal Visual Context Learning in Large Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

Introduces Personal VCL formalization and benchmark revealing LMM context gaps, plus an Agentic Context Bank baseline that boosts personalized visual reasoning.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing
cs.CR 2026-05 unverdicted novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

COAST prunes 77.8% of visual tokens in LVLMs with a 2.15x speedup while keeping 98.64% of original performance by adaptively routing semantic and spatial context via contrastive scores.
ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction
cs.CV 2026-05 unverdicted novelty 6.0

ChartZero achieves zero-shot line chart data extraction by training only on synthetic mathematical functions, using a Global Orthogonal Instance loss to prevent curve fragmentation and a VLM-guided strategy for legend...
TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

TriRelVLA introduces triadic object-hand-task relational representations and a task-grounded graph transformer with a relational bottleneck to improve generalization in robotic manipulation across scenes, objects, and tasks.
Can Multimodal Large Language Models Understand Pathologic Movements? A Pilot Study on Seizure Semiology
cs.CV 2026-05 unverdicted novelty 6.0

MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.
VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models
cs.CR 2026-05 conditional novelty 6.0

Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.
HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering
cs.AI 2026-04 unverdicted novelty 6.0

HypEHR is a hyperbolic embedding model for EHR data that uses Lorentzian geometry and hierarchy-aware pretraining to answer clinical questions nearly as well as large language models but with much smaller size.
Text Steganography with Dynamic Codebook and Multimodal Large Language Model
cs.CR 2026-04 unverdicted novelty 6.0

A black-box text steganography method using a dynamic codebook generated by multimodal LLMs and reject-sampling feedback achieves higher embedding capacity and text quality than prior white-box and fixed-codebook blac...
SIF: Semantically In-Distribution Fingerprints for Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 6.0

SIF creates semantically in-distribution fingerprints for LVLMs by distilling text watermarks into visual inputs and optimizing for robustness against detection and modification.
AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
cs.CV 2026-04 unverdicted novelty 6.0

AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to nove...
See Fair, Speak Truth: Equitable Attention Improves Grounding and Reduces Hallucination in Vision-Language Alignment
cs.CV 2026-04 conditional novelty 6.0

Equitable attention via Dominant Object Penalty and Outlier Boost Coefficient reduces object hallucinations in multimodal LLMs without retraining.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV 2024-03 conditional novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6...
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
cs.CV 2024-01 unverdicted novelty 6.0

Grounded SAM integrates Grounding DINO and SAM to support text-prompted open-world detection and segmentation, achieving 48.7 mean AP on SegInW zero-shot with the base detector and huge segmenter.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
MMBench: Is Your Multi-modal Model an All-around Player?
cs.CV 2023-07 accept novelty 6.0

MMBench is a new bilingual benchmark that uses curated questions, CircularEval, and LLM-assisted answer conversion to provide objective, fine-grained evaluation of vision-language models.
Kosmos-2: Grounding Multimodal Large Language Models to the World
cs.CL 2023-06 unverdicted novelty 6.0

Kosmos-2 grounds text to image regions by encoding refer expressions as Markdown links to sequences of location tokens and trains on a new GrIT dataset of grounded image-text pairs.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
cs.CV 2023-06 unverdicted novelty 6.0

MME is a manually annotated benchmark evaluating MLLMs on perception and cognition across 14 subtasks to avoid data leakage and support fair model comparisons.
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition
cs.CV 2026-05 unverdicted novelty 5.0

A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.
SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision
cs.CV 2026-05 unverdicted novelty 5.0

SuperFace refines ARKit facial expression estimation by using human preference feedback on rendered faces to optimize beyond noisy pseudo-label supervision from capture software.
Bolek: A Multimodal Language Model for Molecular Reasoning
cs.LG 2026-05 unverdicted novelty 5.0

Bolek injects Morgan fingerprint embeddings into an instruction-tuned text model, then fine-tunes on molecular alignment and synthetic chain-of-thought tasks to improve performance and grounding on 15 TDC binary class...
Qwen3.5-Omni Technical Report
cs.CL 2026-04 unverdicted novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding mul...
Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models
cs.AI 2026-04 unverdicted novelty 5.0

Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.
ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
cs.CV 2026-04 unverdicted novelty 5.0

ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation
cs.CV 2026-05 unverdicted novelty 4.0

AtteConDA adds attention-based conflict suppression to multi-condition diffusion models so that generated driving-scene images retain richer structural cues from the original annotations.
Delineating Knowledge Boundaries for Honest Large Vision-Language Models
cs.CV 2026-04 unverdicted novelty 4.0

VLMs fine-tuned on a consistency-probed Visual-Idk dataset via SFT and preference optimization raise truthful rate from 57.9% to 67.3% and show internal evidence of genuine boundary recognition.
DAT: Dual-Aware Adaptive Transmission for Efficient Multimodal LLM Inference in Edge-Cloud Systems
cs.MM 2026-04 unverdicted novelty 4.0

DAT combines a small-large model cascade with fine-tuning and bandwidth-aware multi-stream transmission to deliver high-accuracy event recognition and low-latency alerts for video streams in edge-cloud systems.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA 3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 67 Pith papers · 18 internal anchors

[1]

Langchain

Langchain. https://github.com/hwchase17/langchain, 2022. 2

work page 2022
[2]

Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. 2, 4

work page internal anchor Pith review arXiv 2022
[3]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018. 2

work page 2018
[4]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021. 1

work page internal anchor Pith review arXiv 2021
[5]

Openflamingo, March 2023

Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. Openflamingo, March 2023. 2, 6, 7

work page 2023
[6]

InstructPix2Pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. InstructPix2Pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022. 2

work page arXiv 2022
[7]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020. 2

work page 2020
[8]

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 2

work page 2021
[9]

Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. 1, 2, 4, 5, 6

work page 2023
[10]

PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022. 2

work page internal anchor Pith review arXiv 2022
[12]

Computer vision in the wild

CVinW. Computer vision in the wild. https://github.com/Computer-Vision-in-the-Wild/CVinW_Readings, 2022. 1

work page 2022
[13]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Reinforce data, multiply impact: Improved model accuracy and robustness with dataset reinforcement

Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, and Oncel Tuzel. Reinforce data, multiply impact: Improved model accuracy and robustness with dataset reinforcement. arXiv preprint arXiv:2303.08983, 2023. 2

work page arXiv 2023
[15]

Make-a-scene: Scene-based text-to-image generation with human priors

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. ArXiv, abs/2203.13131,

work page arXiv
[16]

Vision-language pre-training: Basics, recent advances, and future trends

Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao, et al. Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision, 2022. 1

work page 2022
[17]

Chatgpt outperforms crowd-workers for text-annotation tasks

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023. 3

work page arXiv 2023
[18]

Visual Programming: Compositional visual reasoning without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. arXiv preprint arXiv:2211.11559, 2022. 2

work page arXiv 2022
[19]

Towards learning a generic agent for vision-and-language navigation via pre-training

Weituo Hao, Chunyuan Li, Xiujun Li, Lawrence Carin, and Jianfeng Gao. Towards learning a generic agent for vision-and-language navigation via pre-training. In CVPR, 2020. 2

work page 2020
[20]

Language is not all you need: Aligning perception with language models

Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023. 2

work page arXiv 2023
[21]

Openclip

Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip. July 2021. 1

work page 2021
[22]

OPT-IML: Scaling language model instruction meta learning through the lens of generalization

Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Dániel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017,

work page arXiv
[23]

Visual prompt tuning

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022. 2

work page 2022
[24]

Grounding language models to images for multimodal generation

Jing Yu Koh, Ruslan Salakhutdinov, and Daniel Fried. Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823, 2023. 2

work page arXiv 2023
[25]

Language-driven semantic segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. ICLR, 2022. 1

work page 2022
[26]

Multimodal foundation models: From specialists to general-purpose assistants

Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. Multimodal foundation models: From specialists to general-purpose assistants. arXiv preprint arXiv:2309.10020, 2023. 1

work page arXiv 2023
[27]

ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models

Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks, 2022. 1

work page 2022
[28]

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. 1, 2, 4, 6, 7

work page internal anchor Pith review arXiv 2023
[29]

Grounded language-image pre-training

Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In CVPR, 2022. 1

work page 2022
[30]

Gligen: Open-set grounded text-to-image generation

Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023. 1

work page arXiv 2023
[31]

Microsoft COCO: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV,

work page
[32]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023. 9, 14

work page 2023
[33]

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023. 1

work page Pith review arXiv 2023
[34]

Learn to explain: Multimodal reasoning via thought chains for science question answering

Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems,

work page
[35]

OpenAI. ChatGPT. https://openai.com/blog/chatgpt/, 2023. 1, 2

work page 2023
[36]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023. 1, 5, 6, 15

work page 2023
[37]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. 2

work page 2022
[38]

Instruction Tuning with GPT-4

Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277, 2023. 1, 4

work page internal anchor Pith review arXiv 2023
[39]

Combined scaling for zero-shot transfer learning

Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scaling for open-vocabulary image classification. arXiv preprint arXiv:2111.10050, 2021. 1

work page arXiv 2021
[40]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021. 1, 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2021
[41]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020. 2

work page 2020
[42]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv, abs/2204.06125, 2022. 1

work page internal anchor Pith review Pith/arXiv arXiv 2022
[43]

High-resolution image synthesis with latent diffusion models

Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, pages 10674–10685, 2022. 1

work page 2022
[44]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487,

work page internal anchor Pith review arXiv
[45]

LAION-5B: An open large-scale dataset for training next generation image-text models

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022. 2

work page internal anchor Pith review arXiv 2022
[46]

Vipergpt: Visual inference via python execution for reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023. 2

work page arXiv 2023
[47]

Habitat 2.0: Training home assistants to rearrange their habitat

Andrew Szot, Alex Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, and Dhruv Batra. Habitat 2.0: Training home assist...

work page 2021
[48]

Stanford Alpaca: An instruction-following LLaMA model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 1, 4

work page 2023
[49]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

GIT: A generative image-to-text transformer for vision and language

Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022. 1

work page arXiv 2022
[51]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022. 2

work page internal anchor Pith review arXiv 2022
[52]

Benchmarking generalization via in-context instructions on 1,600+ language tasks

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705, 2022. 2

work page arXiv 2022
[53]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023. 2

work page internal anchor Pith review arXiv 2023
[54]

Unified contrastive learning in image-text-label space

Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Lu Yuan, Ce Liu, and Jianfeng Gao. Unified contrastive learning in image-text-label space. CVPR, 2022. 1

work page 2022
[55]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023. 2

work page internal anchor Pith review arXiv 2023
[56]

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. ArXiv, abs/2206.10789, 2022. 1

work page internal anchor Pith review arXiv 2022
[57]

Florence: A new foundation model for computer vision

Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 1

work page arXiv 2021
[58]

A simple framework for open-vocabulary segmentation and detection

Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, and Lei Zhang. A simple framework for open-vocabulary segmentation and detection. arXiv preprint arXiv:2303.08131, 2023. 1

work page arXiv 2023
[59]

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023. 2, 8, 9

work page Pith review arXiv 2023
[60]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[61]

Multimodal Chain-of-Thought Reasoning in Language Models

Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023. 8, 9

work page internal anchor Pith review arXiv 2023
[62]

Regionclip: Region-based language-image pretraining

Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022. 1

work page 2022
[63]

Generalized decoding for pixel, image, and language

Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. arXiv preprint arXiv:2212.11270, 2022. 1, 2

work page arXiv 2022
[64]

Gently toss them together to mix

In a large bowl, combine the sliced strawberries, blueberries, and carrot. Gently toss them together to mix

work page
[65]

Drizzle the juice of half a lemon over the fruit and vegetables to prevent them from oxidizing and add a refreshing citrus flavor

work page
[66]

Sprinkle the chopped fresh parsley or mint over the fruit and gently toss once more

work page
[67]

Adjust the amount according to your taste preferences

If desired, add a drizzle of honey or maple syrup for a touch of sweetness. Adjust the amount according to your taste preferences

work page
[68]

This allows the flavors to meld and creates a more refreshing salad

Cover the bowl and refrigerate the fruit salad for at least 30 minutes or up to a few hours. This allows the flavors to meld and creates a more refreshing salad

work page
[69]

This fruit salad recipe is perfect for a healthy snack or side dish, and with the variety of ingredients available in your refrigerator, it should be easy to put together quickly

Serve the fruit salad in cups or bowls, garnished with a sprinkle of fresh parsley or mint, if desired. This fruit salad recipe is perfect for a healthy snack or side dish, and with the variety of ingredients available in your refrigerator, it should be easy to put together quickly. I would like to do the fruit salad. Show me the recipe. LLaVA LLaVA User ...

work page
[70]

Instructions to launch the demo: link

work page
[71]

All prompts and few shot examples for querying GPT-4: link

work page
[72]

LLaVA-Instruct-158K: link

work page
[73]

LLaVA-Bench: COCO, In-The-Wild

work page
[74]

Describe the image concisely

Model checkpoints. The size of the model checkpoints after compression is 25GB, which exceeds the 5GB limit of GitHub LFS (Large File Storage). We’ll release the checkpoint to the public, or upon request with reviewers for this submission. E Data Instructions for brief image description. The list of instructions used to briefly describe the image content ...

work page
[75]

It is a popular recreational and competitive sport that focuses on endurance, balance, and aerobic fitness

Cross-Country Skiing: Also known as Nordic skiing, cross-country skiing involves traversing relatively flat terrain or gentle slopes using skis and poles. It is a popular recreational and competitive sport that focuses on endurance, balance, and aerobic fitness. Cross-country skiers use their own locomotion to move across snow-covered terrain, rather than...

work page
[76]

This sport is more focused on speed, technique, and navigating challenging terrain, including steep slopes, moguls, and even jumps

Downhill Skiing: Also known as alpine skiing, downhill skiing involves descending slopes at high speeds using skis and poles for balance and control. This sport is more focused on speed, technique, and navigating challenging terrain, including steep slopes, moguls, and even jumps. Downhill skiing can be further categorized into several disciplines, such a...

work page