hub Mixed citations

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Mixed citation behavior. Most common role is background (62%).

68 Pith papers citing it
Background 62% of classified citations
abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: https://github.com/PKU-YuanGroup/Video-LLaVA

citation-role summary

background 7 baseline 3 method 2 other 1

claims ledger

  • baseline "×" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybrid Image Understanding LLaVA-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaVA-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaVA [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaVA-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0
  • method backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million
  • background SigLIP outperforms the other two vision encoders, especially in fine-grained understanding tasks involving texts. Based on this ablation study, we choose the pretrained SigLIP as our base vision encoder, and then adapt it to taking dynamic resolutions as inputs. 5 Related Work Multimodal LLMs for Native Video Understanding. Early video MLLMs primarily relied on sparsely sampled frames and simple connectors, such as MLPs [12, 13, 139], discrete visual tokenizers [140], and Q-formers [141, 142], t
  • background SpatialLadder: Progressive training for spatial reasoning in vision-language models. arXiv, abs/2510.08531, 2025. [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, volume 202, pages 19730-19742, 2023. [32] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaVA: Learning united visual representation by alignment
  • method [74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united
  • background language models. In ECCV, 2024. 3 [37] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. In ICLR 2025 Workshop on Foundation Models in the Wild. 5 [38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alig

representative citing papers

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

cs.CV · 2025-12-11 · conditional · novelty 8.0

RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when watermarks are spoofed onto real videos.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

Grounding Video Reasoning in Physical Signals

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
