hub Mixed citations

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin · 2023 · cs.CV · arXiv 2311.10122

Mixed citation behavior. Most common role is background (62%).

68 Pith papers citing it

Background 62% of classified citations

open full Pith review browse 68 citing papers arXiv PDF

abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, our Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into the multi-modal inputs for the LLM. Code address: \href{https://github.com/PKU-YuanGroup/Video-LLaVA}

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 baseline 3 method 2 other 1

citation-polarity summary

background 8 baseline 3 unclear 1 use method 1

claims ledger

abstract The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the found
baseline "×" indicates the model is incapable of performing the task. Model Understanding Image Generation Image Editing MMBV MMBI MMMU MM-Vet GenEval WISE Overall Add Adjust Extract Replace Remove Hybird Image Understanding LLaV A-1.5 [25] × 36.4 67.8 36.3 × × × × × × × × × LLaV A-NeXT [57] × 79.3 51.1 57.4 × × × × × × × × × Image & Video Understanding Video-LLaV A [22] 1.05 60.9 32.8 32.0 × × × × × × × × × LLaV A-OV [17] 0.94 80.8 48.8 57.5 × × × × × × × × × Text-to-Image Generation SDXL [34] × × × × 0
method backbone [13], as illustrated in Figure 2. The edit instruction and the original image are jointly fed into VLM, while the image is processed simultaneously by the vision encoder. The hidden states of VLM and the visual feature of the vision encoder are separately projected by MLPs and then concatenated, forming the text-branch input to DiT. Training proceeds in two stages [41], first optimizing MLPs and then jointly fine-tuning FLUX and MLPs. 3.4 Dataset Statistics ImgEdit comprises 1.2 million
background SigLIP outperforms the other two vision encoders, especially in fine-grained understanding tasks involving texts. Based on this ablation study, we choose the pretrained SigLIP as our base vision encoder, and then adapt it to taking dynamic resolutions as inputs. 5 Related Work Multimodal LLMs for Native Video Understanding. Early video MLLMs primarily relied on sparsely sampled frames and simple connectors, such as MLPs [12, 13, 139], discrete visual tokenizers [140], and Q-formers [141, 142], t
background SpatialLadder: Progressive training for spatial reasoning in vision-language models.arXiv, abs/2510.08531, 2025. [31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InInternational Conference on Machine Learning, volume 202, pages 19730-19742, 2023. [32] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-LLaV A: Learning united visual representation by alignment
method [74] Yitong Li, Zhe Gan, Yelong Shen, Jingjing Liu, Yu Cheng, Yuexin Wu, Lawrence Carin, David Carlson, and Jianfeng Gao. Storygan: A sequential conditional gan for story visualization, 2019. 40 [75] Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning, 2023. 39 [76] Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united
background language models. InECCV, 2024. 3 [37] Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mapsparse: Accelerating pre-filling for long-context visual language models via modality-aware permutation sparse attention. InICLR 2025 Workshop on Foundation Models in the Wild. 5 [38] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alig

co-cited works

representative citing papers

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection

cs.CV · 2025-12-11 · conditional · novelty 8.0

RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when watermarks are spoofed onto real videos.

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.

WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language

cs.NI · 2026-05-13 · unverdicted · novelty 7.0

WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

Grounding Video Reasoning in Physical Signals

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning

cs.CV · 2026-01-22 · unverdicted · novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.

LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning

cs.CV · 2026-01-21 · conditional · novelty 7.0

LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

cs.CV · 2026-01-04 · unverdicted · novelty 7.0

GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

cs.CV · 2025-12-01 · unverdicted · novelty 7.0

StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.

Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?

cs.CV · 2025-11-27 · unverdicted · novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.

TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?

cs.CV · 2025-09-19 · unverdicted · novelty 7.0

Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

cs.CV · 2025-02-06 · unverdicted · novelty 7.0

WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.

PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

cs.CV · 2024-10-22 · accept · novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

cs.CV · 2026-05-20 · conditional · novelty 6.0

SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

citing papers explorer

Showing 50 of 68 citing papers.

RobustSora: De-Watermarked Benchmark for Robust AI-Generated Video Detection cs.CV · 2025-12-11 · conditional · none · ref 5 · internal anchor
RobustSora benchmark demonstrates that current AI video detectors rely heavily on visible watermarks, with average accuracy drops of 6.6 percentage points when watermarks are erased and increased false alarms when watermarks are spoofed onto real videos.
AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding cs.CV · 2026-05-14 · unverdicted · none · ref 23 · internal anchor
CoRDS selects a compact KV-cache subset via joint-space coreset coverage and log-det diversity to outperform token-wise heuristics on long-video VLM benchmarks.
WirelessSenseLLM: Zero-Shot Human Activity Understanding by Bridging Wireless Signals and Human Language cs.NI · 2026-05-13 · unverdicted · none · ref 16 · internal anchor
WirelessSenseLLM bridges unsegmented Wi-Fi CSI signals to LLMs via a CSI-to-Language Adapter for zero-shot human activity understanding and reasoning.
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 56 · internal anchor
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
EyeCue: Driver Cognitive Distraction Detection via Gaze-Empowered Egocentric Video Understanding cs.CV · 2026-05-08 · unverdicted · none · ref 28 · internal anchor
EyeCue detects driver cognitive distraction by modeling gaze-visual context interactions in egocentric videos and achieves 74.38% accuracy on the new CogDrive dataset, outperforming 11 baselines.
LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models cs.CV · 2026-04-27 · unverdicted · none · ref 13 · internal anchor
LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
Grounding Video Reasoning in Physical Signals cs.CV · 2026-04-23 · unverdicted · none · ref 15 · internal anchor
A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robustness and weak spatial performance.
Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic cs.AI · 2026-04-21 · unverdicted · none · ref 12 · internal anchor
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 17 · internal anchor
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CV · 2026-03-28 · unverdicted · none · ref 26 · internal anchor
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning cs.CV · 2026-01-22 · unverdicted · none · ref 19 · internal anchor
VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning cs.CV · 2026-01-21 · conditional · none · ref 16 · internal anchor
LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV · 2026-01-04 · unverdicted · none · ref 38 · internal anchor
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos cs.CV · 2025-12-01 · unverdicted · none · ref 12 · internal anchor
StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance? cs.CV · 2025-11-27 · unverdicted · none · ref 38 · internal anchor
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies? cs.CV · 2025-09-19 · unverdicted · none · ref 23 · internal anchor
Introduces TennisTV benchmark for evaluating 17 MLLMs on tennis video understanding from stroke-level to rally-level tasks with automated pipelines and human verification.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 35 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction cs.CV · 2024-10-22 · accept · none · ref 27 · internal anchor
PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 28 · internal anchor
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly cs.CV · 2026-05-20 · unverdicted · none · ref 14 · internal anchor
Flat-Pack Bench is a new evaluation suite that shows state-of-the-art LVLMs perform poorly on nuanced spatio-temporal reasoning required for furniture assembly videos.
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models cs.CV · 2026-05-20 · conditional · none · ref 26 · internal anchor
SPpruner reduces visual tokens in VLMs via focus identification followed by context-aware scanning, retaining 22.2% tokens for 2.53x speedup on Qwen2.5-VL with negligible accuracy loss.
Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 52 · internal anchor
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 7 · internal anchor
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs cs.CV · 2026-05-10 · unverdicted · none · ref 32 · internal anchor
SpaceMind++ adds an explicit voxelized allocentric cognitive map and coordinate-guided fusion to video MLLMs, claiming SOTA on VSI-Bench and improved out-of-distribution generalization on three other 3D benchmarks.
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization cs.CV · 2026-05-04 · unverdicted · none · ref 23 · internal anchor
WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection cs.CV · 2026-04-27 · unverdicted · none · ref 30 · internal anchor
ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels cs.LG · 2026-04-17 · unverdicted · none · ref 10 · internal anchor
UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 38 · internal anchor
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
ViLL-E: Video LLM Embeddings for Retrieval cs.CV · 2026-04-13 · unverdicted · none · ref 28 · internal anchor
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning cs.AI · 2026-04-13 · unverdicted · none · ref 22 · internal anchor
CFMS is a coarse-to-fine framework that uses MLLMs to create a multi-perspective knowledge tuple as a reasoning map for symbolic table operations, yielding competitive accuracy on WikiTQ and TabFact.
Spatio-Temporal Grounding of Large Language Models from Perception Streams cs.RO · 2026-04-08 · unverdicted · none · ref 17 · internal anchor
FESTS uses Spatial Regular Expressions compiled from queries to generate 27k training tuples that raise a 3B-parameter LLM's frame-level F1 on spatio-temporal video reasoning from 48.5% to 87.5%, matching GPT-4.1 while staying far smaller.
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 6 · internal anchor
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
ForestPrune: High-ratio Visual Token Compression for Video Multimodal Large Language Models via Spatial-Temporal Forest Modeling cs.CV · 2026-03-24 · unverdicted · none · ref 29 · internal anchor
ForestPrune prunes 90% of visual tokens in video MLLMs like LLaVA-OneVision while retaining 95.8% accuracy by modeling tokens as spatial-temporal forests and scoring importance via tree depth and node roles.
Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy cs.AI · 2026-03-02 · unverdicted · none · ref 35 · internal anchor
Nano-EmoX is a compact 2.2B multimodal model that unifies six core affective tasks across perception, understanding, and interaction levels via a curriculum framework, achieving competitive benchmark performance.
Structure Over Scale: Learning Visual Reasoning from Pedagogical Video cs.CV · 2026-01-30 · unverdicted · none · ref 5 · internal anchor
Fine-tuning VLMs on 10K QA pairs from pedagogical children's videos produces consistent gains on NExT-QA, Video-MME, and MotionBench, indicating that explicit structure can substitute for data scale.
Streaming Video Instruction Tuning cs.CV · 2025-12-24 · unverdicted · none · ref 2 · internal anchor
Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval cs.CV · 2025-12-09 · unverdicted · none · ref 24 · internal anchor
OneClip-RAG enables MLLMs to handle long videos via one-shot clip retrieval and unified chunking-retrieval, delivering performance gains like matching GPT-5 level on MLVU with high efficiency on standard GPUs.
GA2-CLIP: Generic Attribute Anchor for Efficient Prompt Tuningin Video-Language Models cs.CV · 2025-11-27 · unverdicted · none · ref 24 · internal anchor
GA2-CLIP uses generic attribute anchors and coupled hard-soft prompts to preserve generalization in prompt-tuned video-language models on base-to-new class tasks.
Boosting Reasoning in Large Multimodal Models via Activation Replay cs.CV · 2025-11-25 · unverdicted · none · ref 21 · internal anchor
Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence cs.CV · 2025-05-29 · unverdicted · none · ref 9 · 2 links · internal anchor
Spatial-MLLM adds a 3D spatial encoder initialized from a visual geometry model and space-aware frame sampling to MLLMs to improve spatial understanding and reasoning from purely 2D visual inputs.
ImgEdit: A Unified Image Editing Dataset and Benchmark cs.CV · 2025-05-26 · conditional · none · ref 41 · internal anchor
ImgEdit supplies 1.2 million curated edit pairs and a three-part benchmark that let a VLM-based model outperform prior open-source editors on adherence, quality, and detail preservation.
LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval cs.CV · 2025-05-21 · unverdicted · none · ref 17 · internal anchor
LiveVLM introduces VSB and PaR to compress and retrieve KV cache in streaming video LLMs, enabling LLaVA-OneVision to reach SOTA accuracy among training-free query-agnostic and training-based online models.
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO cs.CV · 2025-03-12 · unverdicted · none · ref 9 · internal anchor
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 53 · internal anchor
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 16 · internal anchor
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos cs.CV · 2024-08-19 · unverdicted · none · ref 20 · internal anchor
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 35 · internal anchor
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
TempCompass: Do Video LLMs Really Understand Videos? cs.CV · 2024-03-01 · unverdicted · none · ref 101 · internal anchor
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models cs.CV · 2024-01-29 · conditional · none · ref 21 · internal anchor
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer