StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
hub
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
23 Pith papers cite this work. Polarity classification is still indexing.
abstract
The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their corresponding Language, naming as VIDAL-10M. In our VIDAL-10M, all videos are from short video platforms with complete semantics rather than truncated segments from long videos, and all the video, depth, infrared, and audio modalities are aligned to their textual descriptions. LanguageBind has achieved superior performance on a wide range of 15 benchmarks covering video, audio, depth, and infrared. Moreover, multiple experiments have provided evidence for the effectiveness of LanguageBind in achieving indirect alignment and complementarity among diverse modalities. Code address: https://github.com/PKU-YuanGroup/LanguageBind
hub tools
citation-role summary
citation-polarity summary
roles
background 4representative citing papers
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.
MMLandmarks supplies 197k aerial and 329k ground images plus text and GPS for 18,557 landmarks to benchmark multimodal geo-spatial understanding.
Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
Contrastive Fusion (ConFu) adds a fused-modality contrastive term to jointly align individual modalities and their combinations, enabling capture of higher-order dependencies like XOR relations while preserving pairwise alignments.
CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.
Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.
Open-Sora Plan presents an open-source large video generation model that combines a Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, and multi-dimensional data curation to achieve high-quality video outputs with public code and weights.
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
citing papers explorer
-
Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.