LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
13 Pith papers cite this work.
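The "language-based semantic alignment" in the title refers to binding each extra modality (video, audio, depth, infrared) to a shared language embedding space with a CLIP-style contrastive objective against a frozen language encoder. Below is a minimal sketch of that objective, assuming paired (modality, caption) batches; the `info_nce` function, tensor sizes, and random features are illustrative stand-ins, not the paper's exact recipe.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(modality_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: pull each modality embedding toward its
    paired caption embedding and away from the other captions in the batch."""
    m = l2_normalize(modality_emb)
    t = l2_normalize(text_emb)
    logits = m @ t.T / temperature                      # (B, B) similarities
    diag = np.arange(len(m))
    log_p_m2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t2m = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -0.5 * (log_p_m2t[diag, diag].mean() + log_p_t2m[diag, diag].mean())

# Each added modality gets its own encoder but is trained with this loss
# against the same frozen language embeddings, so every modality lands in one
# language-centric space.  Random features stand in for encoder outputs here.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(8, 512))   # hypothetical video-encoder outputs
text_emb = rng.normal(size=(8, 512))    # paired caption embeddings (frozen encoder)
print(info_nce(video_emb, text_emb))
```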
citing papers explorer
- StoryTR: Narrative-Centric Video Temporal Retrieval with Theory of Mind Reasoning
StoryTR is a new benchmark and agentic data pipeline that adds explicit Theory of Mind reasoning chains to train smaller video retrieval models, yielding a 15% relative IoU gain over larger baselines on narrative content.
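For temporal retrieval, IoU is measured between predicted and ground-truth time spans, so a "relative IoU gain" compares mean-IoU ratios rather than absolute points. A small sketch of the standard metric; the spans and the 0.40/0.46 numbers are made-up illustrations, not results from the paper.

```python
def temporal_iou(pred, gt):
    """IoU between two time spans given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted moment vs. the annotated ground-truth moment:
print(temporal_iou((10.0, 20.0), (14.0, 22.0)))        # 0.5

# A "15% relative gain" compares mean-IoU ratios (made-up numbers):
baseline_miou, model_miou = 0.40, 0.46
print((model_miou - baseline_miou) / baseline_miou)    # ≈ 0.15 relative improvement
```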
- Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.
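Visual semantic arithmetic carries the word-analogy idea ("king - man + woman ≈ queen") into image embedding space: combine two or three concept embeddings and retrieve the nearest image. A toy sketch of that underlying operation with hand-built embeddings; the concept names and vectors are hypothetical, and nothing here reflects SAri-RFT's GRPO fine-tuning loop.

```python
import numpy as np

def nearest(query, gallery, ids):
    """Return the id of the gallery embedding with highest cosine similarity."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return ids[int(np.argmax(g @ q))]

# Hand-built toy embeddings so the analogy holds exactly: each concept is a sum
# of a shared "person" component plus royalty / gender directions.
rng = np.random.default_rng(0)
person, royal, male = (rng.normal(size=256) for _ in range(3))
emb = {
    "king":  person + royal + male,
    "queen": person + royal - male,
    "man":   person + male,
    "woman": person - male,
}
ids = list(emb)
gallery = np.stack([emb[k] for k in ids])

# Three-term arithmetic: king - man + woman should land nearest to queen.
query = emb["king"] - emb["man"] + emb["woman"]
print(nearest(query, gallery, ids))   # queen
```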
- PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction
PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.
- MLVU: Benchmarking Multi-task Long Video Understanding
MLVU is a new benchmark for long video understanding that uses long videos of diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and that their performance degrades sharply as video duration grows.
- ReCoVR: Closing the Loop in Interactive Composed Video Retrieval
ReCoVR introduces a reflexive dual-pathway architecture for interactive composed video retrieval that outperforms baselines by combining intent routing with trajectory-level reflection on retrieval history.
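Composed video retrieval scores candidates against a query built from a reference video plus a modification text. The sketch below shows only that base formulation with a simple late-fusion blend; ReCoVR's intent routing and trajectory-level reflection are not modeled, and the embeddings and `alpha` weight are placeholders.

```python
import numpy as np

def compose_query(ref_video_emb, mod_text_emb, alpha=0.5):
    """Late-fusion composition: blend the reference-video and modification-text
    embeddings into one query vector (a common composed-retrieval baseline)."""
    q = alpha * ref_video_emb + (1 - alpha) * mod_text_emb
    return q / np.linalg.norm(q)

def rank_candidates(query, cand_embs):
    cand = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    return np.argsort(-(cand @ query))       # best-matching candidates first

rng = np.random.default_rng(1)
ref_video = rng.normal(size=256)             # video the user starts from
mod_text = rng.normal(size=256)              # e.g. "same scene, but at night"
candidates = rng.normal(size=(100, 256))     # gallery of candidate videos
print(rank_candidates(compose_query(ref_video, mod_text), candidates)[:5])
```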
- EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models
EmergentBridge enhances zero-shot cross-modal performance on unpaired modalities by learning noisy bridge anchors from existing alignments and enforcing proxy alignment only in the orthogonal subspace to avoid gradient interference.
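Restricting an update to "the orthogonal subspace" typically means projecting it onto the complement of directions that already carry the existing alignment, so the proxy objective cannot push against them. A generic sketch of that projection, assuming the protected directions are known; this illustrates the geometric idea, not EmergentBridge's actual training code.

```python
import numpy as np

def project_orthogonal(update, protected_dirs):
    """Strip from `update` any component in the span of `protected_dirs`,
    keeping only the part that lies in the orthogonal subspace."""
    basis, _, _ = np.linalg.svd(protected_dirs.T, full_matrices=False)
    return update - basis @ (basis.T @ update)

rng = np.random.default_rng(0)
protected = rng.normal(size=(2, 64))   # directions carrying the existing alignment
proxy_grad = rng.normal(size=64)       # gradient from the noisy proxy-alignment loss

safe_grad = project_orthogonal(proxy_grad, protected)
print(np.allclose(protected @ safe_grad, 0.0))   # True: no interference remains
```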
- Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
TG-DP decouples reconstruction and alignment objectives into separate paths with teacher guidance on visibility patterns, yielding SOTA zero-shot audio-video retrieval gains on AudioSet.
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
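"Alignment before projection" means the image and video encoders already live in one shared space (Video-LLaVA uses language-bound encoders for this), so a single projector can map either modality into the LLM. A minimal sketch of such a shared projector; the 1024/4096 dimensions, token counts, and two-layer MLP are assumptions for illustration rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class SharedProjector(nn.Module):
    """One projector for both images and videos: because the visual encoders are
    aligned into a common space *before* projection, a single MLP can map either
    modality's tokens into the LLM embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                                 nn.Linear(llm_dim, llm_dim))

    def forward(self, visual_tokens):        # (batch, n_tokens, vis_dim)
        return self.mlp(visual_tokens)       # (batch, n_tokens, llm_dim)

proj = SharedProjector()
image_tokens = torch.randn(2, 256, 1024)      # stand-in pre-aligned image features
video_tokens = torch.randn(2, 8 * 256, 1024)  # stand-in pre-aligned video features
print(proj(image_tokens).shape)               # torch.Size([2, 256, 4096])
print(proj(video_tokens).shape)               # torch.Size([2, 2048, 4096])
```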
- Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
Introduces the MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces, improving generalization in multimodal deepfake detection.
- UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
UniWorld-V1 shows that semantic features from large multimodal models enable unified visual understanding and generation, achieving strong results on perception and manipulation tasks with only 2.7 million training samples.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- ClimateVID -- Social Media Videos Analysis and Challenges Involved
Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.
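Minimum-cost multicut partitions a graph whose edges carry signed affinities (positive for "same cluster", negative for "different") without fixing the number of clusters in advance. A simplified greedy heuristic in that spirit is sketched below; the `bias` threshold and random stand-in features are assumptions, and a real pipeline would use actual DINOv2 or ConvNeXt V2 embeddings with an exact or stronger approximate solver.

```python
import numpy as np

def greedy_multicut(features, bias=0.5):
    """Greedy agglomeration in the spirit of minimum-cost multicut: cosine
    similarities shifted by `bias` act as signed edge weights, and clusters are
    merged while some pair still has positive total inter-cluster weight.
    (A simplified heuristic, not an exact multicut solver.)"""
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = x @ x.T - bias                          # signed affinities between videos
    np.fill_diagonal(w, 0.0)
    clusters = [[i] for i in range(len(x))]
    while True:
        best_gain, best_pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gain = sum(w[i, j] for i in clusters[a] for j in clusters[b])
                if gain > best_gain:
                    best_gain, best_pair = gain, (a, b)
        if best_pair is None:                   # every remaining merge would add cost
            return clusters
        a, b = best_pair
        clusters[a] += clusters.pop(b)

# Random features stand in for per-video embeddings; with real embeddings,
# videos with similar content can merge into shared clusters.
feats = np.random.default_rng(0).normal(size=(12, 384))
print(greedy_multicut(feats))
```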
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
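The STC connector compresses visual tokens with spatial-temporal convolution before they reach the language model. The sketch below keeps only that core idea, a 3D convolution that downsamples across frames and patches followed by a linear projection; the dimensions, strides, and class name are illustrative rather than VideoLLaMA 2's exact module.

```python
import torch
import torch.nn as nn

class STCConnectorSketch(nn.Module):
    """Simplified spatial-temporal convolution connector: a 3D convolution
    downsamples patch tokens jointly across time and space before a linear
    layer projects them into the LLM embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096, t_stride=2, s_stride=2):
        super().__init__()
        self.conv = nn.Conv3d(vis_dim, vis_dim,
                              kernel_size=(t_stride, s_stride, s_stride),
                              stride=(t_stride, s_stride, s_stride))
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, x):                    # x: (B, T, H, W, C) frame patch features
        x = x.permute(0, 4, 1, 2, 3)         # -> (B, C, T, H, W) for Conv3d
        x = self.conv(x)                     # merge neighbouring frames and patches
        x = x.flatten(2).transpose(1, 2)     # -> (B, n_tokens, C)
        return self.proj(x)                  # -> (B, n_tokens, llm_dim)

frames = torch.randn(1, 8, 16, 16, 1024)     # 8 frames of 16x16 ViT patch features
print(STCConnectorSketch()(frames).shape)    # torch.Size([1, 256, 4096])
```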