arXiv preprint arXiv:2511.20272 (2025)

Jiang, T · 2025 · arXiv 2511.20272

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Future-L1 interleaves latent visual spans with text in MLLM decoding, trained on a custom Future-L1-50K dataset via LA-DAPO RL, and reports SOTA gains on FutureBench (61.0 to 85.4) and TwiFF-Bench (2.44 to 3.04).

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

cs.CV · 2026-07-02 · unverdicted · novelty 4.0

TCA-Captioner introduces an Observer-Checker-Corrector refinement loop and TCA-Bench to address modality detachment and temporal incoherence in audiovisual video captioning.

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

cs.CV · 2026-06-10 · unverdicted · novelty 4.0

InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

cs.CV · 2026-06-04 · unverdicted · none · ref 30

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

cs.CV · 2026-07-02 · unverdicted · none · ref 29

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

cs.CV · 2026-06-10 · unverdicted · none · ref 289
