Tarsier: Recipes for training and evaluating large video description models

Jiawei Wang, Liping Yuan, Yuchen Zhang, Haomiao Sun · 2024 · arXiv 2407.00634

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

LLaVA-Video: Video Instruction Tuning With Synthetic Data

cs.CV · 2024-10-03 · unverdicted · novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

cs.CV · 2024-06-11 · unverdicted · novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

citing papers explorer

Showing 6 of 6 citing papers.

Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction cs.CV · 2026-05-14 · unverdicted · none · ref 100
CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
Building a Precise Video Language with Human-AI Oversight cs.CV · 2026-04-22 · unverdicted · none · ref 61
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding cs.CV · 2026-04-03 · unverdicted · none · ref 9
ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 48
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 140
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs cs.CV · 2024-06-11 · unverdicted · none · ref 48
VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.

Tarsier: Recipes for training and evaluating large video description models

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer