In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. https://arxiv.org/abs/1809.01696
6 Pith papers cite this work.
abstract
Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.
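To ground the abstract's description, here is a minimal sketch of what a single TVQA-style example might contain, based only on what the abstract states (compositional multiple-choice questions over a clip, its subtitle dialogue, and a relevant time span); the field names and values are hypothetical illustrations, not the dataset's published schema:

```python
# Hypothetical TVQA-style record. All field names and values below are
# illustrative stand-ins, not the dataset's actual format.
example_qa = {
    "show": "friends",                      # one of the 6 TV shows
    "clip_id": "friends_s01e01_clip_00",    # made-up clip identifier
    "question": "What is Rachel holding when she talks to Monica?",
    "answers": [                            # candidate answers (multiple choice)
        "A cup of coffee",
        "A wet umbrella",
        "A cardboard box",
        "A photo album",
        "A menu",
    ],
    "answer_idx": 1,                        # index of the correct answer
    "ts": (12.4, 19.8),                     # relevant moment, in seconds
    "subtitles": [                          # timestamped dialogue lines
        {"start": 11.0, "end": 14.2,
         "text": "(Rachel) I got completely soaked out there."},
    ],
}
```

A system has to localize the `ts` span, read the `subtitles`, and recognize the visual concepts the question mentions, which is what makes the questions compositional.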
citing papers explorer
- MLVU: Benchmarking Multi-task Long Video Understanding
  MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
  LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO. (A sketch of the zero-init gating idea follows this list.)
- SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments
  The SpaMEM benchmark shows that multimodal LLMs succeed at spatial tasks with text histories but fail sharply at long-horizon belief maintenance from raw visual streams alone.
- One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
  XComp reaches extreme video compression (one token per selected frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9% to 46.2% after tuning on 2.5% of standard data. (A generic sketch of question-conditioned selection also follows this list.)
- LLaVA-Video: Video Instruction Tuning With Synthetic Data
  LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
- AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA
  AUDITA is a challenging audio QA benchmark where humans score 32% accuracy on average while state-of-the-art models score below 9%, using IRT to reveal systematic model deficiencies.
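For the LLaMA-Adapter entry above, here is a minimal sketch of how zero-init attention keeps the frozen model stable at the start of tuning: the learnable prompt tokens enter attention through a scalar gate initialized to zero, so the layer initially reproduces the frozen model's output. The function name, shapes, and the tanh on the gate are assumptions drawn from common implementations, not the paper's exact code; the causal mask is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def zero_init_gated_attention(q, k_ctx, v_ctx, k_prm, v_prm, gate):
    """Single-head attention with zero-init gated adapter prompts (sketch).

    q, k_ctx, v_ctx: (T, d) queries/keys/values of the ordinary tokens
    k_prm, v_prm:    (P, d) keys/values of the learnable adapter prompts
    gate:            scalar tensor parameter initialized to 0.0
    """
    d = q.shape[-1]
    scores_ctx = q @ k_ctx.T / d ** 0.5     # (T, T); causal mask omitted
    scores_prm = q @ k_prm.T / d ** 0.5     # (T, P)
    # Softmax the two halves separately and scale the prompt half by
    # tanh(gate). With gate == 0 the prompts contribute nothing, so
    # training starts exactly from the frozen model's behavior.
    attn_ctx = F.softmax(scores_ctx, dim=-1)
    attn_prm = torch.tanh(gate) * F.softmax(scores_prm, dim=-1)
    return attn_ctx @ v_ctx + attn_prm @ v_prm

# Usage with random tensors: at gate == 0 the output ignores the prompts.
T, P, d = 8, 4, 16
out = zero_init_gated_attention(
    torch.randn(T, d), torch.randn(T, d), torch.randn(T, d),
    torch.randn(P, d), torch.randn(P, d), gate=torch.zeros(()))
```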
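And as a rough illustration of the recipe the XComp entry describes (question-conditioned frame selection with one token per kept frame), here is a generic sketch; the cosine scoring and mean-pooling are stand-ins chosen for clarity, not XComp's learnable progressive compression module.

```python
import torch
import torch.nn.functional as F

def select_and_compress(frame_tokens, question_emb, k=16):
    """Keep the k frames most relevant to the question, one token per frame.
    Generic sketch of the idea, not XComp's actual method.

    frame_tokens: (N, P, d) patch tokens for N uniformly sampled frames
    question_emb: (d,)      pooled embedding of the question text
    """
    frame_emb = frame_tokens.mean(dim=1)                   # (N, d): one token per frame
    scores = F.cosine_similarity(                          # relevance to the question
        frame_emb, question_emb.unsqueeze(0), dim=-1)      # (N,)
    keep = scores.topk(min(k, len(scores))).indices
    keep = keep.sort().values                              # preserve temporal order
    return frame_emb[keep]                                 # (k, d) tokens fed to the LLM
```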