Frame-Voyager: Learning to Query Frames for Video Large Language Models
7 Pith papers cite this work. The 7 representative citing papers are listed below.
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
GridProbe uses posterior probing on a K×K frame grid to adaptively select question-relevant frames, delivering up to a 3.36× reduction in TFLOPs with accuracy within 1.6 percentage points of the full-frame baseline on Video-MME-v2 (a toy sketch of the idea follows this list).
-
CATS: Curvature Aware Temporal Selection for efficient long video understanding
CATS uses the temporal curvature of query-frame relevance to select informative frames, achieving 93-95% of heavy multi-stage accuracy at 3-4% of the preprocessing cost on long-video benchmarks (sketched after this list).
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on demand via intent-driven refinement, improving long-horizon accuracy and compositional reasoning under bounded token costs (a toy event-memory sketch follows the list).
-
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance (see the gating sketch after this list).
-
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
MACF decouples agent perception budgets from overall video length using latent token collaboration, scaling video understanding in MLLMs beyond current limits (a budget-decoupling sketch follows this list).
-
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decomposes the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
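
The GridProbe entry is concrete enough to illustrate. Below is a minimal Python sketch of posterior-probing frame selection, not GridProbe's actual method: cosine similarity stands in for the model's posterior probe, and `embed`, the 80% mass threshold, and all dimensions are illustrative assumptions.

```python
# Hypothetical sketch of posterior-probing frame selection (not GridProbe's
# actual code). Frames are probed as a K x K grid, a cheap probe scores each
# cell's relevance to the question, and only high-posterior cells would be
# re-decoded at full resolution.
import numpy as np

rng = np.random.default_rng(0)

K = 4                                # grid side: K*K frames probed per pass
D = 512                              # feature dim (stand-in for VLM features)

def embed(x):                        # placeholder encoder: unit-norm vectors
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = embed(rng.normal(size=D))            # question embedding (assumed)
frames = embed(rng.normal(size=(K * K, D)))  # one embedding per grid cell

# "Posterior" probe: softmax over query-frame similarity. In the paper this
# would come from the VLM itself; cosine similarity is a cheap stand-in.
logits = frames @ query
posterior = np.exp(logits) / np.exp(logits).sum()

# Adaptive selection: keep the smallest set of cells covering 80% of the
# posterior mass, so easy questions spend fewer full-resolution frames.
order = np.argsort(posterior)[::-1]
mass = np.cumsum(posterior[order])
selected = order[: int(np.searchsorted(mass, 0.8)) + 1]
print(f"re-decode {len(selected)}/{K * K} frames:", sorted(selected.tolist()))
```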
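CATS's curvature criterion admits a similar toy. In this sketch (again not the paper's code), per-frame relevance is faked with cosine similarity to a random query, curvature is the discrete second difference of the relevance curve, and `budget` is an assumed frame quota.

```python
# Hypothetical sketch of curvature-aware frame selection (not CATS's actual
# code): frames where the query-relevance curve bends sharply, i.e. has a
# large second difference, mark transitions worth keeping.
import numpy as np

rng = np.random.default_rng(0)
T, D, budget = 256, 512, 16          # frames, feature dim, frames to keep

def embed(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = embed(rng.normal(size=D))
frames = embed(rng.normal(size=(T, D)))

relevance = frames @ query                       # per-frame query relevance
# Discrete curvature: |s[t-1] - 2*s[t] + s[t+1]|, same length as the input.
curvature = np.abs(np.convolve(relevance, [1.0, -2.0, 1.0], mode="same"))

keep = np.sort(np.argsort(curvature)[-budget:])  # top-budget, in time order
print("selected frame indices:", keep.tolist())
```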
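For OASIS, the recoverable idea is a two-level memory: group the stream into events, rank events cheaply at query time, then score clips only inside the best events. The class below is a hypothetical toy under that reading; the drift threshold, centroid update, and retrieval depths are all assumptions, and the real system's intent-driven refinement is richer.

```python
# Hypothetical sketch of hierarchical event memory for streaming video (not
# OASIS's actual design): clips join the current event while they stay
# similar to its running centroid, and retrieval does a coarse pass over
# centroids before a fine pass inside the best-matching events.
import numpy as np

rng = np.random.default_rng(0)
D = 128                                   # feature dim (assumed)

def embed(x):
    return x / np.linalg.norm(x)

class EventMemory:
    def __init__(self, drift=0.8):
        self.events = []                  # list of [centroid, clip_list]
        self.drift = drift                # below this similarity: new event

    def ingest(self, clip):
        if self.events and float(clip @ self.events[-1][0]) >= self.drift:
            centroid, clips = self.events[-1]
            clips.append(clip)            # extend event, refresh centroid
            self.events[-1][0] = embed(np.mean(clips, axis=0))
        else:
            self.events.append([clip, [clip]])

    def retrieve(self, query, k_events=2, k_clips=3):
        # On-demand refinement: rank events cheaply, then score clips only
        # within the top events instead of over the whole stream.
        top = sorted(self.events, key=lambda e: -float(query @ e[0]))[:k_events]
        clips = [c for _, cs in top for c in cs]
        return sorted(clips, key=lambda c: -float(query @ c))[:k_clips]

mem = EventMemory()
state = rng.normal(size=D)
for t in range(60):                       # correlated stream with scene cuts
    if t and t % 15 == 0:
        state = rng.normal(size=D)        # scene change -> new event expected
    state = state + 0.1 * rng.normal(size=D)
    mem.ingest(embed(state))

query = embed(state + 0.1 * rng.normal(size=D))   # ask about the last scene
print(len(mem.events), "events;",
      len(mem.retrieve(query)), "clips retrieved for the query")
```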
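Q-Gate's routing can be pictured as a query-conditioned softmax gate mixing per-frame expert scores. In the stand-in below, the three "experts" are cheap heuristics echoing the summary's grounding / global matching / contextual alignment split, and the gate projection `W_g` is random rather than learned.

```python
# Hypothetical sketch of query-modulated expert gating for keyframe scoring
# (not Q-Gate's actual code): three scoring experts are mixed per query by a
# softmax gate, so grounding-style questions can weight experts differently
# from global-summary questions.
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 256                       # frames, feature dim

def embed(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

query = embed(rng.normal(size=D))
frames = embed(rng.normal(size=(T, D)))

# Stand-in experts; real ones would be learned modules.
grounding = frames @ query                            # per-frame query match
global_match = frames @ embed(frames.mean(axis=0))    # typicality in video
context = 0.5 * np.roll(grounding, 1) + 0.5 * np.roll(grounding, -1)  # neighbors

experts = np.stack([grounding, global_match, context])        # (3, T)

# Gate conditioned on the query via a (random, untrained) projection W_g.
W_g = rng.normal(size=(3, D)) / np.sqrt(D)
gate_logits = W_g @ query
gate = np.exp(gate_logits) / np.exp(gate_logits).sum()        # (3,)

scores = gate @ experts                                       # mixed scores
keyframes = np.sort(np.argsort(scores)[-8:])
print("gate weights:", np.round(gate, 3))
print("keyframes:", keyframes.tolist())
```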
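Finally, MACF's budget decoupling: if each of N agents compresses its chunk into a fixed number of latent tokens, the aggregate context is constant in video length. A toy sketch under that assumption, with mean pooling standing in for learned latent compression:

```python
# Hypothetical sketch of fixed-budget multi-agent perception (not MACF's
# actual code): each agent compresses its chunk of the video into a constant
# number of latent tokens, so the aggregate context size is independent of
# how long the video is.
import numpy as np

rng = np.random.default_rng(0)
T, D, agents, budget = 10_000, 256, 4, 32    # frames, dim, agents, tokens

frames = rng.normal(size=(T, D))

def compress(chunk, m):
    # Stand-in compressor: mean-pool m equal slices into m latent tokens.
    # A trained model would use cross-attention with m learned queries.
    return np.stack([s.mean(axis=0) for s in np.array_split(chunk, m)])

latents = np.concatenate(
    [compress(c, budget) for c in np.array_split(frames, agents)]
)
print("context tokens:", latents.shape[0], "(constant in T)")
```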