Frame-voyager: Learning to query frames for video large language models

Sicheng Yu, Chengkai Jin, Huanyu Wang, Zhenghao Chen, Sheng Jin, Zhongrong Zuo, Xiaolei Xu, Zhenbang Sun, Bingni Zhang, Jiawei Wu, et al. · 2024 · arXiv 2410.03226

8 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.

Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.

Swift Sampling: Selecting Temporal Surprises via Taylor Series

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.

Scaling Video Understanding via Compact Latent Multi-Agent Collaboration

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.

CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding

cs.CV · 2026-05-09 · unverdicted · novelty 4.0

CREST uses local curvature of query-frame relevance over time to select informative frames, outperforming a lightweight baseline and approaching a costly pipeline at far lower preprocessing cost on long-video benchmarks.

Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects

cs.CL · 2026-04-07 · unverdicted · novelty 4.0

A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.

citing papers explorer

Showing 1 of 1 citing paper after filters.

OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning

cs.CV · 2026-04-18 · unverdicted · none · ref 56

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
