EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
Frame-voyager: Learning to query frames for video large language models
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 8roles
method 1polarities
use method 1representative citing papers
GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
citing papers explorer
-
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.
-
GridProbe: Posterior-Probing for Adaptive Test-Time Compute in Long-Video VLMs
GridProbe uses posterior probing on a KxK frame grid to adaptively select question-relevant frames, delivering up to 3.36x TFLOPs reduction with accuracy within 1.6 pp of the full-frame baseline on Video-MME-v2.
-
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
-
Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Q-Gate dynamically routes keyframe selection in long videos via query-modulated gating across visual grounding, global matching, and contextual alignment experts to improve MLLM performance.
-
Swift Sampling: Selecting Temporal Surprises via Taylor Series
Swift Sampling is a training-free frame selection method that uses Taylor expansions on video latent trajectories to pick temporally surprising frames, outperforming uniform sampling on long-video QA tasks.
-
Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
-
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
- CREST: Curvature-Regulated Event-Centric Sampling for Efficient Long-Video Understanding