Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
Minerva: Evaluating complex video reasoning.arXiv preprint arXiv:2505.00681,
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
dataset 2polarities
use dataset 2representative citing papers
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
citing papers explorer
-
Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding
Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.
-
Act2See: Emergent Active Visual Perception for Video Reasoning
Act2See trains VLMs via supervised fine-tuning on verified reasoning traces to interleave active frame calls within text CoTs, yielding SOTA results on video reasoning benchmarks.
-
Seed1.8 Model Card: Towards Generalized Real-World Agency
Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.
-
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
-
Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research frontiers with pilot insights.
-
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.