STAR: A benchmark for situated reasoning in real-world videos

Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, Chuang Gan · 2024 · arXiv 2405.09711

Baseline reference. 60% of classified citations from Pith papers use this work as a benchmark or comparison.

12 Pith papers cite it.

citation-role summary

dataset: 3 · background: 2

citation-polarity summary

use dataset: 3 · background: 1 · unclear: 1

representative citing papers

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

cs.CV · 2026-05-11 · conditional · novelty 7.0 · 2 refs

TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

EvoVid proposes a temporal-centric self-evolution framework for Video-LLMs that uses temporal-aware Questioner and temporal-grounded Solver rewards to improve performance directly from unannotated videos.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs fail to identify visual preconditions or apply physical laws in kinematic physics tasks, as shown by new FACT diagnostics and NICE calibration methods evaluated on six state-of-the-art models.

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

VideoStir introduces a spatio-temporal graph-based structure and intent-aware retrieval for long-video RAG, achieving competitive performance with SOTA methods via a new IR-600K dataset.

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

cs.CV · 2024-12-06 · unverdicted · novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

LLaVA-Video: Video Instruction Tuning With Synthetic Data

cs.CV · 2024-10-03 · unverdicted · novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

cs.CV · 2026-04-16 · unverdicted · novelty 5.0 · 2 refs

Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

cs.AI · 2025-12-29 · unverdicted · novelty 5.0

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

citing papers explorer

Showing 12 of 12 citing papers.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
cs.CV · 2026-05-11 · unverdicted · none · ref 55

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models
cs.CV · 2026-05-11 · conditional · none · ref 41 · 2 links

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV · 2026-03-28 · unverdicted · none · ref 51

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
cs.CV · 2024-07-10 · unverdicted · none · ref 55

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models
cs.CV · 2026-05-21 · unverdicted · none · ref 25

OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL · 2026-05-17 · unverdicted · none · ref 83

NICE FACT: Diagnosing and Calibrating VLMs in Quantitative Reasoning for Kinematic Physics
cs.CV · 2026-05-08 · unverdicted · none · ref 4

VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
cs.CV · 2026-04-07 · unverdicted · none · ref 74

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV · 2024-12-06 · unverdicted · none · ref 260

LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV · 2024-10-03 · unverdicted · none · ref 135

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
cs.CV · 2026-04-16 · unverdicted · none · ref 55 · 2 links

Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
cs.AI · 2025-12-29 · unverdicted · none · ref 25
