pith. machine review for the scientific record.

arxiv: 2605.08412 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

SYNCR: A Cross-Video Reasoning Benchmark with Synthetic Grounding

Authors on Pith no claims yet

Pith reviewed 2026-05-12 00:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal large language models · cross-video reasoning · synthetic benchmarks · spatial tracking · temporal alignment · physical reasoning · video understanding · MLLM evaluation

The pith

The best current MLLM reaches only 52.5 percent accuracy on cross-video reasoning tasks, compared with an 89.5 percent human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SYNCR as a new synthetic benchmark to measure how well multimodal large language models can reason across multiple independent video streams. It generates controlled videos with exact spatial, temporal, and physical ground truth using simulators, then defines eight tasks that test temporal alignment, spatial tracking, comparative reasoning, and holistic synthesis. Zero-shot evaluations show leading models fall well below human performance, with the largest gaps appearing in precise physical and spatial judgments rather than simple ordering. The work also finds that scaling model size and adding reasoning-focused training improve some temporal skills but leave fine-grained tracking and global synthesis largely unaddressed. A preliminary comparison suggests that performance patterns on SYNCR partly match trends seen on real-world multi-video tests.
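
The sim-to-real comparison mentioned above is, at bottom, a rank-correlation check: score each model on a SYNCR task and on a real-world multi-video benchmark, then ask whether the model rankings agree. A self-contained sketch with made-up scores (none of the numbers or helper names below come from the paper):

```python
def rankdata(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical per-model accuracies (%) on a SYNCR task and on a
# real-world multi-video benchmark; values are purely illustrative.
syncr_scores = [52.5, 48.1, 41.0, 33.2, 27.9]
real_scores = [61.0, 57.4, 50.2, 45.8, 39.1]
print(spearman_rho(syncr_scores, real_scores))  # → 1.0
```

Rank correlation is the natural tool here because synthetic and real benchmarks differ in absolute difficulty; only the ordering of models is expected to transfer.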

Core claim

SYNCR supplies 8,163 multi-video question-answer pairs across 9,650 unique synthetic videos, each with programmatically verified grounding, to diagnose MLLM capabilities along four diagnostic pillars. The best evaluated model attains 52.5 percent average accuracy while humans reach 89.5 percent, performing adequately on temporal ordering yet scoring only 26.0 percent on Kinematic Comparison. Parameter scaling and reasoning post-training strengthen temporal alignment but do not consistently improve physical tracking or global spatial synthesis. Exploratory analysis indicates that several SYNCR tasks track model-level trends on existing real-world multi-video benchmarks.

What carries the argument

The SYNCR benchmark, which generates synthetic videos via Habitat, Kubric, and CLEVRER simulators and supplies programmatically verified ground truth for eight cross-video tasks.

Load-bearing premise

The synthetic videos and programmatically defined tasks accurately capture the core reasoning challenges that arise in real-world multi-video scenarios.

What would settle it

Demonstrating that models scoring high on SYNCR perform poorly on real-world multi-video benchmarks, or that human accuracy on SYNCR fails to predict human accuracy on equivalent real footage, would undermine the benchmark's claimed diagnostic value.

Figures

Figures reproduced from arXiv: 2605.08412 by Farshad Khorrami, Prashanth Krishnamurthy, Sara Ghazanfari, Siddharth Garg.

Figure 1
Figure 1. The SYNCR benchmark framework. SYNCR evaluates cross-video reasoning through four diagnostic pillars. Temporal Alignment tests synchronization and chronological ordering across unaligned streams. Spatial Tracking evaluates object permanence and cross-view geometry. Comparative Reasoning measures relative physical or numerical properties across videos. Holistic Synthesis requires integrating fragmented obs… view at source ↗
Figure 2
Figure 2. Effect of reasoning-specialized post-training. Qwen3-VL-32B-Thinking improves Temporal Alignment and average accuracy over Qwen3-VL-32B-Instruct, but does not improve Comparative Reasoning, Spatial Tracking, or Holistic Synthesis. (Adjacent scaling plot: average accuracy in percent vs. log model scale in billion parameters, with Gemini3 Flash at 52.5 and GPT-5.4 at 50.5.) view at source ↗
Figure 4
Figure 4. Plateaued scaling on SYNCR challenging tasks. Certain complex reasoning tasks resist the positive scaling trend, revealing persistent bottlenecks in physical and spatial-temporal reasoning. view at source ↗
Figure 5
Figure 5. Deterministic Ground Truth Generation in Habitat. Habitat provides synchronized RGB, depth, and semantic observations for each rendered camera pose. SYNCR uses these simulator-derived annotations to construct pixel-level object visibility, semantic instance labels, and navigation-based spatial structure for Habitat-based tasks. view at source ↗
Figure 6
Figure 6. Habitat Object Re-identification prompt template. The model is given two Habitat videos and must identify the first timestamp window in Video 2 where the same semantic instance appears. view at source ↗
Figure 7
Figure 7. Habitat Object Counting prompt template. The model must aggregate observations across multiple Habitat videos and count unique semantic instances without double-counting repeated views of the same object. view at source ↗
Figure 8
Figure 8. Habitat Route Planning prompt template. The model must infer the shortest navigable path between two regions by integrating partial spatial observations across multiple Habitat videos. view at source ↗
Figure 9
Figure 9. Kubric Multi-Angle Synchronization prompt template. The model is given three cropped videos of the same physical event from different camera angles and must infer the temporal offsets of Video 2 and Video 3 relative to Video 1. view at source ↗
Figure 10
Figure 10. Kubric Spatial Measurement prompt template. The model is given two camera views and must identify which candidate object is closest to a target object in simulator-derived 3D space at a visually anchored event time. view at source ↗
Figure 11
Figure 11. CLEVRER Sequential Ordering prompt template. The model is given shuffled temporal segments from one continuous CLEVRER event and must recover their original chronological order. view at source ↗
Figure 12
Figure 12. CLEVRER Kinematic Comparison prompt template. The model must compare object motion across two CLEVRER videos and select the object with the highest simulator-derived peak velocity. view at source ↗
Figure 13
Figure 13. CLEVRER Numerical Comparison prompt template. The model must compare collision counts across CLEVRER videos and identify both the video with the most collisions and the count difference relative to the second-highest video. view at source ↗
Figure 14
Figure 14. Prompt for evaluation. We use the same multiple-choice evaluation prompt for all SYNCR tasks. view at source ↗
read the original abstract

Multimodal Large Language Models (MLLMs) have made rapid progress in single-video understanding, yet their ability to reason across multiple independent video streams remains poorly understood. Existing multi-video benchmarks rely largely on human-annotated real-world footage, limiting the precision of spatial, temporal, and physical ground truth and making it difficult to diagnose model failures. We introduce SYNCR, a controlled synthetic benchmark for cross-video reasoning with programmatically verified grounding. Built using Habitat, Kubric, and CLEVRER simulator engines, SYNCR contains 8,163 multi-video question-answer pairs grounded in 9,650 unique videos. It evaluates MLLMs across eight tasks spanning four diagnostic pillars: Temporal Alignment, Spatial Tracking, Comparative Reasoning, and Holistic Synthesis. Our zero-shot evaluation of leading open- and closed-weight MLLMs reveals a substantial gap between current models and humans: the best model achieves only 52.5% average accuracy, compared to an 89.5% human baseline. Models perform relatively well on temporal ordering but struggle with precise physical and spatial reasoning, with the best model reaching only 26.0% accuracy on Kinematic Comparison. We further find that parameter scaling and reasoning-specialized post-training improve temporal alignment capabilities, but do not reliably address fine-grained physical tracking or global spatial synthesis. Finally, an exploratory sim-to-real correlation analysis suggests that several SYNCR tasks track model-level trends on real-world multi-video benchmarks, while also exposing reasoning capabilities underrepresented by existing evaluations. Code available at https://github.com/SaraGhazanfari/SYNCR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SYNCR, a synthetic benchmark for cross-video reasoning in MLLMs consisting of 8,163 programmatically verified QA pairs grounded in 9,650 videos generated via Habitat, Kubric, and CLEVRER engines. It defines eight tasks across four pillars (Temporal Alignment, Spatial Tracking, Comparative Reasoning, Holistic Synthesis) and reports zero-shot evaluations of leading MLLMs, with the best model at 52.5% average accuracy versus an 89.5% human baseline; models perform better on temporal ordering but poorly on physical/spatial tasks (e.g., 26.0% on Kinematic Comparison). The work also examines scaling and post-training effects and includes an exploratory sim-to-real correlation analysis linking several tasks to real multi-video trends.

Significance. If the synthetic construction and verification hold, SYNCR provides a controlled, scalable alternative to human-annotated real-world multi-video benchmarks, enabling precise isolation of reasoning failures in temporal, spatial, and physical domains that are difficult to diagnose otherwise. The reported 52.5% vs. 89.5% gap, task-specific breakdowns, and observation that scaling helps temporal alignment but not fine-grained tracking are actionable for model development. The public code release and sim-to-real exploratory analysis are strengths that support reproducibility and relevance.

major comments (2)
  1. [Evaluation section] The central claim of a substantial gap (52.5% best-model average vs. 89.5% human) and the 26.0% Kinematic Comparison result depend on the zero-shot protocol; the manuscript must detail how multiple independent video streams are tokenized and presented to each MLLM (e.g., concatenation order, frame sampling, prompt templates), because input formatting choices can confound whether the gap reflects reasoning deficits or interface limitations.
  2. [Benchmark construction] Programmatic verification across simulators is a key strength, yet the paper needs explicit pseudocode or rule sets for generating ground truth in the more complex pillars (Comparative Reasoning and Holistic Synthesis) to confirm that the tasks do not admit unintended shortcuts that models could exploit without true cross-video reasoning.
minor comments (3)
  1. [Abstract] The statement 'code available at https://github.com/SaraGhazanfari/SYNCR' should be accompanied by a note on whether the generated videos and QA pairs are also released, as this directly affects the benchmark's utility to the community.
  2. [Related work] The positioning against existing multi-video benchmarks would benefit from a table comparing scale, grounding precision, and task coverage rather than narrative description alone.
  3. [Figures] Task example visualizations should explicitly annotate which video stream corresponds to which question component, to improve readability when readers compare model failures across pillars.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. The comments identify opportunities to strengthen clarity in the evaluation protocol and benchmark details, which we address point by point below.

read point-by-point responses
  1. Referee: [Evaluation section] The central claim of a substantial gap (52.5% best-model average vs. 89.5% human) and the 26.0% Kinematic Comparison result depend on the zero-shot protocol; the manuscript must detail how multiple independent video streams are tokenized and presented to each MLLM (e.g., concatenation order, frame sampling, prompt templates), because input formatting choices can confound whether the gap reflects reasoning deficits or interface limitations.

    Authors: We agree that a complete specification of the zero-shot input protocol is necessary to support the reported performance gap. The original manuscript outlines the overall zero-shot setup and model inputs at a high level but does not enumerate the precise formatting choices. In the revised manuscript we will expand the Evaluation section with a new subsection that specifies: (i) concatenation order (Video 1 followed by Video 2 with an explicit separator token), (ii) frame sampling (uniform sampling of at most 8 frames per video at 1 FPS, each resized to the model's native resolution), and (iii) the exact prompt templates used for each task family. These choices were held constant across all evaluated models. Adding this information will make clear that the observed deficits arise from reasoning limitations rather than interface artifacts. revision: yes

  2. Referee: [Benchmark construction] Programmatic verification across simulators is a key strength, yet the paper needs explicit pseudocode or rule sets for generating ground truth in the more complex pillars (Comparative Reasoning and Holistic Synthesis) to confirm that the tasks do not admit unintended shortcuts that models could exploit without true cross-video reasoning.

    Authors: We concur that explicit rule sets for the more involved pillars improve transparency and help readers verify the absence of shortcuts. The original manuscript describes the overall programmatic verification pipeline and the four pillars at the task level but does not include pseudocode for Comparative Reasoning and Holistic Synthesis. In the revision we will add concise pseudocode and rule descriptions for these pillars (e.g., velocity-vector extraction and magnitude comparison for Kinematic Comparison; cross-video spatial-relation aggregation for Holistic Synthesis) in the Methods section or an appendix. Because the released code already implements these exact rules, the added text will simply make the paper self-contained while confirming that each task requires genuine integration of information across independent videos. revision: yes
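
The sampling rule described in the first response (uniform sampling of at most 8 frames per video, decoded at 1 FPS) can be written out in a few lines. This is an illustrative reconstruction, not the released evaluation code; the function name and return format are assumptions:

```python
def sample_frame_times(duration_s: float, fps: float = 1.0, max_frames: int = 8):
    """Timestamps for uniform sampling: decode at `fps`, then thin the
    resulting frame list uniformly down to at most `max_frames` entries."""
    # Candidate timestamps at 1 FPS: 0 s, 1 s, 2 s, ...
    candidates = [t / fps for t in range(int(duration_s * fps) + 1)]
    if len(candidates) <= max_frames:
        return candidates
    # Uniform thinning: pick max_frames indices evenly spaced over the list,
    # always keeping the first and last frame.
    step = (len(candidates) - 1) / (max_frames - 1)
    return [candidates[round(i * step)] for i in range(max_frames)]

# A 20 s clip at 1 FPS yields 21 candidate frames, thinned to 8.
print(sample_frame_times(20.0))  # → [0.0, 3.0, 6.0, 9.0, 11.0, 14.0, 17.0, 20.0]
```

Holding a rule like this fixed across all evaluated models, as the rebuttal states, is what lets the benchmark attribute accuracy differences to reasoning rather than to input formatting.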

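The ground-truth rule the rebuttal sketches for Kinematic Comparison (extract per-frame velocity vectors from the simulator, compare peak magnitudes) reduces to very little code. The data layout and helper names below are ours, assuming CLEVRER-style per-frame position annotations sampled at a fixed interval:

```python
import math

def peak_speed(trajectory, dt=1.0):
    """Max velocity magnitude from a list of (x, y, z) positions
    sampled every `dt` seconds, via finite differences."""
    speeds = []
    for p0, p1 in zip(trajectory, trajectory[1:]):
        speeds.append(math.dist(p0, p1) / dt)
    return max(speeds)

def kinematic_comparison_answer(objects):
    """Ground-truth answer: the object whose peak speed is highest.
    `objects` maps an object name to its position trajectory."""
    return max(objects, key=lambda name: peak_speed(objects[name]))

# Toy trajectories: the 'red sphere' moves twice as fast as the cube.
objects = {
    "red sphere": [(0, 0, 0), (2, 0, 0), (4, 0, 0)],
    "blue cube": [(0, 0, 0), (1, 0, 0), (2, 0, 0)],
}
print(kinematic_comparison_answer(objects))  # → red sphere
```

Because the answer is derived deterministically from simulator state, every question of this type is verifiable without human annotation, which is the property the referee asked the authors to document.
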
Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical benchmark creation and evaluation paper. It programmatically generates synthetic multi-video QA pairs using Habitat, Kubric, and CLEVRER simulators, evaluates leading MLLMs in zero-shot settings, and reports direct accuracy numbers against a human baseline (52.5% vs 89.5%). No mathematical derivations, fitted parameters, self-referential predictions, or load-bearing self-citations appear in the described content. The exploratory sim-to-real correlation is presented as supplementary observation rather than a deductive step that reduces to the paper's own inputs. The central claims rest on external model evaluations and human annotations, not on any internal redefinition or construction that would qualify as circular under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper's contribution rests on the assumption that simulator-based synthetic data can serve as reliable ground truth for reasoning evaluation; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Simulator engines (Habitat, Kubric, CLEVRER) produce accurate and verifiable spatial, temporal, and physical properties for multi-video scenarios.
    Invoked to justify the use of synthetic data for precise grounding instead of real-world footage.

pith-pipeline@v0.9.0 · 5599 in / 1169 out tokens · 62946 ms · 2026-05-12T00:48:12.457682+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 10 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report.ar...

  3. [3]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  4. [4]

    EMMA: Efficient Visual Alignment in Multi-Modal LLMs

    Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, and Farshad Khorrami. Emma: Efficient visual alignment in multi-modal llms.arXiv preprint arXiv:2410.02080, 2024

  5. [5]

    Towards Unified Benchmark and Models for Multi-Modal Perceptual Metrics

    Sara Ghazanfari, Siddharth Garg, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, and Francesco Croce. Towards unified benchmark and models for multi-modal perceptual metrics.arXiv preprint arXiv:2412.10594, 2024

  6. [6]

    Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning

    Sara Ghazanfari, Francesco Croce, Nicolas Flammarion, Prashanth Krishnamurthy, Farshad Khorrami, and Siddharth Garg. Chain-of-frames: Advancing video understanding in multimodal llms via frame-aware reasoning.arXiv preprint arXiv:2506.00318, 2025

  7. [7]

    On the Binding Problem in Artificial Neural Networks

    Klaus Greff, Sjoerd Van Steenkiste, and Jürgen Schmidhuber. On the binding problem in artificial neural networks.arXiv preprint arXiv:2012.05208, 2020

  8. [8]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3749–3761, 2022

  9. [9]

    Lift-attend-splat: Bird’s-eye-view camera-lidar fusion using transformers

    James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, and Romain Mueller. Lift-attend-splat: Bird’s-eye-view camera-lidar fusion using transformers. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4526–4536, 2024

  10. [10]

    World Models

    David Ha and Jürgen Schmidhuber. World models.arXiv preprint arXiv:1803.10122, 2(3):440, 2018

  11. [11]

    Extended Consciousness and Predictive Processing: A Third Wave View

    Michael D Kirchhoff and Julian Kiverstein.Extended consciousness and predictive processing: A third wave view. Routledge, 2019

  12. [12]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  13. [13]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  14. [14]

    Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models

    Jingyao Li, Jingyun Wang, Molin Tan, Haochen Wang, Cilin Yan, Likun Shi, Jiayin Cai, Xiaolong Jiang, and Yao Hu. Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, pages 6244–6252, 2026

  15. [15]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  16. [16]

    BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270, 2022

  17. [17]

    MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

    Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Gavin Chang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, et al. Mvu-eval: Towards multi-video understanding evaluation for multimodal llms.arXiv preprint arXiv:2511.07250, 2025

  18. [18]

    Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots

    Xavier Puig, Eric Undersander, Andrew Szot, Mikael Dallaire Cote, Tsung-Yen Yang, Ruslan Partsey, Ruta Desai, Alexander William Clegg, Michal Hlavac, So Yeon Min, et al. Habitat 3.0: A co-habitat for humans, avatars and robots.arXiv preprint arXiv:2310.13724, 2023

  19. [19]

    Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI

    Santhosh K Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alex Clegg, John Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, et al. Habitat-matterport 3d dataset (hm3d): 1000 large-scale 3d environments for embodied ai.arXiv preprint arXiv:2109.08238, 2021

  20. [20]

    A Simple Neural Network Module for Relational Reasoning

    Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning.Advances in neural information processing systems, 30, 2017

  21. [21]

    Habitat: A platform for embodied ai research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied ai research. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9339–9347, 2019

  22. [22]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267, 2025

  23. [23]

    Core Knowledge

    Elizabeth S Spelke and Katherine D Kinzler. Core knowledge.Developmental science, 10(1):89–96, 2007

  24. [24]

    Habitat 2.0: Training Home Assistants to Rearrange Their Habitat

    Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat.Advances in neural information processing systems, 34:251–266, 2021

  25. [25]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  26. [26]

    SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python

    Pauli Virtanen, Ralf Gommers, Travis E Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Scipy 1.0: fundamental algorithms for scientific computing in python.Nature methods, 17(3):261–272, 2020

  27. [27]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  28. [28]

    Multi-camera spatio-temporal fusion and biased sequence-data learning for security surveillance

    Gang Wu, Yi Wu, Long Jiao, Yuan-Fang Wang, and Edward Y Chang. Multi-camera spatio-temporal fusion and biased sequence-data learning for security surveillance. InProceedings of the eleventh ACM international conference on Multimedia, pages 528–538, 2003

  29. [29]

    LongVideoBench: A Benchmark for Long-Context Interleaved Video-Language Understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding.Advances in Neural Information Processing Systems, 37: 28828–28857, 2024

  30. [30]

    PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

    Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning.arXiv preprint arXiv:2404.16994, 2024

  31. [31]

    Clevrer: Collision events for video representation and reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019

  32. [32]

    MLVU: Benchmarking Multi-task Long Video Understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, and Zheng Liu. Mlvu: A comprehensive benchmark for multi-task long video understanding.arXiv preprint arXiv:2406.04264, 2(5):6, 2024

  33. [33]

    CVBench: Benchmarking Cross-Video Synergies for Complex Multimodal Reasoning

    Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, et al. Cvbench: Benchmarking cross-video synergies for complex multimodal reasoning. arXiv preprint arXiv:2508.19542, 2025
