pith. machine review for the scientific record.

arxiv: 2604.20460 · v1 · submitted 2026-04-22 · 💻 cs.CV

Recognition: unknown

CCTVBench: Contrastive Consistency Traffic VideoQA Benchmark for Multimodal LLMs

Alois Knoll, André Schamschurko, Hao Guo, Hu Cao, Mingyu Liu, Rui Song, Walter Zimmer, Xingcheng Zhou

Pith reviewed 2026-05-10 01:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords contrastive consistency, traffic video QA, multimodal LLMs, counterfactual videos, video question answering, CCTVBench, contrastive decoding, safety-critical reasoning

The pith

Video LLMs show a large gap between standard QA scores and contrastive consistency on traffic safety questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety-critical traffic reasoning requires models to detect real hazards while reliably rejecting false but plausible alternatives under nearly identical scenes. It introduces CCTVBench to enforce this by pairing real accident videos with world-model-generated counterfactuals and structuring each test as a quadruple of minimally different, mutually exclusive questions. Experiments across multiple models find that conventional per-instance accuracy greatly overstates reliability, with weak rejection of incorrect options as the dominant failure. The work also presents C-TCD, which applies contrastive decoding using the paired counterfactual video to raise both accuracy and consistency.

Core claim

CCTVBench tests multimodal LLMs on traffic video question answering by requiring a single consistent decision pattern across each quadruple of real and counterfactual videos paired with mutually exclusive hypotheses. This exposes that models frequently omit true hazards, swap answers, hallucinate events, or violate mutual exclusivity even when they pass the same questions in isolation, while contrastive decoding that treats the exclusive counterpart video as a negative input during inference improves both instance-level QA and quadruple-level consistency.
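
To make the quadruple scoring concrete, here is a minimal sketch, not the paper's code, of the four-way decision pattern described in the abstract and the Figure 2 caption: the true hypothesis should be answered Yes on the real accident video, its exclusive counterpart No, and both questions No on the counterfactual video. Field and function names are illustrative.

```python
from dataclasses import dataclass

# Expected answers over one quadruple, following the pattern described in the paper:
# (v+, q+) -> Yes, (v+, q-) -> No, (v-, q+) -> No, (v-, q-) -> No.
EXPECTED = {
    ("real", "positive"): "yes",
    ("real", "counterpart"): "no",
    ("counterfactual", "positive"): "no",
    ("counterfactual", "counterpart"): "no",
}

@dataclass
class QuadrupleResult:
    answers: dict  # {(video, question): "yes" | "no"} for the four cells

    def per_instance_correct(self) -> float:
        """Standard per-instance accuracy: fraction of the four cells answered correctly."""
        hits = sum(self.answers[k] == v for k, v in EXPECTED.items())
        return hits / len(EXPECTED)

    def quad_consistent(self) -> bool:
        """Quadruple-level consistency: all four cells must match the expected pattern."""
        return all(self.answers[k] == v for k, v in EXPECTED.items())

# Example: the model finds the hazard but also asserts the exclusive alternative.
r = QuadrupleResult(answers={
    ("real", "positive"): "yes",
    ("real", "counterpart"): "yes",        # mutual-exclusivity violation
    ("counterfactual", "positive"): "no",
    ("counterfactual", "counterpart"): "no",
})
print(r.per_instance_correct())  # 0.75 -- looks acceptable per instance
print(r.quad_consistent())       # False -- fails the quadruple test
```

Averaging per_instance_correct over quadruples gives the familiar per-instance accuracy, while averaging quad_consistent gives a QuadAcc-style score; the gap the paper reports is the difference between these two aggregates.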

What carries the argument

CCTVBench, a benchmark of video-question quadruples that pairs each real accident video with a minimally different counterfactual counterpart and enforces structured, mutually exclusive answer choices over the set.

If this is right

  • Standard per-instance QA metrics substantially overestimate reliability for safety-critical traffic tasks.
  • Unreliable rejection of none-of-the-above options forms a central bottleneck in current video LLMs.
  • Failure types can be isolated into positive omission, positive swap, negative hallucination, and mutual-exclusivity violations.
  • Contrastive decoding that uses a semantically exclusive counterpart video raises both instance accuracy and quadruple-level consistency (a decoding sketch follows this list).
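
The material above does not spell out the exact C-TCD update, so the following is a hedged sketch at the logit level that follows the standard visual-contrastive-decoding form, with the semantically exclusive counterpart video as the contrast input; the model interface and the contrast strength alpha are illustrative.

```python
import torch

def ctcd_next_token_logits(model, question_ids, video_feats, contrast_video_feats, alpha=0.5):
    """Contrastive decoding sketch: amplify evidence the target video supports and
    suppress evidence equally supported by the semantically exclusive counterpart.

    Uses the common visual-contrastive-decoding form
        logits_adj = (1 + alpha) * logits(target) - alpha * logits(contrast);
    the paper's exact C-TCD formulation may differ.
    """
    with torch.no_grad():
        logits_pos = model(input_ids=question_ids, video=video_feats).logits[:, -1, :]
        logits_neg = model(input_ids=question_ids, video=contrast_video_feats).logits[:, -1, :]
    return (1 + alpha) * logits_pos - alpha * logits_neg

# A greedy decoding loop would then take torch.argmax over the adjusted logits at each step,
# typically restricted to tokens that remain plausible under the target-video distribution.
```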

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar quadruple-based contrastive tests could be constructed for other safety domains such as medical video or autonomous navigation to surface comparable consistency gaps.
  • If the generated counterfactuals prove sufficiently realistic, the benchmark structure offers a scalable way to probe whether models have learned causal distinctions rather than surface patterns.
  • Training objectives that directly penalize inconsistent answers across near-identical inputs may be needed to close the observed gap at the model level.

Load-bearing premise

The world-model-generated counterfactual videos are realistic enough and differ from the real videos only in the intended minimal respects without artifacts that change how the models respond.
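
The simulated rebuttal below commits to SSIM and optical-flow checks for exactly this premise. A minimal sketch of such an audit, assuming frame-aligned grayscale real/counterfactual pairs and using scikit-image and OpenCV, could look like this; the acceptance thresholds are left open.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def audit_pair(frames_real, frames_cf):
    """Rough minimality audit for one real/counterfactual pair.

    frames_real, frames_cf: lists of aligned grayscale frames (uint8 arrays of equal shape).
    Returns mean SSIM between corresponding frames plus mean optical-flow magnitude
    within each video, so large low-level discrepancies can be flagged for human review.
    """
    ssim_scores = [ssim(a, b) for a, b in zip(frames_real, frames_cf)]

    def mean_flow(frames):
        mags = []
        for prev, nxt in zip(frames[:-1], frames[1:]):
            flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            mags.append(np.linalg.norm(flow, axis=-1).mean())
        return float(np.mean(mags)) if mags else 0.0

    return {
        "mean_ssim": float(np.mean(ssim_scores)),
        "flow_real": mean_flow(frames_real),
        "flow_cf": mean_flow(frames_cf),
    }
```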

What would settle it

Evaluating the same models on a collection of real paired traffic videos that differ only by the presence or absence of an accident would show whether the reported consistency gap remains or shrinks when generation artifacts are removed.

Figures

Figures reproduced from arXiv: 2604.20460 by Alois Knoll, André Schamschurko, Hao Guo, Hu Cao, Mingyu Liu, Rui Song, Walter Zimmer, Xingcheng Zhou.

Figure 1: Illustration of the contrastive quadruple eval.
Figure 2: Dataset curation pipeline of CCTVBench. Each QA pair follows a fixed four-way decision pattern: on the positive video v_s^+, the positive question q_{s,k}^+ should be answered Yes while its counterpart q_{s,k}^- should be No; on the counterfactual negative-control video v_s^-, both questions should be answered No. Finally, human validation removes samples with generation artifacts or semantic mismatch…
Figure 3: Left: distribution of event types from various…
Figure 4: Quadruple example of CCTVBench with contrastive video pair.
Figure 5: Normalized four failure-mode composition.
Figure 6: Category-wise QuadAcc radar plots for (a) small, (b) medium, and (c) large/commercial model groups.
Figure 7: CCTVBench dataset statistics.
Figure 8: Safe prompts for counterfactual video synthesis.
Figure 10: …reveal that contrastive decoding improves standard QA scores but is not always aligned with…
Figure 12: Category-wise classification score (Score…).
Figure 13: Category-wise question consistency radar plots for small, medium, and large/commercial model groups.
Figure 14: Category-wise video consistency radar plots for small, medium, and large/commercial model groups.
Figure 15: Effect of contrast strength α on Qwen3-VL-2B under VCD, TCD, and C-TCD. We report video consistency, question consistency, balanced accuracy, and QuadAcc.
Figure 16: Qualitative example 1: Intersection T-bone collision between the black car and the ego vehicle.
Figure 17: Qualitative example 2: Snowy-road turning scenario with loss-of-control collision.
Figure 18: Qualitative example 3: Urban scenario with a head-on collision between a red motorcycle and the ego vehicle.
read the original abstract

Safety-critical traffic reasoning requires contrastive consistency: models must detect true hazards when an accident occurs, and reliably reject plausible-but-false hypotheses under near-identical counterfactual scenes. We present CCTVBench, a Contrastive Consistency Traffic VideoQA Benchmark built on paired real accident videos and world-model-generated counterfactual counterparts, together with minimally different, mutually exclusive hypothesis questions. CCTVBench enforces a single structured decision pattern over each video question quadruple and provides actionable diagnostics that decompose failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation, while separating video versus question consistency. Experiments across open-source and proprietary video LLMs reveal a large and persistent gap between standard per-instance QA metrics and quadruple-level contrastive consistency, with unreliable none-of-the-above rejection as a key bottleneck. Finally, we introduce C-TCD, a contrastive decoding approach leveraging a semantically exclusive counterpart video as the contrast input at inference time, improving both instance-level QA and contrastive consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CCTVBench, a benchmark for multimodal LLMs on traffic video QA that pairs real accident videos with world-model-generated counterfactual counterparts and supplies minimally different, mutually exclusive hypothesis questions. It defines a structured quadruple decision pattern and decomposes failures into positive omission, positive swap, negative hallucination, and mutual-exclusivity violation (separating video vs. question consistency). Experiments across open- and closed-source video LLMs demonstrate a gap between standard per-instance QA accuracy and quadruple-level contrastive consistency, identify unreliable none-of-the-above rejection as a bottleneck, and propose C-TCD, a contrastive decoding method that uses a semantically exclusive counterpart video at inference time to improve both metrics.

Significance. If the counterfactual videos satisfy the minimality and realism assumptions, the work supplies a useful diagnostic lens for safety-critical video reasoning and a practical inference-time mitigation. The structured failure taxonomy and explicit separation of video/question consistency are constructive contributions beyond standard VQA metrics.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the claim that world-model-generated counterfactuals are 'minimally different' and free of artifacts that could affect model behavior is load-bearing for attributing the reported gap to absence of contrastive reasoning rather than low-level visual discrepancies (e.g., physics, lighting, trajectories). No quantitative controls (perceptual similarity, optical-flow divergence, or human validation scores) or explicit artifact audit are supplied to verify this minimality.
  2. [Experiments] Experiments section: the central claim of a 'large and persistent gap' between per-instance QA metrics and quadruple-level contrastive consistency is presented without tabulated numerical values, model-by-model breakdowns, or statistical tests in the provided description; this prevents verification that the gap is not an artifact of the chosen quadruple construction or evaluation protocol.
minor comments (2)
  1. [C-TCD] The C-TCD decoding procedure would benefit from an explicit algorithmic listing or pseudocode showing how the contrast video is chosen and how the final answer is aggregated.
  2. [Diagnostics] Notation for the four failure modes and the quadruple consistency score should be introduced with a compact mathematical definition rather than prose alone.
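
One way to make the second minor comment concrete: a sketch of how the four failure modes named in the abstract might be assigned from a quadruple's Yes/No answers. The precise definitions belong to the paper; the mapping below is one interpretation of the mode names, using the same cell keys as the quadruple sketch earlier on this page.

```python
def failure_modes(ans):
    """ans maps the four quadruple cells to 'yes'/'no':
    keys are ('real' | 'counterfactual', 'positive' | 'counterpart').
    Returns the set of failure-mode labels this interpretation would assign."""
    modes = set()
    if ans[("real", "positive")] == "no" and ans[("real", "counterpart")] == "no":
        modes.add("positive_omission")        # true hazard missed entirely
    if ans[("real", "positive")] == "no" and ans[("real", "counterpart")] == "yes":
        modes.add("positive_swap")            # wrong hypothesis chosen on the real video
    if "yes" in (ans[("counterfactual", "positive")], ans[("counterfactual", "counterpart")]):
        modes.add("negative_hallucination")   # an event asserted on the counterfactual video
    for video in ("real", "counterfactual"):
        if ans[(video, "positive")] == "yes" and ans[(video, "counterpart")] == "yes":
            modes.add("mutual_exclusivity_violation")
    return modes
```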

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We appreciate the positive assessment of the benchmark's diagnostic value for safety-critical video reasoning and the structured failure taxonomy. We address the two major comments below, committing to revisions where they strengthen the manuscript without altering its core claims.

read point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the claim that world-model-generated counterfactuals are 'minimally different' and free of artifacts that could affect model behavior is load-bearing for attributing the reported gap to absence of contrastive reasoning rather than low-level visual discrepancies (e.g., physics, lighting, trajectories). No quantitative controls (perceptual similarity, optical-flow divergence, or human validation scores) or explicit artifact audit are supplied to verify this minimality.

    Authors: We agree that explicit verification of minimality is important for attributing failures to reasoning rather than low-level visual differences. The manuscript describes the world-model generation process and the design of mutually exclusive hypotheses, but does not include quantitative controls such as perceptual similarity metrics, optical-flow divergence, or human validation scores. In the revision we will add a dedicated paragraph with human annotator ratings on visual similarity and realism (collected on a subset of pairs) together with automated checks (e.g., SSIM and average optical-flow magnitude) to document that discrepancies remain below thresholds that would plausibly alter model behavior. This addition will directly support the load-bearing claim. revision: yes

  2. Referee: [Experiments] Experiments section: the central claim of a 'large and persistent gap' between per-instance QA metrics and quadruple-level contrastive consistency is presented without tabulated numerical values, model-by-model breakdowns, or statistical tests in the provided description; this prevents verification that the gap is not an artifact of the chosen quadruple construction or evaluation protocol.

    Authors: The full manuscript already contains model-by-model numerical results for both standard QA accuracy and quadruple-level contrastive consistency (Table 2 and the accompanying figure), showing a consistent gap of roughly 25–35 points across open- and closed-source models. To address the concern that the gap might be protocol-dependent, we will add (i) explicit per-model tabulated values with standard deviations, (ii) a paired statistical test (Wilcoxon signed-rank) confirming the gap is significant, and (iii) an ablation on quadruple construction variants. These clarifications will be placed in the main experiments section rather than supplementary material. revision: partial
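
The paired test the authors propose is straightforward to reproduce once per-model numbers are tabulated; a minimal SciPy sketch, with placeholder scores rather than the paper's actual results:

```python
from scipy.stats import wilcoxon

# Placeholder per-model scores (NOT the paper's numbers): standard per-instance
# accuracy vs. quadruple-level contrastive consistency for the same models.
per_instance_acc = [0.78, 0.81, 0.74, 0.69, 0.83, 0.72]
quad_consistency = [0.49, 0.55, 0.41, 0.38, 0.58, 0.44]

# Paired one-sided test: is accuracy systematically higher than consistency across models?
stat, p_value = wilcoxon(per_instance_acc, quad_consistency, alternative="greater")
print(f"W={stat:.1f}, p={p_value:.4f}")
```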

Circularity Check

0 steps flagged

No significant circularity; benchmark construction and C-TCD are independent empirical contributions

full rationale

The paper defines CCTVBench via externally sourced real accident videos paired with world-model-generated counterfactuals, then measures empirical gaps in model performance on quadruple-level contrastive consistency and introduces C-TCD as a separate inference-time decoding technique. No equations, fitted parameters, or self-citations reduce the reported metrics or improvements to quantities defined by the benchmark itself; the contrastive diagnostics and consistency scores are computed directly from model outputs on the held-out pairs without tautological re-use of the same data for both definition and prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper's central contributions rest on the new benchmark definition and the C-TCD method; they rely on one domain assumption about counterfactual generation quality and introduce two new entities without independent evidence outside this work.

axioms (1)
  • domain assumption World models can generate realistic counterfactual traffic scenes that are minimally different from real videos yet semantically exclusive for hypothesis testing
    Invoked to justify the paired video construction and contrastive evaluation in the benchmark.
invented entities (2)
  • CCTVBench no independent evidence
    purpose: Benchmark enforcing contrastive consistency via video-question quadruples and specific failure diagnostics
    Newly defined in this paper; no prior existence claimed.
  • C-TCD no independent evidence
    purpose: Contrastive decoding that uses a semantically exclusive counterpart video as additional input at inference time
    Newly proposed method in this paper.

pith-pipeline@v0.9.0 · 5493 in / 1336 out tokens · 99348 ms · 2026-05-10T01:27:27.255176+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 11 canonical work pages
