Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
Pith reviewed 2026-05-08 10:11 UTC · model grok-4.3
The pith
Event-Causal RAG organizes long videos into causal event graphs to support reasoning over extended temporal gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Event-Causal RAG segments videos into semantically coherent events stored as State-Event-State graphs in a dual-store memory system. It uses causal-topological retrieval to provide relevant event chains and video evidence to a foundation model, leading to superior performance on benchmarks for multi-event causal reasoning in long videos.
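The SES representation is described only at a high level. As a concrete reading, a hypothetical schema might look like the sketch below; all class and field names are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class State:
    """A snapshot of the scene before or after an event (illustrative schema)."""
    timestamp: float
    description: str  # e.g. "the man in the black t-shirt holds an empty cup"

@dataclass
class SESNode:
    """State-Event-State triple: an event bridging a pre-state and a post-state."""
    pre_state: State
    event: str        # e.g. "he pours coffee into the cup"
    post_state: State

@dataclass
class EventKnowledgeGraph:
    """Global graph merging SES nodes; edges mark hypothesized causal links."""
    nodes: list = field(default_factory=list)
    causal_edges: list = field(default_factory=list)  # (cause_idx, effect_idx)

    def add(self, node: SESNode) -> int:
        """Insert an SES node and return its index in the global graph."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def link(self, cause: int, effect: int) -> None:
        """Record a directed causal edge between two event indices."""
        self.causal_edges.append((cause, effect))
```

On this reading, merging per-event SES graphs into the global Event Knowledge Graph amounts to appending nodes and adding cross-event causal edges; the paper does not specify how those edges are inferred.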
What carries the argument
The State-Event-State (SES) graph, which represents each event along with its preceding and following states to capture transitions, combined with the Event Knowledge Graph for global causal structure and dual-store memory for efficient retrieval.
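The dual-store memory and the bidirectional causal-topological retrieval it enables can be sketched as a two-stage lookup: semantic matching to find seed events, then expansion along causal edges in both directions. The seed-then-expand logic below is an assumption about how the two stores might combine, since the text gives no algorithmic detail.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class DualStoreMemory:
    """Hypothetical dual store: a vector index for semantic matching plus a
    causal adjacency index for topological expansion (names illustrative)."""

    def __init__(self):
        self.embeddings = {}                   # event_id -> embedding
        self.successors = defaultdict(list)    # event_id -> causal effects
        self.predecessors = defaultdict(list)  # event_id -> causal causes

    def add_event(self, eid, emb):
        self.embeddings[eid] = emb

    def add_causal_edge(self, cause, effect):
        self.successors[cause].append(effect)
        self.predecessors[effect].append(cause)

    def retrieve(self, query_emb, k=2, hops=1):
        """Stage 1: semantic top-k seeds. Stage 2: bidirectional expansion
        along causal edges to recover the surrounding event chain."""
        seeds = sorted(self.embeddings,
                       key=lambda e: -cosine(self.embeddings[e], query_emb))[:k]
        chain, frontier = set(seeds), set(seeds)
        for _ in range(hops):
            nxt = set()
            for e in frontier:
                nxt.update(self.successors[e])
                nxt.update(self.predecessors[e])
            chain |= nxt
            frontier = nxt
        return chain
```

The point of the expansion step is that a causally relevant event can be retrieved even when its own embedding does not match the query, as long as a neighbor on the causal chain does.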
Load-bearing premise
That automatically segmenting videos into semantically coherent events and modeling them as State-Event-State graphs will accurately capture causal dependencies without segmentation errors that affect retrieval and reasoning.
What would settle it
A test where videos have ambiguous or overlapping events leading to poor segmentation, showing if the method underperforms clip-based baselines on causal inference tasks.
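One cheap way to run such a stress test without collecting new videos is to perturb gold event boundaries and measure how downstream accuracy degrades as segmentation quality falls. The jitter-and-drop scheme below is a hypothetical probe, not an experiment from the paper.

```python
import random

def perturb_boundaries(boundaries, jitter=2.0, drop_prob=0.3, seed=0):
    """Simulate ambiguous or overlapping events by jittering boundary
    timestamps (seconds) and dropping some boundaries entirely, which
    merges adjacent events. Feeding the result to the segmentation-dependent
    pipeline would show how retrieval degrades with boundary error."""
    rng = random.Random(seed)
    out = []
    for b in boundaries:
        if rng.random() < drop_prob:
            continue  # boundary missed: two events merge into one
        out.append(b + rng.uniform(-jitter, jitter))
    return sorted(out)
```

Sweeping `jitter` and `drop_prob` and plotting causal-QA accuracy against them would directly test whether the method degrades below clip-based baselines once segmentation becomes unreliable.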
Original abstract
Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
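The abstract's $O(n^2)$ motivation is easy to make concrete with a back-of-the-envelope FLOP count. The token rates below are assumed purely for illustration; the comparison only shows why shrinking the context to a few retrieved events changes the asymptotics.

```python
def self_attention_flops(n_tokens, d_model):
    """Rough O(n^2 d) cost of one self-attention layer (QK^T plus AV)."""
    return 2 * n_tokens * n_tokens * d_model

def rag_context_flops(k_retrieved, tokens_per_event, d_model):
    """Same estimate when only k retrieved events enter the context."""
    n = k_retrieved * tokens_per_event
    return 2 * n * n * d_model

# Illustrative numbers: a 1-hour video at 1 token per frame, 30 fps,
# gives n = 108_000 tokens for end-to-end attention, versus
# 8 retrieved events of 256 tokens each (n = 2_048) for the RAG path.
```

Because the cost is quadratic in context length, even a ~50x reduction in tokens yields a >1000x reduction in attention FLOPs, which is the gap the retrieval-augmented design exploits.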
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Event-Causal RAG, a lightweight retrieval-augmented framework for ultra-long or infinite video reasoning. Videos are segmented into semantically coherent events, each represented as a State-Event-State (SES) graph that captures the event and surrounding state transitions; these are merged into a global Event Knowledge Graph stored in a dual-store memory supporting semantic matching and causal-topological retrieval. A bidirectional retrieval strategy then supplies relevant event causal chains plus video evidence to a backbone video foundation model. The authors claim consistent outperformance over clip-based retrieval baselines and long-context video models on long-video benchmarks, especially for multi-event integration and causal inference across temporal gaps, together with gains in memory efficiency and streaming robustness.
Significance. If the experimental claims hold after proper validation, the work could advance long-video understanding by replacing quadratic self-attention and fragmented clip memory with structured event-level causal modeling, offering a practical path toward coherent reasoning over extended temporal spans.
major comments (2)
- [Abstract] The claim of consistent outperformance on long-video understanding benchmarks is presented without quantitative results, baseline names, dataset identifiers, or ablation studies, preventing direct assessment of the magnitude or reliability of the reported gains.
- [Method] SES graph and Event Knowledge Graph construction: the central experimental claim of superior performance on multi-event causal questions rests on the premise that event segmentation into SES graphs reliably encodes causal dependencies without propagating segmentation errors into the global graph or retrieval indices. The manuscript supplies no segmentation accuracy metrics, no error-propagation analysis, and no ablation isolating the contribution of the SES structure relative to simpler clip-level retrieval.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: [Abstract] The claim of consistent outperformance on long-video understanding benchmarks is presented without quantitative results, baseline names, dataset identifiers, or ablation studies, preventing direct assessment of the magnitude or reliability of the reported gains.
Authors: We agree that the abstract would be more informative with concrete details. In the revised manuscript we will expand the abstract to include specific quantitative gains (e.g., accuracy deltas on named long-video QA benchmarks), the identities of the clip-based retrieval baselines and long-context video models, and a brief reference to the ablation studies. These results are already reported in Section 4; we will simply surface the most salient numbers and identifiers in the abstract. revision: yes
-
Referee: [Method] SES graph and Event Knowledge Graph construction: the central experimental claim of superior performance on multi-event causal questions rests on the premise that event segmentation into SES graphs reliably encodes causal dependencies without propagating segmentation errors into the global graph or retrieval indices. The manuscript supplies no segmentation accuracy metrics, no error-propagation analysis, and no ablation isolating the contribution of the SES structure relative to simpler clip-level retrieval.
Authors: We accept that the current version lacks explicit segmentation accuracy metrics, error-propagation analysis, and a dedicated ablation of SES versus clip-level retrieval. We will add a new subsection (or appendix) containing: (i) segmentation accuracy measured against human-annotated event boundaries on a held-out subset, (ii) a qualitative and quantitative discussion of error propagation together with the mitigation provided by the dual-store memory and bidirectional retrieval, and (iii) an ablation that directly compares the full SES-based pipeline against a simpler clip-level retrieval baseline. These additions will be included in the revised paper. revision: yes
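For point (i), boundary precision/recall against human annotations with a temporal tolerance window is a standard way to score event segmentation. The sketch below is one such metric; the authors have not specified which metric they will actually report.

```python
def boundary_f1(pred, gold, tol=1.0):
    """Boundary precision/recall/F1 with a temporal tolerance in seconds.
    Each gold boundary may be matched by at most one predicted boundary.
    Illustrative metric, not necessarily the one the authors will adopt."""
    matched_gold = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gold):
            if i not in matched_gold and abs(p - g) <= tol:
                matched_gold.add(i)
                tp += 1
                break
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Reporting this alongside the SES-versus-clip ablation would let readers see whether QA gains track segmentation quality, which is exactly the error-propagation question the referee raises.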
Circularity Check
No circularity; engineering framework with independent empirical claims.
full rationale
The paper describes Event-Causal RAG as a new retrieval-augmented architecture that segments videos into State-Event-State graphs, builds a global Event Knowledge Graph, and uses dual-store memory with bidirectional causal-topological retrieval. No equations, fitted parameters, or derivations appear in the provided text. The method is presented as an explicit construction rather than a reduction of any claimed prediction or uniqueness result to its own inputs. Experimental outperformance is asserted via benchmark comparisons without self-citation chains or ansatzes that smuggle in the target behavior. This is a standard descriptive systems paper whose validity rests on external evaluation, not internal self-reference.
Axiom & Free-Parameter Ledger
invented entities (2)
-
State-Event-State (SES) graph
no independent evidence
-
Event Knowledge Graph
no independent evidence
Reference graph
Works this paper leans on
-
[1]
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024
-
[2]
MA-LMM: Memory-augmented large multimodal model for long-term video understanding
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, and Ser-Nam Lim. MA-LMM: Memory-augmented large multimodal model for long-term video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13504–13514, 2024
2024
-
[3]
Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, and Michael S. Ryoo. Understanding long videos with multimodal language models. In International Conference on Learning Representations (ICLR), 2025
2025
-
[4]
Heqing Zou, Tianze Luo, Guiyang Xie, Victor Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, and Huaijian Zhang. From seconds to hours: Reviewing multimodal large language models on comprehensive long video understanding. arXiv preprint arXiv:2409.18938, 2024
-
[5]
Video-XL: Extra-long vision language model for hour-scale video understanding
Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-XL: Extra-long vision language model for hour-scale video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26160–26169, 2025
2025
-
[6]
LongVLM: Efficient long video understanding via large language models
Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, and Bohan Zhuang. LongVLM: Efficient long video understanding via large language models. In Proceedings of the European Conference on Computer Vision (ECCV), pages 453–470, 2024
2024
-
[7]
EgoSchema: A diagnostic benchmark for very long-form video language understanding
Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. EgoSchema: A diagnostic benchmark for very long-form video language understanding. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track, 2023
2023
-
[8]
LongVideoBench: A benchmark for long-context interleaved video-language understanding
Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. LongVideoBench: A benchmark for long-context interleaved video-language understanding. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track, 2024
2024
-
[9]
Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Caifeng Shan, Ran He, and Xing Sun. Video-MME: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InP...
2025
-
[10]
Towards event-oriented long video understanding, 2024
Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, and Ji-Rong Wen. Towards event-oriented long video understanding, 2024
2024
-
[11]
Yumeng Shi, Quanyu Long, Yin Wu, and Wenya Wang. Causality matters: How temporal information emerges in video language models. arXiv preprint arXiv:2508.11576, 2026. AAAI 2026
-
[12]
EventVAD: Training-free event-aware video anomaly detection
Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, and Shuyan Li. EventVAD: Training-free event-aware video anomaly detection. In Proceedings of the 33rd ACM International Conference on Multimedia (MM), pages 2586–2595, 2025
2025
-
[13]
Deep bilstm attention model for spatial and temporal anomaly detection in video surveillance
Sarfaraz Natha, Fareed Ahmed, Mohammad Siraj, Mehwish Lagari, Majid Altamimi, and Asghar Ali Chandio. Deep bilstm attention model for spatial and temporal anomaly detection in video surveillance. Sensors, 25(1):251, 2025
2025
-
[14]
Anomaly detection in traffic surveillance videos using deep learning. Sensors, 22(17), 2022
Sardar Waqar Khan, Qasim Hafeez, Muhammad Irfan Khalid, Roobaea Alroobaea, Saddam Hussain, Jawaid Iqbal, Jasem Almotiri, and Syed Sajid Ullah. Anomaly detection in traffic surveillance videos using deep learning. Sensors, 22(17), 2022
2022
-
[15]
Ketan Pawar and Vahida Z. Attar. Deep learning based detection and localization of road accidents from traffic surveillance videos. ICT Express, 8(3):379–387, 2022
2022
-
[16]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017
2017
-
[17]
Hongchen Wei and Zhenzhong Chen. Visual context window extension: A new perspective for long video understanding. arXiv preprint arXiv:2409.20018, 2024
-
[18]
V2PE: Improving multimodal long-context capability of vision-language models with variable visual position encoding
Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, and Xizhou Zhu. V2PE: Improving multimodal long-context capability of vision-language models with variable visual position encoding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 21070–21084, 2025
2025
-
[19]
Model tells you what to discard: Adaptive kv cache compression for llms
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. In International Conference on Learning Representations (ICLR), 2024. Oral
2024
-
[20]
MiniCache: Kv cache compression in depth dimension for large language models
Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. MiniCache: Kv cache compression in depth dimension for large language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Main Conference Track, 2024
2024
-
[21]
InfiniPot-V: Memory-constrained kv cache compression for streaming video understanding
Minsoo Kim, Kyuhong Shim, Jungwook Choi, and Simyung Chang. InfiniPot-V: Memory-constrained kv cache compression for streaming video understanding. In Advances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[22]
Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, and Shuming Shi. CORM: Cache optimization with recent message for large language model inference. arXiv preprint arXiv:2404.15949, 2024
-
[23]
Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference
Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. In Findings of the Association for Computational Linguistics: ACL 2024, pages 3258–3270, 2024
2024
-
[24]
CacheGen: Kv cache compression and streaming for fast large language model serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. CacheGen: Kv cache compression and streaming for fast large language model serving. In Proceedings of the ACM SIGCOMM Conference, 2024
2024
-
[25]
Lost in the middle: How language models use long contexts
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
2024
-
[26]
Rafael Souza, Jia-Hao Lim, and Alexander Davis. Temporal contrastive learning for video temporal reasoning in large vision-language models. arXiv preprint arXiv:2412.11391, 2024
-
[27]
MECD: Unlocking multi-event causal discovery in video reasoning
Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, and Weiyao Lin. MECD: Unlocking multi-event causal discovery in video reasoning. In Advances in Neural Information Processing Systems (NeurIPS), 2024. Spotlight paper
2024
-
[28]
VideoRAG: Retrieval-augmented generation over video corpus
Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval-augmented generation over video corpus. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21278–21298, Vienna, Austria, 2025. Association for Computational Linguistics
2025
-
[29]
Video-RAG: Visually-aligned retrieval-augmented long video comprehension
Yongdong Luo, Xiawu Zheng, Guilin Li, Shukang Yin, Haojia Lin, Chaoyou Fu, Jinfa Huang, Jiayi Ji, Fei Chao, Jiebo Luo, and Rongrong Ji. Video-RAG: Visually-aligned retrieval-augmented long video comprehension. In Advances in Neural Information Processing Systems (NeurIPS), 2025
2025
-
[30]
VideoRAG: Retrieval-augmented generation with extreme long-context videos
Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. VideoRAG: Retrieval-augmented generation with extreme long-context videos. arXiv preprint arXiv:2502.01549, 2025
-
[31]
MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation
Chi-Hsiang Hsiao, Yi-Cheng Wang, Tzung-Sheng Lin, Yi-Ren Yeh, and Chu-Song Chen. MegaRAG: Multimodal knowledge graph-based retrieval augmented generation. arXiv preprint arXiv:2512.20626
-
[32]
Listed as ACL 2026 in the arXiv metadata available during verification
2026
-
[33]
Ask in any modality: A comprehensive survey on multimodal retrieval-augmented generation
Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, and Ehsaneddin Asgari. Ask in any modality: A comprehensive survey on multimodal retrieval-augmented generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 16776–16809, Vi...
2025
-
[34]
MMA-RAG: A survey on multimodal agentic retrieval-augmented generation
Vladana Perlic, Stephane Lebailly, Vadim Malvone, Van-Tam Nguyen, and Pascal Urard. MMA-RAG: A survey on multimodal agentic retrieval-augmented generation. SSRN preprint / HAL preprint hal-05322313, 2025
2025
-
[35]
Action scene graphs for long-form understanding of egocentric videos
Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, and Giovanni Maria Farinella. Action scene graphs for long-form understanding of egocentric videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18622–18632, 2024
2024
-
[36]
NExT-QA: Next phase of question-answering to explaining temporal actions
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
2021
-
[37]
Multimodal event causality reasoning with scene graph enhanced interaction network
Jintao Liu, Kaiwen Wei, and Chenglong Liu. Multimodal event causality reasoning with scene graph enhanced interaction network. Proceedings of the AAAI Conference on Artificial Intelligence, 38(8):8778–8786, 2024
2024
-
[38]
Shuming Liu, Lin Sui, Chen-Lin Zhang, Fangzhou Mu, Chen Zhao, and Bernard Ghanem. Harnessing temporal causality for advanced temporal action detection. arXiv preprint arXiv:2407.17792, 2024
-
[39]
Streaming long video understanding with large language models
Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, and Jiaqi Wang. Streaming long video understanding with large language models. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Main Conference Track, 2024
2024
-
[40]
StreamingVLM: Real-time understanding for infinite video streams
Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, and Song Han. StreamingVLM: Real-time understanding for infinite video streams. In International Conference on Learning Representations (ICLR), 2026
2026
-
[41]
Learning streaming video representation via multitask training
Yibin Yan, Jilan Xu, Shangzhe Di, Yikun Liu, Yudi Shi, Qirui Chen, Zeqian Li, Yifei Huang, and Weidi Xie. Learning streaming video representation via multitask training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9900–9912, 2025
2025
-
[42]
Streaming videollms for real-time procedural video understanding
Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, and Fadime Sener. Streaming videollms for real-time procedural video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22586–22598, 2025
2025
-
[43]
Streaming video understanding and multi-round interaction with memory-enhanced knowledge
Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. In International Conference on Learning Representations (ICLR), 2025
2025
-
[44]
Nianbo Zeng, Haowen Hou, Fei Richard Yu, Si Shi, and Ying Tiffany He. SceneRAG: Scene-level retrieval-augmented generation for video understanding. arXiv preprint arXiv:2506.07600, 2025
-
[45]
Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding
Xiaoqian Shen, Wenxuan Zhang, Jun Chen, and Mohamed Elhoseiny. Vgent: Graph-based retrieval-reasoning-augmented generation for long video understanding. In Advances in Neural Information Processing Systems, 2025. Spotlight
2025
-
[46]
ViG-RAG: Video-aware graph retrieval-augmented generation via temporal and semantic hybrid reasoning
Zongsheng Cao, Anran Liu, Yangfan He, Jing Li, Bo Zhang, and Zigan Wang. ViG-RAG: Video-aware graph retrieval-augmented generation via temporal and semantic hybrid reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 48–56, 2026
2026
-
[47]
Temporal chain of thought: Long-video understanding by thinking in frames, 2025
Anurag Arnab, Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Temporal chain of thought: Long-video understanding by thinking in frames. arXiv preprint arXiv:2507.02001, 2025
-
[48]
Video-of-thought: Step-by-step video reasoning from perception to cognition
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, and Wynne Hsu. Video-of-thought: Step-by-step video reasoning from perception to cognition. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 13109–13125. PMLR, 2024
2024
-
[49]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
2023