
arXiv: 2606.05008 · v1 · pith:VHSP2VI7new · submitted 2026-06-03 · 💻 cs.CV · cs.AI · cs.CL

M³Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Pith reviewed 2026-06-28 06:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL
keywords multi-modal models · memory evaluation · video understanding · benchmark · cognitive psychology · interference · spatial-temporal grounding · symbolic memory

The pith

The M³Eval benchmark tests memory in multi-modal video models and finds that they struggle to keep parallel streams disentangled and ground memory sources more reliably in space than in time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents M³Eval as the first benchmark designed to evaluate memory dimensions in multi-modal models for long-form video understanding. Tasks are built from cognitive psychology principles to separate memory retention, interference, and source grounding from perception or reasoning demands. Experiments on representative models show consistent patterns: difficulty maintaining separate representations across simultaneous video streams, interference behaviors unlike those in humans, stronger reliability for spatial memory sources than temporal ones, and weak performance on symbolic memory. These results position memory as a distinct capability that current models handle unevenly.

Core claim

M³Eval supplies a set of video tasks that isolate memory aspects such as handling interference from parallel streams, distinguishing spatial versus temporal source grounding, and retaining symbolic content. When applied to existing multi-modal models the tasks expose four recurring limitations: failure to preserve disentangled representations under concurrent inputs, interference signatures that diverge from human data, more accurate memory attachment to spatial cues than to temporal sequence, and restricted capacity for symbolic recall.

What carries the argument

M³Eval, an evaluation framework whose tasks isolate memory dimensions through cognitively-grounded video scenarios that probe retention fidelity, interference robustness, and domain-specific grounding.

If this is right

  • The benchmark supplies a reusable resource for testing memory mechanisms in future multi-modal models.
  • Insights from the observed weaknesses can guide construction of memory modules that better preserve disentanglement and temporal precision.
  • Systematic separation of memory evaluation from perception benchmarks becomes necessary as video understanding lengthens.
  • Design choices for new models should target the identified gaps in spatial-temporal balance and symbolic retention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The identified gaps could motivate training objectives that explicitly penalize cross-stream interference.
  • Extending the tasks to longer or more complex video collections might reveal scaling behavior of the same weaknesses.
  • Direct comparison of model outputs against human recall data on identical tasks could quantify how far the divergence extends.
  • The benchmark format might transfer to evaluating memory in single-modality language or audio models.

Load-bearing premise

The constructed tasks successfully isolate the intended memory dimensions from perception and reasoning confounds.

What would settle it

Running the parallel-stream and interference tasks on the same models and finding either fully disentangled representations or interference patterns that match human data would falsify the reported weaknesses.

Figures

Figures reproduced from arXiv: 2606.05008 by Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yiwu Zhong, Yixin Zhu.

Figure 1
M³Eval, our principled framework and benchmark for evaluating memory capabilities of multi-modal models. We present an example task of divided attention. Grounded in psychological theory, we construct split-screen video scenarios, design memory questions, and analyze multiple models in terms of source identification, order understanding, and content retention.
Figure 2
Overview of the unified and coherent framework for our four evaluation paradigms.
Figure 3
Divided Attention. Split-screen presentation with optional frame swaps. Psychological Theory. The divided attention paradigm originates from research on limited attentional resources and dual-task processing [27, 3]. In classic experiments, participants perform two tasks simultaneously, competing for attentional resources and resulting in reduced encoding quality and impaired memory retention [27, 9, 60, 5…
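The split-screen presentation with optional frame swaps can be sketched as a per-frame layout schedule. This is only an illustration under stated assumptions, not the authors' implementation: the function name `split_screen_schedule`, the `swap_every` parameter, and the use of string labels in place of decoded frames are all hypothetical.

```python
from typing import List, Tuple


def split_screen_schedule(
    video1: List[str], video2: List[str], swap_every: int = 0
) -> List[Tuple[str, str]]:
    """Return (left, right) frame pairs for a split-screen clip.

    swap_every = 0 disables swapping; otherwise the two sources trade
    sides every `swap_every` frames. All names here are hypothetical.
    """
    n = min(len(video1), len(video2))  # truncate to the shorter source
    pairs = []
    for t in range(n):
        # Odd-numbered blocks of `swap_every` frames are shown swapped.
        swapped = swap_every > 0 and (t // swap_every) % 2 == 1
        a, b = video1[t], video2[t]
        pairs.append((b, a) if swapped else (a, b))
    return pairs


# Placeholder frame labels stand in for real video frames.
v1 = [f"A{t}" for t in range(6)]
v2 = [f"B{t}" for t in range(6)]
schedule = split_screen_schedule(v1, v2, swap_every=2)
```

A smaller `swap_every` corresponds to more frequent left/right swaps, so a question about "the left video" no longer maps to a single source.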
Figure 4
Memory Interference. Proactive interference: earlier learning disrupts later memory. Retroactive interference: later learning impairs earlier memory.
Figure 5
Interleaved Events. Interleaved presentation of video clips from two sources. Psychological Theory. Mandler [40, 39] demonstrated that, when presented with intermixed storylines, individuals spontaneously recover the underlying event structure rather than following surface presentation order. This paradigm has become a classic test for memory organization. Instantiation in Video Understanding. We divide t…
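One way to picture the interleaved format is to cut each source into contiguous clips and alternate them. The sketch below is an assumption about the general shape of such a construction, since the source text is truncated before it describes the actual procedure; `chunk`, `interleave`, and `num_clips` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def chunk(frames: Sequence, num_clips: int) -> List[list]:
    """Split frames into num_clips contiguous clips of near-equal length."""
    size, extra = divmod(len(frames), num_clips)
    clips, start = [], 0
    for i in range(num_clips):
        end = start + size + (1 if i < extra else 0)
        clips.append(list(frames[start:end]))
        start = end
    return clips


def interleave(video_a: Sequence, video_b: Sequence,
               num_clips: int) -> List[Tuple[str, list]]:
    """Alternate clips from two sources, tagging each clip with its source."""
    mixed = []
    for clip_a, clip_b in zip(chunk(video_a, num_clips), chunk(video_b, num_clips)):
        mixed.append(("A", clip_a))
        mixed.append(("B", clip_b))
    return mixed


# Integers stand in for frames; 0-5 belong to source A, 100-105 to source B.
mixed = interleave(list(range(6)), list(range(100, 106)), num_clips=3)
sources = [src for src, _ in mixed]
recovered_a = [frame for src, clip in mixed if src == "A" for frame in clip]
```

Recovering `recovered_a` from the mixed stream mirrors the behavior the paradigm probes: reconstructing one storyline rather than following surface presentation order.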
Figure 6
N-Back. Abstracting videos into symbols and comparing them. Psychological Theory. Unlike episodic memory, symbolic memory concerns the ability to abstract events into symbolic representations [1, 46]. N-Back tasks present sequences of symbolic stimuli (e.g., letters, digits, or simple shapes) and require participants to decide whether the current stimulus …
Figure 7
Attention shifts induced by split-screen interference. For each case, the left panel shows the single-video condition, whereas the right panel shows the split-screen condition. In the split-screen setting, the question asks specifically about the left video. However, the model’s attention is disrupted by the concurrent right video, resulting in erroneous responses.
Figure 8
Video repetition improves accuracy under interference. Repeating either the target or interfering video yields performance gains, suggesting repetition as a promising strategy for enhancing model memory. Further experiment. We test whether repetition strategy can improve robustness to interference. This is done by repeating the target or the interfering video, forming [V1, V1, V2] and [V1, V2, V2] with qu…
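The two repetition orderings named in the caption, [V1, V1, V2] and [V1, V2, V2], can be written down as explicit conditions. In this sketch the function name, the dictionary keys, and the unrepeated [V1, V2] baseline are assumptions added for illustration; the source text is truncated before it states which video the question targets.

```python
from typing import Dict, List


def repetition_conditions(target: str) -> Dict[str, List[str]]:
    """Map condition names to presentation orders for a given target video.

    `target` is the video the question asks about ("V1" or "V2").
    The "baseline" entry is an assumed control, not taken from the paper.
    """
    if target not in ("V1", "V2"):
        raise ValueError("target must be 'V1' or 'V2'")
    other = "V2" if target == "V1" else "V1"
    repeated = {"V1": ["V1", "V1", "V2"], "V2": ["V1", "V2", "V2"]}
    return {
        "baseline": ["V1", "V2"],
        "repeat_target": repeated[target],
        "repeat_interferer": repeated[other],
    }


conditions = repetition_conditions("V1")
```

Comparing accuracy across the three conditions would then show whether repeating the target, the interferer, or neither helps most.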
Figure 9
Spatial source grounding outperforms temporal source grounding. Spatial source uses the split-screen format with frequent left/right swaps (§4.1); temporal source uses the interleaved format (§4.3).
Figure 10
Overall accuracy on the N-Back task. Performance of each model and human under two symbolic attributes (scene and action), averaged over all K and N configurations. Main results. Existing multi-modal models substantially lag behind humans, with many only slightly exceeding the random baseline. Among them, GPT-5.4 achieves the best performance. Interestingly, humans recall scene attributes more accurately …
Figure 11
Effects of N and K on accuracy. Points show per-model accuracy under different (N, K) settings, with linear fits for each model. The colored filled regions indicate ±1 standard deviation around the fit lines.
Figure 12
Duration histogram for the source videos in the non-N-Back portion of …
Figure 13
Example of Divided Attention targeting Source Identification. The three distractors replace certain content in the target video’s narrative with content from the distractor video, while the correct option (highlighted in yellow) faithfully describes only the target video.
Figure 14
Example of Divided Attention targeting Order Understanding. The three distractors swap the temporal or logical sequence of events in the target video’s narrative, while the correct option (highlighted in yellow) preserves the original order.
Figure 15
Example of Divided Attention targeting Content Retention. The three distractors replace certain content in the target video’s narrative with plausible but fabricated content, while the correct option (highlighted in yellow) faithfully describes only the target video.
Figure 16
Example of Memory Interference. Each question comprises the correct answer (highlighted …
Figure 17
Example of Interleaved Events targeting Source Identification. The three distractors replace certain content in the target video’s narrative with content from the distractor video, while the correct option (highlighted in yellow) faithfully describes only the target video.
Figure 18
Example of Interleaved Events targeting Order Understanding. The three distractors swap the temporal or logical sequence of events in the target video’s narrative, while the correct option (highlighted in yellow) preserves the original order.
Figure 19
Example of Interleaved Events targeting Content Retention. The three distractors replace certain content in the target video’s narrative with plausible but fabricated content, while the correct option (highlighted in yellow) faithfully describes only the target video.
Figure 20
Example of Interleaved Events targeting False Memory Discrimination. A fake question that is relevant to video content is presented, and the model should be aware to choose the option indicating that the query does not belong to either video. The correct answer is highlighted in yellow.
Figure 21
Example of Source Memory. Spatial refers to a split-screen format with frequent left/right …
Figure 22
Example of N-Back. The model is asked to decide whether the final clip matches the clip …
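The N-Back decision described in Figures 6 and 22 can be sketched on symbol sequences. The code follows the standard N-back definition (does the final item match the item N positions earlier); treating each clip as a single scene or action label, and the names `n_back_label` and `accuracy`, are assumptions for illustration rather than the benchmark's own code.

```python
from typing import List, Sequence


def n_back_label(symbols: Sequence[str], n: int) -> bool:
    """Return True if the final symbol equals the symbol n positions earlier."""
    if n < 1 or len(symbols) <= n:
        raise ValueError("sequence must contain more than n symbols")
    return symbols[-1] == symbols[-1 - n]


def accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of match / no-match predictions that agree with the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical trials: each clip is reduced to one scene or action label.
trials = [
    (["kitchen", "beach", "kitchen"], 2),
    (["kitchen", "beach", "office"], 2),
    (["run", "jump", "jump"], 1),
]
labels = [n_back_label(seq, n) for seq, n in trials]
```

A model that always answers "match" would score `accuracy([True, True, True], labels)`, which is one way to sanity-check results against a trivial baseline.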
Original abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M³Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M³Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces M³Eval, the first comprehensive benchmark for evaluating memory dimensions in multi-modal models via cognitively-grounded video tasks designed to isolate memory from perception and reasoning. Through experiments on representative models, it reports consistent weaknesses: difficulty maintaining disentangled representations under parallel video streams, interference patterns unlike those in human memory, more reliable spatial than temporal grounding of memory sources, and limited symbolic memory. Code and dataset are released at the provided URL.

Significance. If the task isolation holds, this fills a clear gap by shifting focus from perception/reasoning to memory robustness in long-form video models, with findings that could guide mechanism design. Explicit credit is due for the public code and dataset release, which supports reproducibility and extension by the community.

major comments (2)
  1. [Abstract / task design] Abstract and task-construction description: the central claim that tasks 'isolate key aspects of memory' (and thereby make the reported weaknesses memory-specific) is load-bearing for all headline findings, yet no ablations, human-norming data, or controls are described that demonstrate performance is insensitive to perceptual noise or reasoning load variations.
  2. [Experiments] Experiments: no model list, statistical tests, or exclusion criteria are inspectable from the provided text, preventing verification that the 'consistent weaknesses' and 'distinctive behaviors' are robust rather than sensitive to post-hoc choices.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'extensive experiments across representative multi-modal models' would benefit from an explicit enumeration of the models even in the abstract for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and for recognizing the potential contribution of M³Eval along with the value of the released code and dataset. We address the two major comments below and will revise the manuscript to incorporate the requested clarifications and supporting analyses.

Point-by-point responses
  1. Referee: [Abstract / task design] Abstract and task-construction description: the central claim that tasks 'isolate key aspects of memory' (and thereby make the reported weaknesses memory-specific) is load-bearing for all headline findings, yet no ablations, human-norming data, or controls are described that demonstrate performance is insensitive to perceptual noise or reasoning load variations.

    Authors: We agree that explicit validation of task isolation is necessary to support the memory-specific interpretation of the results. The current manuscript grounds the tasks in cognitive psychology principles intended to separate memory from perception and reasoning, but does not include ablations, human-norming data, or targeted controls for perceptual noise and reasoning load. In revision we will add a dedicated subsection with these controls and preliminary human baselines to strengthen the isolation claim. revision: yes

  2. Referee: [Experiments] Experiments: no model list, statistical tests, or exclusion criteria are inspectable from the provided text, preventing verification that the 'consistent weaknesses' and 'distinctive behaviors' are robust rather than sensitive to post-hoc choices.

    Authors: The experiments section of the full manuscript lists the representative multi-modal models evaluated and provides implementation details. To improve inspectability and verifiability we will expand this section with an explicit model table, report appropriate statistical tests on the performance differences, and clearly document any exclusion criteria applied during data processing or evaluation. revision: yes
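As one illustration of the kind of statistical test the rebuttal promises, a paired bootstrap gives a confidence interval for the accuracy difference between two conditions scored on the same questions. The source does not name a specific test, so the choice of a paired bootstrap, the function name, and the 0/1 correctness vectors below are all assumptions; the values are synthetic placeholders, not results from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    correct_a: List[int],
    correct_b: List[int],
    num_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed difference, lower bound, upper bound).

    correct_a and correct_b hold 0/1 correctness for the same questions
    under two conditions. Questions are resampled with replacement and
    the percentile interval of the mean paired difference is reported.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    resampled = []
    for _ in range(num_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * num_resamples)]
    upper = resampled[int((1 - alpha / 2) * num_resamples) - 1]
    return observed, lower, upper


# Synthetic placeholder data: per-question correctness in two conditions.
single_video = [1, 1, 1, 0, 1, 1, 0, 1]
split_screen = [1, 0, 1, 0, 0, 1, 0, 1]
observed, lower, upper = paired_bootstrap_ci(single_video, split_screen)
```

If the interval excludes zero, the gap between the two conditions is unlikely to be a resampling artifact; pairing by question keeps per-item difficulty from inflating the variance.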

Circularity Check

0 steps flagged

Empirical benchmark construction with no derivation chain or self-referential reductions

full rationale

The paper introduces M³Eval as a benchmark for memory evaluation in multi-modal models, with tasks designed from cognitive psychology principles. The abstract and description contain no equations, fitted parameters, predictions derived from inputs, uniqueness theorems, or ansatzes. Central claims consist of empirical observations from model evaluations on constructed tasks; the isolation of memory dimensions is presented as a design feature rather than a result that reduces to prior outputs by construction. No load-bearing step rests on self-citation. The work is self-contained as an empirical framework tested against external models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations or fitted constants are present. The central premise rests on the domain assumption that cognitive-psychology memory dimensions can be isolated in video tasks for AI models.

axioms (1)
  • domain assumption: Cognitive psychology supplies isolatable memory dimensions that can be faithfully instantiated as video tasks without substantial perceptual or reasoning confounds.
    Stated directly in the abstract as the grounding for task design.



Reference graph

Works this paper leans on

114 extracted references · 28 canonical work pages · 11 internal anchors

  [1] John R Anderson and Gordon H Bower. Human associative memory. Psychology Press, 2014.

  [2] Kirolos Ataallah, Eslam Mohamed Bakr, Mahmoud Ahmed, Chenhui Gou, Khushbu Pahwa, Jian Ding, and Mohamed Elhoseiny. Infinibench: A benchmark for large multi-modal models in long-form movies and tv shows. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 19496–19523, 2025.

  [3] Alan Baddeley. Working memory. Comptes Rendus de l’Académie des Sciences-Series III-Sciences de la Vie, 321(2-3):167–173, 1998.

  [4] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025.

  [5] Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, and Hanghang Tong. Mem-gallery: Benchmarking multimodal long-term conversational memory for mllm agents. arXiv preprint arXiv:2601.03515, 2026.

  [6] Nicholas J Cepeda, Harold Pashler, Edward Vul, John T Wixted, and Doug Rohrer. Distributed practice in verbal recall tasks: A review and quantitative synthesis. Psychological bulletin, 132(3):354, 2006.

  [7] Keshigeyan Chandrasegaran, Agrim Gupta, Lea M Hadzic, Taran Kota, Jimming He, Cristóbal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, and Li Fei-Fei. Hourvideo: 1-hour video-language understanding. Advances in Neural Information Processing Systems, 37:53168–53197, 2024.

  [8] Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025.

  [9] Fergus IM Craik, Richard Govoni, Moshe Naveh-Benjamin, and Nicole D Anderson. The effects of divided attention on encoding and retrieval processes in human memory. Journal of Experimental Psychology: General, 125(2):159, 1996.

  [10] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence. Hugging Face model card, April 2026. URL https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro. Accessed: 2026-05-02.

  [11] James Deese. On the prediction of occurrence of particular verbal intrusions in immediate recall. Journal of experimental psychology, 58(1):17, 1959.

  [12] Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. In European Conference on Computer Vision, pages 75–92. Springer, 2024.

  [13] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025.

  [14] Dongyu Gong, Xingchen Wan, and Dingmin Wang. Working memory capacity of chatgpt: An empirical study. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 10048–10056, 2024.

  [15] Google DeepMind. Gemini 3.1 pro model card, 2026. URL https://deepmind.google/models/model-cards/gemini-3-1-pro/. Accessed: 2026-05-02.

  [16] Douglas L Hintzman. Repetition and memory. Psychology of learning and motivation, 10:47–91, 1976.

  [17] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

  [18] Chenxu Hu, Jie Fu, Chenzhuang Du, Simian Luo, Junbo Zhao, and Hang Zhao. Chatdb: Augmenting llms with databases as their symbolic memory. arXiv preprint arXiv:2306.03901, 2023.

  [19] Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, et al. Memory in the age of ai agents. arXiv preprint arXiv:2512.13564, 2025.

  [20] Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, et al. Nemo: Needle in a montage for video-language understanding. arXiv preprint arXiv:2509.24563, 2025.

  [21] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi0.5: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025.

  [22] Zixi Jia, Qinghua Liu, Hexiao Li, Yuyan Chen, and Jiqiang Liu. Evaluating the long-term memory of large language models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19759–19777, 2025.

  [23] Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, et al. The ai hippocampus: How far are we from human memory? arXiv preprint arXiv:2601.09113, 2026.

  [24] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.

  [25] Marcia K Johnson, Shahin Hashtroudi, and D Stephen Lindsay. Source monitoring. Psychological bulletin, 114(1):3, 1993.

  [26] Michael J Kahana and Anthony D Wagner. The Oxford handbook of human memory, two volume pack: foundations and applications. Oxford University Press, 2024.

  [27] D Kahneman. Attention and effort. Experimental psychology, 1973.

  [28] Wayne K Kirchner. Age differences in short-term retention of rapidly changing information. Journal of experimental psychology, 55(4):352, 1958.

  [29] Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. Babilong: Testing the limits of llms with long context reasoning-in-a-haystack. Advances in Neural Information Processing Systems, 37:106519–106554, 2024.

  [30] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Prompt repetition improves non-reasoning llms. arXiv preprint arXiv:2512.14982, 2025.

  [31] Jingyao Li, Jingyun Wang, Molin Tan, Haochen Wang, Cilin Yan, Likun Shi, Jiayin Cai, Xiaolong Jiang, and Yao Hu. Crossvid: A comprehensive benchmark for evaluating cross-video reasoning in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 6244–6252, 2026.

  [32] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

  [33] Miaoyu Li, Qin Chao, and Boyang Li. Two causally related needles in a video haystack. arXiv preprint arXiv:2505.19853, 2025.

  [34] Jiafeng Liang, Hao Li, Chang Li, Jiaqi Zhou, Shixin Jiang, Zekun Wang, Changkai Ji, Zhihao Zhu, Runxuan Liu, Tao Ren, et al. Ai meets brain: Memory systems from cognitive neuroscience to autonomous agents. arXiv preprint arXiv:2512.23343, 2025.

  [35] Junming Lin, Zheng Fang, Chi Chen, Haoxuan Cheng, Zihao Wan, Fuwen Luo, Ziyue Wang, Peng Li, Yang Liu, and Maosong Sun. Streamingbench: Assessing the gap for mllms to achieve streaming video understanding. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 12147–12151. IEEE, 2026.

  [36] Ruixun Liu, Lingyu Kong, Derun Li, and Hang Zhao. Occvla: Vision-language-action model with implicit 3d occupancy supervision. arXiv preprint arXiv:2509.05578, 2025.

  [37] Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, and Wei Li. Seeing, listening, remembering, and reasoning: A multimodal agent with long-term memory. arXiv preprint arXiv:2508.09736, 2025.

  [38] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

  [39] Jean M Mandler. A code in the node: The use of a story schema in retrieval. Discourse processes, 1(1):14–35, 1978.

  [40] Jean M Mandler and Nancy S Johnson. Remembrance of things parsed: Story structure and recall. Cognitive psychology, 9(1):111–151, 1977.

  41. [41]

    Forgetting and the law of disuse.Psychological review, 39(4):352, 1932

    John A McGeoch. Forgetting and the law of disuse.Psychological review, 39(4):352, 1932

  42. [42]

    Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

    Junbo Niu, Yifei Li, Ziyang Miao, Chunjiang Ge, Yuanhang Zhou, Qihao He, Xiaoyi Dong, Haodong Duan, Shuangrui Ding, Rui Qian, et al. Ovo-bench: How far is your video-llms from real-world online video understanding? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18902–18913, 2025

  43. [43]

    Gpt-5.4 thinking system card, March 2026

    OpenAI. Gpt-5.4 thinking system card, March 2026. URL https://openai.com/ zh-Hans-CN/index/introducing-gpt-5-4/. Accessed: 2026-05-02

  44. [44]

    N-back working memory paradigm: A meta-analysis of normative functional neuroimaging studies. Human brain mapping, 25(1):46–59, 2005

    Adrian M Owen, Kathryn M McMillan, Angela R Laird, and Ed Bullmore. N-back working memory paradigm: A meta-analysis of normative functional neuroimaging studies. Human brain mapping, 25(1):46–59, 2005

  45. [45]

    Space and time in episodic memory: Effects of linearity and directionality on memory for spatial location and temporal order in children and adults. PLoS One, 13(11):e0206999, 2018

    Thanujeni Pathman, Christine Coughlin, and Simona Ghetti. Space and time in episodic memory: Effects of linearity and directionality on memory for spatial location and temporal order in children and adults. PLoS One, 13(11):e0206999, 2018

  46. [46]

    What the mind’s eye tells the mind’s brain: A critique of mental imagery

    Zenon W Pylyshyn. What the mind’s eye tells the mind’s brain: A critique of mental imagery. Psychological bulletin, 80(1):1, 1973

  47. [47]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id=qwen3.5. Accessed: 2026-05-02

  48. [48]

    Some factors determining the degree of retroactive inhibition

    Edward Stevens Robinson. Some factors determining the degree of retroactive inhibition. Psychological Monographs, 28(6):i, 1920

  49. [49]

    Creating false memories: Remembering words not presented in lists. Journal of experimental psychology: Learning, Memory, and Cognition, 21(4):803, 1995

    Henry L Roediger and Kathleen B McDermott. Creating false memories: Remembering words not presented in lists. Journal of experimental psychology: Learning, Memory, and Cognition, 21(4):803, 1995

  50. [50]

    Retrieval without recollection: An experimental analysis of source amnesia. Journal of verbal learning and verbal behavior, 23(5):593–611, 1984

    Daniel L Schacter, Joanne L Harbluk, and Donald R McLachlan. Retrieval without recollection: An experimental analysis of source amnesia. Journal of verbal learning and verbal behavior, 23(5):593–611, 1984

  51. [51]

    The Cambridge handbook of working memory and language. Cambridge University Press, 2022

    John W Schwieter and Zhisheng Edward Wen. The Cambridge handbook of working memory and language. Cambridge University Press, 2022

  52. [52]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

  53. [53]

    Moviechat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

    Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, and Gaoang Wang. Moviechat+: Question-aware sparse memory for long video question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  54. [54]

    Counting-stars: A multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models

    Mingyang Song, Mao Zheng, and Xuan Luo. Counting-stars: A multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3753–3763, 2025

  55. [55]

    Reconvla: Reconstructive vision-language-action model as effective robot perceiver

    Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, and Haoang Li. Reconvla: Reconstructive vision-language-action model as effective robot perceiver. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 18549–18557, 2026

  56. [56]

    Membench: Towards more comprehensive evaluation on the memory of llm-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  57. [57]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024

  58. [58]

    Neurophysiological distinctions between spatial and temporal context in episodic memory. International Journal of Psychophysiology, page 113302, 2025

    César Torres-Morales and Selene Cansino. Neurophysiological distinctions between spatial and temporal context in episodic memory. International Journal of Psychophysiology, page 113302, 2025

  59. [59]

    Illusory conjunctions in the perception of objects. Cognitive psychology, 14(1):107–141, 1982

    Anne Treisman and Hilary Schmidt. Illusory conjunctions in the perception of objects. Cognitive psychology, 14(1):107–141, 1982

  60. [60]

    A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980

    Anne M Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive psychology, 12(1):97–136, 1980

  61. [61]

    Interference and forgetting. Psychological review, 64(1):49, 1957

    Benton J Underwood. Interference and forgetting. Psychological review, 64(1):49, 1957

  62. [62]

    Time Blindness: Why Video-Language Models Can't See What Humans Can?

    Ujjwal Upadhyay, Mukul Ranjan, Zhiqiang Shen, and Mohamed Elhoseiny. Time blindness: Why video-language models can’t see what humans can? arXiv preprint arXiv:2505.24867, 2025

  63. [63]

    Symbolic working memory enhances language models for complex rule application

    Siyuan Wang, Zhongyu Wei, Yejin Choi, and Xiang Ren. Symbolic working memory enhances language models for complex rule application. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17583–17604, 2024

  64. [64]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958–22967, 2025

  65. [65]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025

  66. [66]

    Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H Chi, et al. Evo-memory: Benchmarking llm agent test-time learning with self-evolving memory. arXiv preprint arXiv:2511.20857, 2025

  67. [67]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024

  68. [68]

    Video-levelgauge: Investigating contextual positional bias in large video language models

    Hou Xia, Zheren Fu, Fangcan Ling, Jiajun Li, Yi Tu, Zhendong Mao, and Yongdong Zhang. Video-levelgauge: Investigating contextual positional bias in large video language models. arXiv preprint arXiv:2508.19650, 2025

  69. [69]

    Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468, 2025

    Haomiao Xiong, Zongxin Yang, Jiazuo Yu, Yunzhi Zhuge, Lu Zhang, Jiawen Zhu, and Huchuan Lu. Streaming video understanding and multi-round interaction with memory-enhanced knowledge. arXiv preprint arXiv:2501.13468, 2025

  70. [70]

    Egolife: Towards egocentric life assistant

    Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, et al. Egolife: Towards egocentric life assistant. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 28885–28900, 2025

  71. [71]

    Cambrian-S: Towards Spatial Supersensing in Video

    Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis L. Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-S: Towards spatial supersensing in video. arXiv preprint arXiv:2511.04670, 2025

  72. [72]

    Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810, 2025

    Zhenyu Yang, Yuhang Hu, Zemin Du, Dizhan Xue, Shengsheng Qian, Jiahong Wu, Fan Yang, Weiming Dong, and Changsheng Xu. Svbench: A benchmark with temporal multi-turn dialogues for streaming video understanding. arXiv preprint arXiv:2502.10810, 2025

  73. [73]

    Temporal associations and prior-list intrusions in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(4):792, 2006

    Franklin M Zaromb, Marc W Howard, Emily D Dolan, Yevgeniy B Sirotin, Michele Tully, Arthur Wingfield, and Michael J Kahana. Temporal associations and prior-list intrusions in free recall. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32(4):792, 2006

  74. [74]

    Working memory identifies reasoning limits in language models

    Chunhui Zhang, Yiren Jian, Zhongyu Ouyang, and Soroush Vosoughi. Working memory identifies reasoning limits in language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16896–16922, 2024

  75. [75]

    A survey on the memory mechanism of large language model-based agents

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025

  76. [76]

    Needle in a video haystack: A scalable synthetic evaluator for video mllms. arXiv preprint arXiv:2406.09367, 2024

    Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, Weipeng Chen, and Jing Liu. Needle in a video haystack: A scalable synthetic evaluator for video mllms. arXiv preprint arXiv:2406.09367, 2024

  77. [77]

    Lifelongagentbench: Evaluating llm agents as lifelong learners. arXiv preprint arXiv:2505.11942, 2025

    Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, and Qianli Ma. Lifelongagentbench: Evaluating llm agents as lifelong learners. arXiv preprint arXiv:2505.11942, 2025

  78. [78]

    Mlvu: Benchmarking multi-task long video understanding

    Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Zhengyang Liang, Shitao Xiao, Minghao Qin, Xi Yang, Yongping Xiong, Bo Zhang, et al. Mlvu: Benchmarking multi-task long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13691–13701, 2025

  79. [79]

    X-lebench: A benchmark for extremely long egocentric video understanding. arXiv preprint arXiv:2501.06835, 2025

    Wenqi Zhou, Kai Cao, Hao Zheng, Yunze Liu, Xinyi Zheng, Miao Liu, Per Ola Kristensson, Walterio Mayol-Cuevas, Fan Zhang, Weizhe Lin, et al. X-lebench: A benchmark for extremely long egocentric video understanding. arXiv preprint arXiv:2501.06835, 2025

  80. [80]

    Cvbench: Evaluating cross-video synergies for complex multimodal understanding and reasoning. arXiv preprint arXiv:2508.19542, 2025

    Nannan Zhu, Yonghao Dong, Teng Wang, Xueqian Li, Shengjun Deng, Yijia Wang, Zheng Hong, Tiantian Geng, Guo Niu, Hanyan Huang, et al. Cvbench: Evaluating cross-video synergies for complex multimodal understanding and reasoning. arXiv preprint arXiv:2508.19542, 2025
