
arXiv: 2605.15128 · v1 · pith:KZNUJZR5new · submitted 2026-05-14 · 💻 cs.CV · cs.CL · cs.IR

MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Pith reviewed 2026-06-30 21:00 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.IR
keywords multimodal agent memory · visual evaluation · long-term memory · state change reasoning · benchmark · VLM evaluation · visual evidence · memory architecture

The pith

MemEye shows multimodal agents fail to preserve fine-grained visual details for state-change reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemEye to test whether long-term multimodal agent memory actually retains the visual evidence required for later reasoning. Prior evaluations often permitted answers from captions or text alone, leaving out cases that demand tracking changing visual states. The framework organizes evaluation along two axes: granularity of visual evidence needed, from scene level down to pixels, and complexity of evidence use, from single items to synthesis across time. A benchmark of eight life-scenario tasks includes gates that verify answerability, shortcut resistance, visual necessity, and reasoning structure. Tests of thirteen memory methods on four VLM backbones indicate that existing systems still lose fine visual details and cannot reliably follow state evolution.

Core claim

MemEye measures memory along visual-evidence granularity and synthesis requirements; when applied to current architectures it shows they struggle to preserve pixel-level details and to reason about evolutionary state changes over time.

What carries the argument

MemEye framework, which scores memory by the granularity of decisive visual evidence (scene-level to pixel-level) and by the type of evidence synthesis required (single evidence to evolutionary synthesis).
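
A minimal sketch of how results could be broken down along the two axes, assuming each benchmark item carries one label per axis. The `Item` class, the label strings, and `accuracy_by_cell` are hypothetical names for illustration; the abstract names only the endpoints of each axis (scene-level to pixel-level, single evidence to evolutionary synthesis), so no intermediate levels are assumed here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One scored benchmark question (hypothetical record layout)."""
    granularity: str  # axis 1 label, e.g. "scene" or "pixel"
    synthesis: str    # axis 2 label, e.g. "single" or "evolutionary"
    correct: bool     # whether the memory system answered correctly


def accuracy_by_cell(items):
    """Return accuracy for every (granularity, synthesis) cell that occurs."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [n_correct, n_total]
    for item in items:
        cell = (item.granularity, item.synthesis)
        totals[cell][0] += int(item.correct)
        totals[cell][1] += 1
    return {cell: c / n for cell, (c, n) in totals.items()}


# Toy data, invented purely to exercise the function.
items = [
    Item("scene", "single", True),
    Item("scene", "single", True),
    Item("pixel", "single", True),
    Item("pixel", "single", False),
    Item("pixel", "evolutionary", False),
]
grid = accuracy_by_cell(items)
```

Reporting per cell rather than one pooled score is what would let a reader see whether failures concentrate at the fine-grained or the evolutionary end of the grid.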

If this is right

  • Effective long-term multimodal memory requires explicit mechanisms for routing fine visual evidence.
  • Temporal tracking of visual state changes must be strengthened in memory architectures.
  • Detail extraction from stored visuals remains a primary performance bottleneck.
  • Future benchmarks should adopt similar gates to block textual shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that compress and index visual patches at multiple scales may reduce the observed detail loss.
  • The same evaluation structure could be applied to video or embodied-agent tasks to test generalization.
  • Training objectives that explicitly penalize loss of pixel-level information could be derived from the framework's axes.

Load-bearing premise

The ablation-driven validation gates in the benchmark correctly force questions to require stored visual evidence rather than allowing answers from captions or textual traces.

What would settle it

If agents achieve the same accuracy on the benchmark questions when visuals are withheld or replaced by captions alone, the claim that fine-grained visual preservation is necessary would be falsified.
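
The check described above can be sketched as a comparison of accuracy under two conditions. This is an illustration under stated assumptions, not the paper's protocol: exact-match scoring, the function names, and the toy answers are all hypothetical.

```python
def accuracy(predictions, gold):
    """Exact-match accuracy; assumes answers are already normalized strings."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and aligned")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def visual_gap(full_preds, caption_only_preds, gold):
    """Accuracy with stored visuals minus accuracy with captions only."""
    return accuracy(full_preds, gold) - accuracy(caption_only_preds, gold)


# Toy answers, invented for illustration.
gold = ["red", "left", "3", "open"]
full_preds = ["red", "left", "3", "closed"]           # agent sees stored images
caption_only_preds = ["red", "right", "2", "closed"]  # images replaced by captions

gap = visual_gap(full_preds, caption_only_preds, gold)
```

A gap near zero on the real benchmark would correspond to the falsifying outcome described above; how large a gap counts as meaningful is a judgment the abstract does not specify.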

Original abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MemEye, a visual-centric evaluation framework for multimodal agent memory. It defines two dimensions: granularity of decisive visual evidence (scene-level to pixel-level) and how retrieved evidence is used (single evidence to evolutionary synthesis). The authors construct a benchmark with 8 life-scenario tasks incorporating ablation-driven validation gates to ensure visual necessity and shortcut resistance. They evaluate 13 memory methods across 4 VLM backbones and conclude that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

Significance. If the benchmark's validation gates successfully enforce that questions require visual evidence rather than textual shortcuts, this work would provide a valuable new standard for assessing long-term memory in multimodal agents. It highlights specific architectural limitations in evidence routing, temporal tracking, and detail extraction, which could guide future research in the field.

major comments (2)
  1. [Abstract] The abstract claims that the ablation-driven validation gates assess answerability, shortcut resistance, visual necessity, and reasoning structure, but it supplies no quantitative pass rates, no explicit ablation protocol details (such as caption-only, trace-only, or full-removal conditions), and no validation evidence. This is load-bearing for the central claim, because poor performance on fine-grained or evolutionary items could stem from incomplete resistance to textual shortcuts rather than from true memory failure.
  2. [Evaluation (implied from abstract)] The results from evaluating 13 methods on 4 VLM backbones are presented without methodological details on data construction steps, how the memory methods are adapted, or the specific metrics used, making it impossible to assess the support for the findings.
minor comments (1)
  1. [Abstract] The phrasing 'life-scenario tasks' is vague; a more precise description of the 8 tasks would improve clarity.
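
The pass-rate reporting requested in major comment 1 could be as simple as the following sketch. The four gate names come from the abstract; the per-question boolean record, the `gate_pass_rates` function, and the toy outcomes are hypothetical, and how each gate is actually decided is not described in the text reviewed here.

```python
GATES = (
    "answerability",
    "shortcut_resistance",
    "visual_necessity",
    "reasoning_structure",
)


def gate_pass_rates(results):
    """Fraction of candidate questions passing each gate, and all gates jointly.

    `results` is a list of dicts mapping each gate name to True/False.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    rates = {g: sum(r[g] for r in results) / n for g in GATES}
    rates["all_gates"] = sum(all(r[g] for g in GATES) for r in results) / n
    return rates


# Toy gate outcomes for four candidate questions, invented for illustration.
results = [
    dict(answerability=True, shortcut_resistance=True, visual_necessity=True, reasoning_structure=True),
    dict(answerability=True, shortcut_resistance=False, visual_necessity=True, reasoning_structure=True),
    dict(answerability=True, shortcut_resistance=True, visual_necessity=False, reasoning_structure=True),
    dict(answerability=True, shortcut_resistance=True, visual_necessity=True, reasoning_structure=True),
]
rates = gate_pass_rates(results)
```

Publishing the joint rate alongside the per-gate rates would show how many candidate questions survive the full filter, which is the figure a reader needs to judge shortcut resistance.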

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in validation evidence and methodological details. We will revise the manuscript accordingly to strengthen these aspects.

Point-by-point responses
  1. Referee: [Abstract] The abstract claims that the ablation-driven validation gates assess answerability, shortcut resistance, visual necessity, and reasoning structure, but it supplies no quantitative pass rates, no explicit ablation protocol details (such as caption-only, trace-only, or full-removal conditions), and no validation evidence. This is load-bearing for the central claim, because poor performance on fine-grained or evolutionary items could stem from incomplete resistance to textual shortcuts rather than from true memory failure.

    Authors: We agree that the abstract should provide quantitative validation evidence to support the central claims. In the revision, we will expand the abstract to include pass rates for the ablation conditions (caption-only, trace-only, and full removal) along with a concise description of the protocol. We will also ensure the main text includes the full validation results and evidence. revision: yes

  2. Referee: [Evaluation (implied from abstract)] The results from evaluating 13 methods on 4 VLM backbones are presented without methodological details on data construction steps, how the memory methods are adapted, or the specific metrics used, making it impossible to assess the support for the findings.

    Authors: We acknowledge that additional methodological details are required for reproducibility and assessment of the findings. In the revised manuscript, we will expand the relevant sections to describe the data construction steps in detail, how each of the 13 memory methods was adapted to the four VLM backbones, and the exact metrics employed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no self-referential derivations

Full rationale

The paper presents an empirical evaluation framework and benchmark for multimodal agent memory. It constructs tasks, applies ablation-driven validation gates, and reports performance of 13 methods across 4 backbones. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claims rest on external benchmark results rather than reducing to self-defined quantities or prior author work by construction. This matches the default expectation of no significant circularity for evaluation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5784 in / 969 out tokens · 30695 ms · 2026-06-30T21:00:29.371220+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

    cs.AI 2026-05 unverdicted novelty 7.0

    EEG study of 27 participants reveals distinct neural patterns for AI-generated hallucinations, with misjudged ones failing to trigger standard fact verification pathways.

  2. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

    cs.AI 2026-05 unverdicted novelty 6.0

    EEG study reveals distinct ERP patterns for AI hallucinations, with misjudged ones failing to trigger standard neurocognitive verification pathways.

Reference graph

Works this paper leans on

50 extracted references · 32 canonical work pages · cited by 1 Pith paper · 8 internal anchors
