MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Boxuan Zhang; Danrui Li; Dimitris N. Metaxas; Jiang Liu; Liwei Che; Mengdi Wang; Minghao Guo; Mubbasir Kapadia; Qingyue Jiao; Ruixiang Tang

arxiv: 2605.15128 · v1 · pith:KZNUJZR5new · submitted 2026-05-14 · 💻 cs.CV · cs.CL · cs.IR

Minghao Guo, Qingyue Jiao, Zeru Shi, Yihao Quan, Boxuan Zhang, Danrui Li, Liwei Che, Wujiang Xu, Shilong Liu, Zirui Liu, Mubbasir Kapadia, Vladimir Pavlovic, Jiang Liu, Mengdi Wang, Yiyu Shi, Dimitris N. Metaxas, Ruixiang Tang

Pith reviewed 2026-06-30 21:00 UTC · model grok-4.3

classification 💻 cs.CV cs.CL cs.IR

keywords multimodal agent memory, visual evaluation, long-term memory, state change reasoning, benchmark, VLM evaluation, visual evidence, memory architecture

The pith

MemEye shows multimodal agents fail to preserve fine-grained visual details for state-change reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemEye to test whether long-term multimodal agent memory actually retains the visual evidence required for later reasoning. Prior evaluations often permitted answers from captions or text alone, leaving out cases that demand tracking changing visual states. The framework organizes evaluation along two axes: granularity of visual evidence needed, from scene level down to pixels, and complexity of evidence use, from single items to synthesis across time. A benchmark of eight life-scenario tasks includes gates that verify answerability, shortcut resistance, visual necessity, and reasoning structure. Tests of thirteen memory methods on four VLM backbones indicate that existing systems still lose fine visual details and cannot reliably follow state evolution.

Core claim

MemEye measures memory along visual-evidence granularity and synthesis requirements; when applied to current architectures it shows they struggle to preserve pixel-level details and to reason about evolutionary state changes over time.

What carries the argument

MemEye framework, which scores memory by the granularity of decisive visual evidence (scene-level to pixel-level) and by the type of evidence synthesis required (single evidence to evolutionary synthesis).
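
As a rough illustration of how results might be organized along the two axes, the Python sketch below groups per-question outcomes into a granularity-by-synthesis grid. The endpoint labels (`scene`, `pixel`, `single`, `evolutionary`) follow the description above; the intermediate labels `region` and `multi`, the `Result` class, and the `accuracy_grid` function are hypothetical names chosen for illustration, not the paper's taxonomy or code.

```python
from collections import defaultdict
from dataclasses import dataclass

# Endpoint labels come from the text; "region" and "multi" are
# hypothetical placeholders for intermediate levels.
GRANULARITY = ("scene", "region", "pixel")
SYNTHESIS = ("single", "multi", "evolutionary")


@dataclass(frozen=True)
class Result:
    granularity: str  # how fine the decisive visual evidence is
    synthesis: str    # how the retrieved evidence must be combined
    correct: bool     # whether the agent answered correctly


def accuracy_grid(results):
    """Return {(granularity, synthesis): accuracy} for each cell with data."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        if r.granularity not in GRANULARITY or r.synthesis not in SYNTHESIS:
            raise ValueError(f"unknown cell: {r.granularity}, {r.synthesis}")
        cell = (r.granularity, r.synthesis)
        totals[cell] += 1
        hits[cell] += int(r.correct)
    return {cell: hits[cell] / totals[cell] for cell in totals}


# Toy outcomes, invented for illustration only.
demo = [
    Result("scene", "single", True),
    Result("scene", "single", True),
    Result("pixel", "evolutionary", False),
    Result("pixel", "evolutionary", True),
]
grid = accuracy_grid(demo)
```

Reporting accuracy per cell rather than as one aggregate is what would let a reader see whether failures concentrate at the pixel-level and evolutionary end of the grid.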

If this is right

Effective long-term multimodal memory requires explicit mechanisms for routing fine visual evidence.
Temporal tracking of visual state changes must be strengthened in memory architectures.
Detail extraction from stored visuals remains a primary performance bottleneck.
Future benchmarks should adopt similar gates to block textual shortcuts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures that compress and index visual patches at multiple scales may reduce the observed detail loss.
The same evaluation structure could be applied to video or embodied-agent tasks to test generalization.
Training objectives that explicitly penalize loss of pixel-level information could be derived from the framework's axes.

Load-bearing premise

The ablation-driven validation gates in the benchmark correctly force questions to require stored visual evidence rather than allowing answers from captions or textual traces.

What would settle it

If agents achieve the same accuracy on the benchmark questions when visuals are withheld or replaced by captions alone, the claim that fine-grained visual preservation is necessary would be falsified.
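
The falsification test described above amounts to comparing accuracy under two conditions on the same questions. A minimal Python sketch, assuming per-question correctness flags are available for both conditions (the function names and toy data are hypothetical, not taken from the paper):

```python
def accuracy(flags):
    """Fraction of True values; raises on an empty list."""
    if not flags:
        raise ValueError("no results")
    return sum(flags) / len(flags)


def visual_necessity_gap(with_visuals, captions_only):
    """Accuracy with stored visuals minus accuracy with captions alone."""
    if len(with_visuals) != len(captions_only):
        raise ValueError("conditions must cover the same questions")
    return accuracy(with_visuals) - accuracy(captions_only)


# Toy per-question correctness flags, invented for illustration only.
full = [True, True, False, True]
caps = [False, True, False, False]
gap = visual_necessity_gap(full, caps)
```

A gap near zero would suggest that captions suffice and that the visual-necessity claim does not hold for that question set; a large positive gap is consistent with it.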

Original abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MemEye introduces a two-axis benchmark to test if agents retain fine visual details instead of text shortcuts, but the abstract leaves the gate validation thin.

The main point is that this paper builds a benchmark meant to force multimodal agents to keep actual visual evidence rather than falling back on captions or traces. They split evaluation into granularity of the visual evidence needed and how much synthesis over time is required, then created eight life-scenario tasks with ablation checks for answerability and shortcut resistance.

What stands out is the explicit attempt to close the loophole where prior tests let models succeed without preserving pixels or tracking state changes. Running thirteen memory methods across four VLM backbones and reporting consistent weakness on fine-grained and evolutionary items gives a concrete picture of current limits. That part is straightforward and useful for anyone building or evaluating agent memory.

The soft spot is the validation of those ablation gates. The abstract describes them as checking visual necessity and shortcut resistance, yet supplies no pass rates, no breakdown of caption-only versus trace-only conditions, and no quantitative confirmation that the questions truly cannot be answered from text alone. If the gates are incomplete, the headline result about struggling with details and state changes could partly reflect benchmark leakage rather than memory failure. The full paper may address this, but the provided description does not.
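
To make concrete what the missing numbers could look like, the Python sketch below computes a pass rate for each ablation condition. It assumes a question passes a gate when the ablated model fails to answer it. That pass criterion, the condition names `caption_only`, `trace_only`, and `full_removal`, and the record layout are all assumptions for illustration; the abstract does not specify the protocol.

```python
# Hypothetical condition names; the abstract does not list the protocol.
CONDITIONS = ("caption_only", "trace_only", "full_removal")


def gate_pass_rates(records):
    """records: list of dicts mapping condition -> True if the ablated
    model still answered correctly. A question passes a gate when the
    ablated model gets it wrong (assumed criterion)."""
    if not records:
        raise ValueError("no records")
    n = len(records)
    rates = {}
    for cond in CONDITIONS:
        rates[cond] = sum(1 for r in records if not r[cond]) / n
    # Fraction of questions that resist every ablation at once.
    rates["all_gates"] = sum(
        1 for r in records if not any(r[c] for c in CONDITIONS)
    ) / n
    return rates


# Toy records, invented for illustration only.
records = [
    {"caption_only": False, "trace_only": False, "full_removal": False},
    {"caption_only": True, "trace_only": False, "full_removal": False},
    {"caption_only": False, "trace_only": False, "full_removal": False},
    {"caption_only": False, "trace_only": True, "full_removal": False},
]
rates = gate_pass_rates(records)
```

The `all_gates` figure is the one that bears on benchmark leakage, since a question answerable under any single text-only condition is a potential shortcut.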

This is for researchers working on long-term multimodal agents or evaluation protocols. A reader who needs concrete tasks to measure visual retention will find usable material here. It deserves peer review because the framing is clear, the scale of the evaluation is reasonable, and the core problem it targets is real, even if the gate evidence needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper introduces MemEye, a visual-centric evaluation framework for multimodal agent memory. It defines two dimensions: granularity of decisive visual evidence (scene-level to pixel-level) and how retrieved evidence is used (single evidence to evolutionary synthesis). The authors construct a benchmark with 8 life-scenario tasks incorporating ablation-driven validation gates to ensure visual necessity and shortcut resistance. They evaluate 13 memory methods across 4 VLM backbones and conclude that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.

Significance. If the benchmark's validation gates successfully enforce that questions require visual evidence rather than textual shortcuts, this work would provide a valuable new standard for assessing long-term memory in multimodal agents. It highlights specific architectural limitations in evidence routing, temporal tracking, and detail extraction, which could guide future research in the field.

major comments (2)

[Abstract] The abstract claims that the ablation-driven validation gates assess answerability, shortcut resistance, visual necessity, and reasoning structure, but supplies no quantitative pass rates, explicit ablation protocol details (such as caption-only, trace-only, or full removal conditions), or validation evidence. This is load-bearing for the central claim, as poor performance on fine-grained or evolutionary items could stem from incomplete resistance to textual shortcuts rather than true memory failure.
[Evaluation (implied from abstract)] The results from evaluating 13 methods on 4 VLM backbones are presented without methodological details on data construction steps, how the memory methods are adapted, or the specific metrics used, making it impossible to assess the support for the findings.

minor comments (1)

[Abstract] The phrasing 'life-scenario tasks' is vague; a more precise description of the 8 tasks would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in validation evidence and methodological details. We will revise the manuscript accordingly to strengthen these aspects.

Referee: [Abstract] The abstract claims that the ablation-driven validation gates assess answerability, shortcut resistance, visual necessity, and reasoning structure, but supplies no quantitative pass rates, explicit ablation protocol details (such as caption-only, trace-only, or full removal conditions), or validation evidence. This is load-bearing for the central claim, as poor performance on fine-grained or evolutionary items could stem from incomplete resistance to textual shortcuts rather than true memory failure.

Authors: We agree that the abstract should provide quantitative validation evidence to support the central claims. In the revision, we will expand the abstract to include pass rates for the ablation conditions (caption-only, trace-only, and full removal) along with a concise description of the protocol. We will also ensure the main text includes the full validation results and evidence.
revision: yes

Referee: [Evaluation (implied from abstract)] The results from evaluating 13 methods on 4 VLM backbones are presented without methodological details on data construction steps, how the memory methods are adapted, or the specific metrics used, making it impossible to assess the support for the findings.

Authors: We acknowledge that additional methodological details are required for reproducibility and assessment of the findings. In the revised manuscript, we will expand the relevant sections to describe the data construction steps in detail, how each of the 13 memory methods was adapted to the four VLM backbones, and the exact metrics employed.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no self-referential derivations

The paper presents an empirical evaluation framework and benchmark for multimodal agent memory. It constructs tasks, applies ablation-driven validation gates, and reports performance of 13 methods across 4 backbones. No equations, fitted parameters, predictions derived from inputs, or self-citation chains are present in the provided text. The central claims rest on external benchmark results rather than reducing to self-defined quantities or prior author work by construction. This matches the default expectation of no significant circularity for evaluation papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
cs.AI 2026-05 unverdicted novelty 7.0

EEG study of 27 participants reveals distinct neural patterns for AI-generated hallucinations, with misjudged ones failing to trigger standard fact verification pathways.

How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study
cs.AI 2026-05 unverdicted novelty 6.0

EEG study reveals distinct ERP patterns for AI hallucinations, with misjudged ones failing to trigger standard neurocognitive verification pathways.


[1] [1]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Mem-Gallery: Benchmarking multimodal long-term conversational memory for MLLM agents, 2026

Yuanchen Bei, Tianxin Wei, Xuying Ning, Yanjun Zhao, Zhining Liu, Xiao Lin, Yada Zhu, Hendrik Hamann, Jingrui He, and Hanghang Tong. Mem-gallery: Benchmarking multimodal long-term conver- sational memory for mllm agents.arXiv preprint arXiv:2601.03515, 2026

work page arXiv 2026

[3] [3]

Visual long-term memory has a massive storage capacity for object details.Proceedings of the National Academy of Sciences, 105(38): 14325–14329, 2008

Timothy F Brady, Talia Konkle, George A Alvarez, and Aude Oliva. Visual long-term memory has a massive storage capacity for object details.Proceedings of the National Academy of Sciences, 105(38): 14325–14329, 2008

2008

[4] [4]

Harmonyguard: Toward safety and utility in web agents via adaptive policy enhancement and dual- objective optimization, 2025

Yurun Chen, Xavier Hu, Yuhan Liu, Keting Yin, Juncheng Li, Zhuosheng Zhang, and Shengyu Zhang. Harmonyguard: Toward safety and utility in web agents via adaptive policy enhancement and dual- objective optimization, 2025. URLhttps://arxiv.org/abs/2508.04010

work page arXiv 2025

[5] [5]

Yurun Chen, Xueyu Hu, Keting Yin, Juncheng Li, and Shengyu Zhang. Evaluating the robustness of multimodal agents against active environmental injection attacks. In Proceedings of the 33rd ACM International Conference on Multimedia, MM ’25, page 11648–11656, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400720352. doi: 10.1145/37460...

doi: 10.1145/3746027.3755646, 2025

[6] [6]

Yurun Chen, Zeyi Liao, Ping Yin, Taotao Xie, Keting Yin, and Shengyu Zhang. Safepred: A predictive guardrail for computer-using agents via world models, 2026. URL https://arxiv.org/abs/2602.01725

[7] [7]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[8] [8]

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M.F. Moura, Devi Parikh, and Dhruv Batra. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[9] [9]

Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, and Yiqun Liu. Twinvoice: A multi-dimensional benchmark towards digital twins via LLM persona simulation. arXiv preprint arXiv:2510.25536, 2025.

[10] [10]

Junyu Feng, Binxiao Xu, Jiayi Chen, Mengyu Dai, Cenyang Wu, Haodong Li, Bohan Zeng, Yunliu Xie, Hao Liang, Ming Lu, and Wentao Zhang. M2A: Multimodal memory agent with dual-layer hybrid memory for long-term personalized interactions. arXiv preprint arXiv:2602.07624, 2026.

[11] [11]

Google. Gemini API model documentation. https://ai.google.dev/gemini-api/docs/models,

[12] [13]

Minghao Guo, Ziyi Ye, Wujiang Xu, Xi Zhu, Wenyue Hua, and Dimitris N. Metaxas. Individual Turing test: A case study of LLM-based simulation using longitudinal personal data, 2026. URL https://arxiv.org/abs/2603.01289

[13] [14]

Minghao Guo, Qingcheng Zeng, Xujiang Zhao, Yanchi Liu, Wenchao Yu, Mengnan Du, Haifeng Chen, and Wei Cheng. DeepSieve: Information sieving via LLM-as-a-knowledge-router. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors, Findings of the Association for Computational Linguistics: EACL 2026, pages 3054–3077, Rabat, Morocco, March 2026. Association fo...

doi: 10.18653/v1/2026.findings-eacl.160, 2026

[14] [15]

Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=DT7JyQC3MR

[15] [16]

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

arXiv 2026

[16] [17]

Zaeem Hussain, Mingda Zhang, Xiaozhong Zhang, Keren Ye, Christopher Thomas, Zuha Agber, Ralph Olen, and Adriana Kovashka. Automatic understanding of image and video advertisements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[17] [18]

Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25961–25970, Suzhou,...

doi: 10.18653/v1/2025.emnlp-main.1318, 2025

[18] [19]

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[19] [20]

Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus Rohrbach. CLEVR-dialog: A diagnostic dataset for multi-round reasoning in visual dialog. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technol...

doi: 10.18653/v1/n19-1058, 2019

[20] [21]

Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, Bowon Ko, and Ho-Jin Choi. Multiverse: A multi-turn conversation benchmark for evaluating large vision and language models. arXiv preprint arXiv:2510.16641, 2025.

[21] [22]

Aiden Yiliu Li, Xinyue Hao, Shilong Liu, and Mengdi Wang. Avenir-web: Human-experience-imitating multimodal web agents with mixture of grounding experts. arXiv preprint arXiv:2602.02468, 2026.

[22] [23]

Danrui Li, Sen Zhang, Samuel S. Sohn, Kaidong Hu, Muhammad Usman, and Mubbasir Kapadia. Cardiverse: Harnessing LLMs for novel card game prototyping. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 29735–29762, Suzhou, Ch...

2025

[23] [24]

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.1511. URL https://aclanthology.org/2025.emnlp-main.1511/

[24] [25]

Jiaqi Liu, Yaofeng Su, Peng Xia, Yiyang Zhou, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. SimpleMem: Efficient lifelong memory for LLM agents. arXiv preprint arXiv:2601.02553,

[25] [26]

URL https://github.com/aiming-lab/SimpleMem

[26] [27]

Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, and Huaxiu Yao. Omni-simplemem: Autoresearch-guided discovery of lifelong multimodal agent memory. arXiv preprint arXiv:2604.01007, 2026.

[27] [28]

Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, and Kaipeng Zhang. Convbench: a multi-turn conversation evaluation benchmark with hierarchical ablation capability for large vision-language models. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NI...

2024

[28] [29]

Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, et al. Mmdu: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs. arXiv preprint arXiv:2406.11833, 2024.

[29] [30]

Yihao Lu, Wanru Cheng, Zeyu Zhang, and Hao Tang. Mma: Multimodal memory agent. arXiv preprint arXiv:2602.16493, 2026.

[30] [31]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, Bangko...

[31] [32]

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.747. URL https://aclanthology.org/2024.acl-long.747/

[32] [33]

Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, and Bill Byrne. According to me: Long-term personalized referential memory QA. arXiv preprint arXiv:2603.01990, 2026. doi: 10.48550/arXiv.2603.01990. URL https://arxiv.org/abs/2603.01990

[33] [34]

Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, and Jiarong Jiang. R-wom: Retrieval-augmented world model for computer-use agents. arXiv preprint arXiv:2510.11892, 2025.

[34] [35]

OpenAI. OpenAI API model documentation. https://platform.openai.com/docs/models,

[35] [36]

Accessed: 2026-05-01

2026

[36] [37]

Jianhong Pang, Ruoxi Cheng, Ziyi Ye, Xingjun Ma, Zuxuan Wu, Xuanjing Huang, and Yu-Gang Jiang. Steering the verifiability of multimodal AI hallucinations. arXiv preprint arXiv:2604.06714, 2026.

[37] [38]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, UIST ’23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701320. doi:...

doi: 10.1145/3586183.3606763, 2023

[38] [39]

Zeru Shi, Kai Mei, Mingyu Jin, Yongye Su, Chaoji Zuo, Wenyue Hua, Wujiang Xu, Yujie Ren, Zirui Liu, Mengnan Du, Dong Deng, and Yongfeng Zhang. From commands to prompts: LLM-based semantic file system for AIOS. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=2G021ZqUEZ

[39] [40]

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.

[40] [41]

Turing Motors. Japan open driving dataset sample. https://huggingface.co/datasets/turing-motors/Japan-Open-Driving-Dataset-Sample, 2024. Accessed: 2026-05-01

[41] [42]

Yu Wang and Xi Chen. MIRIX: Multi-agent memory system for LLM-based agents. arXiv preprint arXiv:2507.07957, 2025.

[42] [43]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[43] [44]

Junlin Xie, Zhihong Chen, Ruifei Zhang, Xiang Wan, and Guanbin Li. Large multimodal agents: A survey. arXiv preprint arXiv:2402.15116, 2024.

[44] [45]

Yiweng Xie, Bo He, Junke Wang, Xiangyu Zheng, Ziyi Ye, and Zuxuan Wu. Fluxmem: Adaptive hierarchical memory for streaming video understanding. arXiv preprint arXiv:2603.02096, 2026.

[45] [46]

Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, et al. Crab: Cross-environment agent benchmark for multimodal language model agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 21607–21647, 2025.

[46] [47]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. In Advances in Neural Information Processing Systems, 2025.

[47] [48]

Wujiang Xu, Jiaojiao Han, Minghao Guo, Kai Mei, Xi Zhu, Han Zhang, and Dimitris N Metaxas. AEL: Agent evolving learning for open-ended environments. arXiv preprint arXiv:2604.21725, 2026.

[48] [49]

Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, and Yu Qiao. MMRC: A large-scale benchmark for understanding multimodal large language model in real-world conversation. In Wanxiang Che, Joyce Nabende, Ekaterina Sh...

doi: 10.18653/v1/2025.acl-long.1096, 2025

[49] [50]

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, and Yingcong Chen. Seed-story: Multimodal long story generation with large language model. In 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1871–1881, 2025. doi: 10.1109/ICCVW69036.2025.00197

[50] [51]

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Trans. Inf. Syst., 43(6), September 2025. ISSN 1046-8188. doi: 10.1145/3748302. URL https://doi.org/10.1145/3748302.