pith. machine review for the scientific record.

arxiv: 2605.08374 · v2 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Bo Tang, Feiyu Xiong, Haoting Shi, Jiaqian Wang, Junwei Liao, Muning Wen, Ruiwen Zhou, Shengtao Zhang, Weinan Zhang, Wei Zhang, Ying Wen, Zhiyu Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:58 UTC · model grok-4.3

classification 💻 cs.AI
keywords Q-learning · provenance DAG · episodic memory · LLM agents · TD(λ) · credit assignment · self-evolving memory · Exogenous-Context MDP

The pith

By propagating Q-learning credit along provenance DAGs, MemQ enables LLM agents to learn from memory dependency chains rather than isolated experiences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current memory methods for LLM agents evaluate each memory in isolation, missing how one memory helps create future ones. MemQ records these creation dependencies in a directed acyclic graph and applies eligibility traces to spread credit backward along the paths. The credit decays based on graph distance instead of time steps. This setup is modeled as an Exogenous-Context MDP that separates external tasks from internal memory evolution. Experiments across six benchmarks show consistent gains, especially in complex multi-step scenarios.
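The recording step, which memories were retrieved when each new memory was created, can be sketched as a small graph structure. This is a minimal illustration; the class and method names are ours, not the paper's API.

```python
from collections import defaultdict

class ProvenanceDAG:
    """Records which memories were retrieved when each new memory was created.

    Edges point from a parent memory to the memory whose creation it supported;
    depth along these edges drives credit decay. Sketch only: names are
    illustrative, not taken from the paper's code.
    """

    def __init__(self):
        self.parents = defaultdict(list)   # memory id -> ids retrieved at its creation
        self.children = defaultdict(list)  # reverse index, for forward traversal

    def record_creation(self, new_id, retrieved_ids):
        """Log that `retrieved_ids` were in context when `new_id` was created."""
        for pid in retrieved_ids:
            self.parents[new_id].append(pid)
            self.children[pid].append(new_id)

    def ancestors_with_depth(self, mem_id):
        """Yield (ancestor_id, depth) pairs via BFS over parent edges."""
        seen, frontier, depth = {mem_id}, [mem_id], 0
        while frontier:
            depth += 1
            nxt = []
            for m in frontier:
                for p in self.parents[m]:
                    if p not in seen:
                        seen.add(p)
                        nxt.append(p)
                        yield p, depth
            frontier = nxt
```

The BFS depth here is what the review calls graph distance: a memory retrieved to build the new memory sits at depth 1, a memory that helped build *that* one at depth 2, and so on.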

Core claim

The central claim is that applying TD(λ) eligibility traces to memory Q-values over a provenance DAG improves agent success rates. The DAG captures which memories were used to create new ones, allowing structural proximity to guide credit assignment with decay (γλ)^d. This replaces independent memory updates and leads to superior performance in generalization and online learning on tasks from OS interaction to expert QA.
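The decay rule can be sketched in a few lines. This is a hedged reconstruction from the abstract and the Figure 4 caption, not the paper's code: in particular, whether the TD error is computed once at the new memory or re-evaluated per ancestor is a detail the sketch guesses at.

```python
def memq_update(parents, q, new_id, reward, gamma=0.9, lam=0.8, alpha=0.1):
    """Sketch of a structural TD(lambda) update over a provenance DAG.

    `parents[m]` lists the memories retrieved when memory m was created;
    `q` maps memory id -> Q-value. The TD target bootstraps on
    gamma * Q(m_new) (cf. the paper's Eq. 5), and an ancestor at DAG
    depth d receives the error with weight (gamma * lam) ** d (cf. Eq. 6).
    Illustrative reconstruction; the paper's exact update may differ.
    """
    bootstrap = gamma * q.get(new_id, 0.0)
    frontier, depth, seen = [new_id], 0, {new_id}
    while frontier:
        depth += 1
        decay = (gamma * lam) ** depth   # credit decays with DAG depth, not time
        nxt = []
        for m in frontier:
            for p in parents.get(m, []):
                if p in seen:
                    continue
                seen.add(p)
                td_error = reward + bootstrap - q.get(p, 0.0)
                q[p] = q.get(p, 0.0) + alpha * decay * td_error
                nxt.append(p)
        frontier = nxt
    return q
```

On a chain a → b → c, an update at c moves Q(b) by a factor γλ more than Q(a): nearer ancestors absorb more credit, which is the structural analogue of an eligibility trace.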

What carries the argument

The provenance DAG recording dependency chains between memories, combined with TD(λ) eligibility traces applied to memory Q-values for credit propagation based on structural depth.

Load-bearing premise

The provenance DAG accurately captures the dependency chains through which memories enable the creation of future memories, making structural proximity a valid substitute for temporal credit assignment.

What would settle it

An ablation that randomizes the edges of the provenance DAG while keeping the same memories and the same retrievals. If the DAG structure drives the improvement, the gains should vanish under randomized edges; if they persist, the credit is coming from something other than the dependency structure.
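Assuming memory ids are logged in creation order, that control could be sketched as follows (hypothetical helper, not from the paper): each memory keeps its in-degree, but every parent pointer is rewired to a random earlier memory.

```python
import random

def shuffle_provenance(creation_log, seed=0):
    """Rewire provenance edges at random while preserving in-degrees.

    `creation_log` is a list of (new_id, parent_ids) tuples in creation
    order. Each parent pointer is replaced by a uniformly random earlier
    memory, so the DAG stays acyclic and each memory keeps the same number
    of parents. Hypothetical ablation helper, not from the paper.
    """
    rng = random.Random(seed)
    earlier, shuffled = [], []
    for new_id, parent_ids in creation_log:
        fake = [rng.choice(earlier) for _ in parent_ids] if earlier else []
        shuffled.append((new_id, fake))
        earlier.append(new_id)
    return shuffled
```

Rewiring only toward earlier memories keeps the graph a valid DAG, so the credit-propagation code runs unchanged and any performance difference is attributable to edge placement alone.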

Figures

Figures reproduced from arXiv: 2605.08374 by Bo Tang, Feiyu Xiong, Haoting Shi, Jiaqian Wang, Junwei Liao, Muning Wen, Ruiwen Zhou, Shengtao Zhang, Weinan Zhang, Wei Zhang, Ying Wen, Zhiyu Li.

Figure 1. High-level, conceptual illustration of MemQ.
Figure 2. The EC-MDP: the state factors into an exogenous task stream …
Figure 3. MemQ framework overview: the continuous learning loop features three stages, beginning with Retrieve …
Figure 4. Success rate under different γ. On the provenance DAG, the effective credit reach is governed by (γλ)^d, where d is the DAG depth (Eq. 6). Yet γ and λ play fundamentally different roles: γ controls the structural horizon by weighting the bootstrap target γQ(m_new) (Eq. 5), while λ controls the empirical horizon by decaying how far each observed TD error propagates (Eq. 6). We sweep each hyperparameter individually …
Figure 5. SR, TD error, TD variance, and TD bias under different …
Figure 6. Runtime learning dynamics (success rate vs. epoch) across six benchmarks.
Figure 7. Cumulative success rate (CSR) over epochs across six benchmarks, complementing the …
Figure 8. TD error under different γ on LiveCodeBench.
Figure 9. TD error under different γ on BFCL.
Figure 10. SR, TD error, TD variance, and TD bias under different …
Figure 11. SR (top row) and TD error (bottom row) under different …
Original abstract

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces MemQ, a method that augments episodic memory in LLM agents by applying TD(λ) eligibility traces to memory Q-values, with credit propagated backward along a provenance DAG that records retrieval dependencies at memory creation time. Credit decays as (γλ)^d where d is DAG depth, replacing temporal distance. The setting is formalized as an Exogenous-Context MDP (EC-MDP) whose factored transitions separate the exogenous task stream from the endogenous memory store. On six benchmarks (OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, expert QA), MemQ reports the highest success rates in both generalization and runtime learning evaluations, with the largest gains (+5.7 pp) on multi-step tasks producing deep relevant chains and the smallest (+0.77 pp) on single-step tasks.

Significance. If the reported ordering and differential gains hold under controlled conditions, the work supplies a concrete mechanism for credit assignment across memory dependency chains rather than treating memories independently. The alignment between gain magnitude and provenance depth provides direct empirical support for the structural-propagation hypothesis. Public code release is a clear strength that enables verification and extension.

major comments (1)
  1. §4 (Experiments): the central claim attributes performance gains to TD(λ) over the provenance DAG, yet the manuscript does not report whether the DAG construction procedure (including retrieval logging) is applied identically to all baselines or only to MemQ; if the latter, the comparison confounds the credit-propagation mechanism with differences in memory-graph construction.
minor comments (3)
  1. Abstract and §3.1: the EC-MDP factorization is presented as decoupling exogenous and endogenous components, but the text does not explicitly state whether memory retrieval can alter the exogenous task stream within a single step; a one-sentence clarification would remove the ambiguity.
  2. §5 (Parameter study): the interaction plots for γ and λ are useful, but the manuscript should add a short table reporting the exact (γ, λ) pairs used for the main results on each benchmark to aid reproducibility.
  3. §4.2 and figure captions: several success-rate tables list absolute percentages without standard deviations or the number of runs; adding these would strengthen the reported ordering.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment, recognition of the credit-assignment contribution, and recommendation for minor revision. We address the single major comment below.

point-by-point responses
  1. Referee: §4 (Experiments): the central claim attributes performance gains to TD(λ) over the provenance DAG, yet the manuscript does not report whether the DAG construction procedure (including retrieval logging) is applied identically to all baselines or only to MemQ; if the latter, the comparison confounds the credit-propagation mechanism with differences in memory-graph construction.

    Authors: The provenance DAG construction (including retrieval logging at memory creation) is an integral and MemQ-specific component; it is not applied to any baseline. All methods share an identical episodic memory buffer, embedding-based retrieval interface, and memory-creation pipeline. The only difference is that MemQ additionally records provenance edges and applies TD(λ) updates along them, while baselines follow their original independent-memory update rules (standard TD(0) or no eligibility traces). This isolates the structural credit-propagation mechanism. We will revise §4 and the experimental appendix to explicitly document the shared memory interface, confirm that retrieval logging occurs uniformly, and state that the DAG is MemQ-only. If desired, we can also add a controlled ablation in which baselines receive a dummy DAG without credit propagation. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces MemQ by formalizing an Exogenous-Context MDP and applying standard TD(λ) eligibility traces over a provenance DAG as modeling choices, then reports empirical success rates on six benchmarks. No equation or claim reduces by construction to a fitted parameter renamed as prediction, no self-citation chain is invoked to justify uniqueness or load-bearing assumptions, and the differential gains are presented as direct experimental outcomes rather than tautological redefinitions. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 2 invented entities

The method rests on two modeling inventions (provenance DAG and Exogenous-Context MDP) plus the standard RL parameters gamma and lambda; no machine-checked proofs or external benchmarks beyond the six reported tasks are mentioned.

free parameters (2)
  • gamma
    Discount factor in the TD(lambda) update; interacts with lambda and DAG depth.
  • lambda
    Eligibility trace decay parameter; controls how far credit propagates along the DAG.
axioms (1)
  • domain assumption: The setting can be formalized as an Exogenous-Context MDP whose factored transition decouples the exogenous task stream from the endogenous memory store.
    Invoked to justify the credit-propagation scheme; appears in the abstract description of the formalization.
invented entities (2)
  • Provenance DAG (no independent evidence)
    purpose: Records which memories were retrieved when each new memory was created, enabling structural credit propagation.
    New data structure introduced to replace temporal distance with DAG depth.
  • Exogenous-Context MDP (EC-MDP) (no independent evidence)
    purpose: Provides a factored MDP formulation that separates task dynamics from memory dynamics.
    New modeling construct used to ground the eligibility-trace application.
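The interaction between the two free parameters can be made concrete with a small calculation: the DAG depth at which the per-hop weight (γλ)^d falls below a threshold. This is illustrative arithmetic, not a quantity the paper reports.

```python
import math

def effective_depth(gamma, lam, threshold=0.01):
    """Smallest DAG depth d at which the credit weight (gamma*lam)**d
    drops below `threshold`: a rough 'credit horizon' for a (gamma, lam)
    pair. Illustrative helper, not from the paper."""
    return math.ceil(math.log(threshold) / math.log(gamma * lam))
```

For example, γ = 0.9 and λ = 0.8 give γλ = 0.72, so credit stays above 1% of the TD error out to depth 14 and only drops below it at depth 15, while γ = λ = 0.5 cuts the horizon to depth 4; this is the sense in which the two parameters jointly set how deep into a provenance chain learning can reach.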

pith-pipeline@v0.9.0 · 5576 in / 1431 out tokens · 48572 ms · 2026-05-13T06:58:27.138808+00:00 · methodology

