MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

Jinsong Su; Qinggang Zhang; Ruqin Ning; Yujie Lin; Yunbo Tang; Zerui Chen; Zhimin Wei; Zhishang Xiang

arxiv: 2607.01071 · v1 · pith:V26F2UKYnew · submitted 2026-07-01 · 💻 cs.IR · cs.AI

MemSyco-Bench: Benchmarking Sycophancy in Agent Memory

Zhishang Xiang , Zerui Chen , Yunbo Tang , Zhimin Wei , Ruqin Ning , Yujie Lin , Qinggang Zhang , Jinsong Su This is my paper

Pith reviewed 2026-07-02 06:28 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords sycophancyagent memoryLLM agentsbenchmarkmemory retrievalreasoningpersonalization

0 comments

The pith

MemSyco-Bench tests whether LLM agents over-align with user memory at the expense of factual accuracy or objective reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemSyco-Bench to evaluate memory-induced sycophancy in LLM-based agents. Prior benchmarks check only whether memories are stored, retrieved, or updated correctly, yet overlook how retrieved memories shape downstream decisions. The new benchmark uses five tasks to determine when memory should affect a decision and whether agents use only valid memory. A sympathetic reader would care because memory is positioned as essential for long-term agent collaboration, yet the same mechanism can produce over-alignment with users. If the benchmark works, it supplies a concrete way to measure and reduce this risk in agent systems.

Core claim

MemSyco-Bench measures when memory should influence a decision and how valid memory should be used. It does so through five tasks that assess whether agents reject memory as factual evidence, respect the applicable scope of memory, resolve conflicts between memory and objective evidence, track memory updates, and employ valid memory for personalization.

What carries the argument

MemSyco-Bench, a collection of five tasks that probe the influence of retrieved memory on agent reasoning and decision-making.

If this is right

Agent systems can be evaluated for cases where memory retrieval produces factual errors or biased decisions.
Developers gain a way to distinguish beneficial memory use from harmful over-alignment.
Benchmarks for agents can move beyond storage and retrieval metrics to include effects on reasoning.
Personalization features can be tested for reliance on valid rather than outdated or conflicting memory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could be extended to measure how sycophancy changes when memory size or retrieval frequency increases.
Results might inform design rules that limit memory influence to specific decision types.
The five tasks could serve as a starting point for automated tests that run during agent deployment.

Load-bearing premise

The five tasks accurately and comprehensively capture memory-induced sycophancy without introducing their own biases or missing important aspects of agent reasoning.

What would settle it

An experiment in which agents that pass all five tasks still exhibit sycophantic behavior in untested real-world scenarios involving memory retrieval.

Figures

Figures reproduced from arXiv: 2607.01071 by Jinsong Su, Qinggang Zhang, Ruqin Ning, Yujie Lin, Yunbo Tang, Zerui Chen, Zhimin Wei, Zhishang Xiang.

**Figure 2.** Figure 2: Effect of memory snippets on objective accuracy [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Error-cause analysis on existing memory bench [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: The construction framework of MemSyco-Bench. We first define memory-decision schemas [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Error attribution on MemSyco-Bench with Qwen3-8B. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Effect of reasoning behavioral guidance on [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Representative examples from MemSyco-Bench. Red memory cues denote retrieved histor [PITH_FULL_IMAGE:figures/full_fig_p019_7.png] view at source ↗

**Figure 8.** Figure 8: Error attribution on MemSyco-Bench with DeepSeek-V4-Flash. [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 13.** Figure 13: Overall, these cases indicate that memory failures are often post-retrieval decision failures rather than simple retrieval failures. The retrieved memory is usually relevant, but the model must still decide its role: whether it is an actionable preference, a soft constraint, a superseded profile, a transferable habit, or an inadmissible signal for the current task. Thus, improving agent memory requires me… view at source ↗

**Figure 9.** Figure 9: Error case of Retrieved constraints are not enough. [PITH_FULL_IMAGE:figures/full_fig_p034_9.png] view at source ↗

**Figure 10.** Figure 10: Error case of Memory should not override stronger evidence. [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗

**Figure 11.** Figure 11: Error case of Personal memory may not transfer. [PITH_FULL_IMAGE:figures/full_fig_p036_11.png] view at source ↗

**Figure 12.** Figure 12: Error case of Old memory can linger after an update. [PITH_FULL_IMAGE:figures/full_fig_p037_12.png] view at source ↗

**Figure 13.** Figure 13: Error case of Familiar memory should not become fact. [PITH_FULL_IMAGE:figures/full_fig_p038_13.png] view at source ↗

**Figure 14.** Figure 14: Rubric prompt for Objective Fact Judgment. [PITH_FULL_IMAGE:figures/full_fig_p039_14.png] view at source ↗

**Figure 15.** Figure 15: Rubric prompt for Contextual Scope Control. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_15.png] view at source ↗

**Figure 16.** Figure 16: Rubric prompt for Memory-Evidence Conflict. [PITH_FULL_IMAGE:figures/full_fig_p041_16.png] view at source ↗

**Figure 17.** Figure 17: Rubric prompt for Personalized Memory Use. [PITH_FULL_IMAGE:figures/full_fig_p042_17.png] view at source ↗

**Figure 18.** Figure 18: Rubric prompt for Valid Memory Selection. [PITH_FULL_IMAGE:figures/full_fig_p043_18.png] view at source ↗

read the original abstract

Memory has emerged as a cornerstone of modern LLM-based agents, supporting their evolution from single-turn assistants to long-term collaborators. However, memory is not always beneficial: retrieved memories often induce a critical issue of sycophancy, causing agents to over-align with the user at the cost of factual accuracy or objective reasoning. Despite this emerging risk, existing memory benchmarks primarily evaluate whether memories are correctly stored, retrieved, or updated, while overlooking how retrieved memories influence downstream reasoning and decision-making. To bridge this gap, we propose MemSyco-Bench, a comprehensive benchmark for evaluating memory-induced sycophancy in agent systems. MemSyco-Bench measures when memory should influence a decision and how valid memory should be used. Specifically, it covers five tasks that assess whether agents can reject memory as factual evidence, respect its applicable scope, resolve conflicts between memory and objective evidence, track memory updates, and use valid memory for personalization. All related resources are collected for the community at https://github.com/XMUDeepLIT/MemSyco-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MemSyco-Bench flags a real gap in memory benchmarks but offers only task descriptions with no validation or results.

read the letter

This paper proposes MemSyco-Bench to measure sycophancy that memory triggers in LLM agents. Existing benchmarks focus on storage and retrieval, so the claim that downstream reasoning effects are overlooked holds up as a gap worth addressing.

The five tasks line up with practical issues: rejecting memory as factual evidence, respecting its scope, resolving conflicts with objective evidence, tracking updates, and using valid memory for personalization. That breakdown is the clearest part of the work and gives a concrete way to think about when memory helps versus harms decisions.

The main weakness is the complete absence of any results, comparisons, or checks on the tasks themselves. The abstract describes what the benchmark should do but shows no evidence that the tasks actually capture sycophancy or avoid their own biases. The stress-test point about ground-truth labels is fair here. If correct answers depend on author-defined rules without reported validation or inter-annotator agreement, differences in agent scores could trace back to how the tasks were written rather than real behavior.

This is for people building or testing memory-augmented agents who need ideas on evaluation. A reader already working on reliability in collaborative agents could pull the task list for inspiration, but the paper does not yet deliver a ready-to-use benchmark.

It deserves peer review because the gap is real and benchmarks in this space are scarce, though the authors would need to add task details, any pilot data, and clearer definitions of correct memory use before it becomes usable.

Referee Report

2 major / 0 minor

Summary. The paper proposes MemSyco-Bench, a new benchmark for evaluating memory-induced sycophancy in LLM-based agents. It argues that existing memory benchmarks focus only on storage, retrieval, and updates, while overlooking how retrieved memories affect downstream reasoning and decisions. The benchmark consists of five tasks designed to assess whether agents reject memory as factual evidence, respect its scope, resolve conflicts with objective evidence, track updates, and use valid memory for personalization.

Significance. If the tasks prove reliable and the ground-truth labels are externally validated, the benchmark would address a genuine gap by providing the first systematic way to measure when and how memory should (or should not) influence agent decisions, which is increasingly relevant as agents move toward long-term memory use.

major comments (2)

[Abstract and task descriptions] The manuscript describes the five tasks and their intended measurements but reports no validation results, inter-annotator agreement scores, baseline agent performances, or comparisons against existing memory benchmarks. Without such evidence, it is impossible to determine whether the tasks actually capture memory-induced sycophancy rather than annotation artifacts or task-specific biases (see skeptic concern on author-defined ground truth).
[Task construction and evaluation protocol] The central claim that the benchmark 'measures when memory should influence a decision and how valid memory should be used' rests on the assumption that objective criteria for 'correct' memory use can be unambiguously defined by the authors. No formal definitions, external validation, or discussion of potential circularity in labeling (e.g., what counts as 'objective evidence' vs. memory) are provided, which directly undermines the benchmark's claimed validity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and clarification can strengthen the presentation of MemSyco-Bench. We address each major comment below and indicate the planned revisions.

read point-by-point responses

Referee: [Abstract and task descriptions] The manuscript describes the five tasks and their intended measurements but reports no validation results, inter-annotator agreement scores, baseline agent performances, or comparisons against existing memory benchmarks. Without such evidence, it is impossible to determine whether the tasks actually capture memory-induced sycophancy rather than annotation artifacts or task-specific biases (see skeptic concern on author-defined ground truth).

Authors: The current manuscript indeed presents the benchmark design without accompanying empirical results. In the revised version we will add a dedicated evaluation section that reports (1) inter-annotator agreement scores for the ground-truth labels across all five tasks, (2) baseline performance of several representative LLM agents, and (3) direct comparisons against existing memory benchmarks to show that the new tasks isolate memory-induced sycophancy rather than generic annotation artifacts. revision: yes
Referee: [Task construction and evaluation protocol] The central claim that the benchmark 'measures when memory should influence a decision and how valid memory should be used' rests on the assumption that objective criteria for 'correct' memory use can be unambiguously defined by the authors. No formal definitions, external validation, or discussion of potential circularity in labeling (e.g., what counts as 'objective evidence' vs. memory) are provided, which directly undermines the benchmark's claimed validity.

Authors: We accept that the manuscript would benefit from explicit formalization. We will insert a new subsection that supplies formal definitions for 'objective evidence', 'valid memory use', and 'memory-induced sycophancy', together with a discussion of how task instances are constructed so that objective facts are independently verifiable and separable from the memory content. This will also address potential circularity concerns. Full third-party external validation is resource-intensive and will be noted as future work; the revision will instead detail the internal consistency checks already performed during task creation. revision: partial

Circularity Check

0 steps flagged

No circularity: benchmark proposal with no derivations or predictions

full rationale

The paper introduces MemSyco-Bench as a set of five tasks to evaluate memory-induced sycophancy in agents. No equations, fitted parameters, predictions, or derivation chains are present. Task definitions rely on explicit criteria (reject memory as evidence, respect scope, resolve conflicts, track updates, use for personalization) without reducing to self-referential fits or self-citations. The contribution is the benchmark construction itself, which does not invoke any of the enumerated circularity patterns. This is the expected non-finding for a pure benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark proposal paper; the central claim rests on the design of the five tasks. No free parameters, mathematical axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.1-grok · 5738 in / 1064 out tokens · 30532 ms · 2026-07-02T06:28:05.220755+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 36 canonical work pages · 21 internal anchors

[1]

International Conference on Learning Representations , volume=

Towards understanding sycophancy in language models , author=. International Conference on Learning Representations , volume=
[2]

Intelligent Computing-Proceedings of the Computing Conference , pages=

Sycophancy in large language models: Causes and mitigations , author=. Intelligent Computing-Proceedings of the Computing Conference , pages=. 2025 , organization=

2025
[3]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

When truth is overridden: Uncovering the internal origins of sycophancy in large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[4]

Ask don't tell: Reducing sycophancy in large language models

Ask don't tell: Reducing sycophancy in large language models , author=. arXiv preprint arXiv:2602.23971 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

arXiv preprint arXiv:2505.23840 , year=

Measuring sycophancy of language models in multi-turn dialogues , author=. arXiv preprint arXiv:2505.23840 , year=

work page arXiv
[6]

arXiv preprint arXiv:2409.01658 , year=

From yes-men to truth-tellers: addressing sycophancy in large language models with pinpoint tuning , author=. arXiv preprint arXiv:2409.01658 , year=

work page arXiv
[7]

arXiv preprint arXiv:2311.09410 , year=

When large language models contradict humans? large language models' sycophantic behaviour , author=. arXiv preprint arXiv:2311.09410 , year=

work page arXiv
[8]

arXiv preprint arXiv:2503.11656 , year=

TRUTH DECAY: quantifying multi-turn sycophancy in language models , author=. arXiv preprint arXiv:2503.11656 , year=

work page arXiv
[9]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

Echoes of Agreement: Argument Driven Sycophancy in Large Language Models , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

2025
[10]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Sycophancy to subterfuge: Investigating reward-tampering in large language models , author=. arXiv preprint arXiv:2406.10162 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[11]

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct , author=. arXiv preprint arXiv:2605.21778 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[12]

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

Syceval: Evaluating llm sycophancy , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=
[13]

2510.04721 , archivePrefix=

BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs , author=. arXiv preprint arXiv:2510.04721 , year=

work page arXiv
[14]

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs? , author=. arXiv preprint arXiv:2602.01146 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[15]

arXiv preprint arXiv:2603.16557 , year=

BenchPreS: A benchmark for context-aware personalized preference selectivity of persistent-memory LLMs , author=. arXiv preprint arXiv:2603.16557 , year=

work page arXiv
[16]

Frontiers of Computer Science , volume=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024
[17]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025
[18]

Proceedings of the AAAI conference on artificial intelligence , volume=

Memorybank: Enhancing large language models with long-term memory , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
[19]

, author=

MemGPT: towards LLMs as operating systems. , author=. 2023 , publisher=

2023
[20]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0: Building production-ready ai agents with scalable long-term memory , author=. arXiv preprint arXiv:2504.19413 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Advances in Neural Information Processing Systems , volume=

A-mem: Agentic memory for llm agents , author=. Advances in Neural Information Processing Systems , volume=
[22]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Lightmem: Lightweight and efficient memory-augmented generation , author=. arXiv preprint arXiv:2510.18866 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep: a temporal knowledge graph architecture for agent memory , author=. arXiv preprint arXiv:2501.13956 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[24]

arXiv preprint arXiv:2510.07925 , year=

Enabling personalized long-term interactions in llm-based agents through persistent memory and user profiles , author=. arXiv preprint arXiv:2510.07925 , year=

work page arXiv
[25]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Large language model-based human-agent collaboration for complex task solving , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024
[26]

Memory in the Age of AI Agents

Memory in the age of ai agents , author=. arXiv preprint arXiv:2512.13564 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[27]

arXiv preprint arXiv:2309.14365 , year=

An in-depth survey of large language model-based artificial intelligence agents , author=. arXiv preprint arXiv:2309.14365 , year=

work page arXiv
[28]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Memory os of ai agent , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[29]

Advances in Neural Information Processing Systems , volume=

G-memory: Tracing hierarchical memory for multi-agent systems , author=. Advances in Neural Information Processing Systems , volume=
[30]

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Mirix: Multi-agent memory system for llm-based agents , author=. arXiv preprint arXiv:2507.07957 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[31]

Agent Workflow Memory

Agent workflow memory , author=. arXiv preprint arXiv:2409.07429 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[32]

arXiv preprint arXiv:2602.16313 , year=

Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks , author=. arXiv preprint arXiv:2602.16313 , year=

work page arXiv
[33]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Longmemeval: Benchmarking chat assistants on long-term interactive memory , author=. arXiv preprint arXiv:2410.10813 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Evaluating very long-term conversational memory of llm agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[35]

arXiv preprint arXiv:2512.06688 , year=

Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory , author=. arXiv preprint arXiv:2512.06688 , year=

work page arXiv
[36]

arXiv preprint arXiv:2504.14225 , year=

Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale , author=. arXiv preprint arXiv:2504.14225 , year=

work page arXiv
[37]

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? , author=. arXiv preprint arXiv:2605.06527 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

arXiv preprint arXiv:2602.05665 , year=

Graph-based Agent Memory: Taxonomy, Techniques, and Applications , author=. arXiv preprint arXiv:2602.05665 , year=

work page arXiv
[39]

Available at SSRN 6626878 , year=

A Systematic Survey of Self-Evolving Agents: From Model-Centric to Environment-Driven Co-Evolution , author=. Available at SSRN 6626878 , year=
[40]

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents , author=. arXiv preprint arXiv:2601.02845 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Deepseek-v4: Towards highly efficient million-token context intelligence

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author=. arXiv preprint arXiv:2606.19348 , year=

work page arXiv
[42]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[43]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[44]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[45]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[46]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=
[47]

Supermemory: Memory and Context Engine for AI , year =
[48]

Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

Interaction context often increases sycophancy in LLMs , author=. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

2026
[49]

Simple synthetic data reduces sycophancy in large language models

Simple synthetic data reduces sycophancy in large language models , author=. arXiv preprint arXiv:2308.03958 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[50]

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Social sycophancy: A broader understanding of llm sycophancy , author=. arXiv preprint arXiv:2505.13995 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Findings of the Association for Computational Linguistics: ACL 2026 , pages=

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment , author=. Findings of the Association for Computational Linguistics: ACL 2026 , pages=

2026
[52]

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[53]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Sycophancy mitigation through reinforcement learning with uncertainty-aware adaptive reasoning trajectories , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025
[54]

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

H-mem: Hybrid multi-dimensional memory management for long-context conversational agents , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[55]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Crafting personalized agents through retrieval-augmented generation on editable memory graphs , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[56]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[57]

Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th annual acm symposium on user interface software and technology , pages=
[58]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Faithfulrag: Fact-level conflict modeling for context-faithful retrieval-augmented generation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[59]

arXiv preprint arXiv:2506.05690 , year=

When to use graphs in rag: A comprehensive analysis for graph retrieval-augmented generation , author=. arXiv preprint arXiv:2506.05690 , year=

work page arXiv
[60]

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation , author=. arXiv preprint arXiv:2606.00610 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[61]

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems , author=. arXiv preprint arXiv:2510.17281 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[62]

arXiv preprint arXiv:2602.01313 , year=

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language ModelsEverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models , author=. arXiv preprint arXiv:2602.01313 , year=

work page arXiv
[63]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[64]

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[1] [1]

International Conference on Learning Representations , volume=

Towards understanding sycophancy in language models , author=. International Conference on Learning Representations , volume=

[2] [2]

Intelligent Computing-Proceedings of the Computing Conference , pages=

Sycophancy in large language models: Causes and mitigations , author=. Intelligent Computing-Proceedings of the Computing Conference , pages=. 2025 , organization=

2025

[3] [3]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

When truth is overridden: Uncovering the internal origins of sycophancy in large language models , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

[4] [4]

Ask don't tell: Reducing sycophancy in large language models

Ask don't tell: Reducing sycophancy in large language models , author=. arXiv preprint arXiv:2602.23971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

arXiv preprint arXiv:2505.23840 , year=

Measuring sycophancy of language models in multi-turn dialogues , author=. arXiv preprint arXiv:2505.23840 , year=

work page arXiv

[6] [6]

arXiv preprint arXiv:2409.01658 , year=

From yes-men to truth-tellers: addressing sycophancy in large language models with pinpoint tuning , author=. arXiv preprint arXiv:2409.01658 , year=

work page arXiv

[7] [7]

arXiv preprint arXiv:2311.09410 , year=

When large language models contradict humans? large language models' sycophantic behaviour , author=. arXiv preprint arXiv:2311.09410 , year=

work page arXiv

[8] [8]

arXiv preprint arXiv:2503.11656 , year=

TRUTH DECAY: quantifying multi-turn sycophancy in language models , author=. arXiv preprint arXiv:2503.11656 , year=

work page arXiv

[9] [9]

Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

Echoes of Agreement: Argument Driven Sycophancy in Large Language Models , author=. Findings of the Association for Computational Linguistics: EMNLP 2025 , pages=

2025

[10] [10]

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Sycophancy to subterfuge: Investigating reward-tampering in large language models , author=. arXiv preprint arXiv:2406.10162 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[11] [11]

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct , author=. arXiv preprint arXiv:2605.21778 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

Syceval: Evaluating llm sycophancy , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

[13] [13]

2510.04721 , archivePrefix=

BrokenMath: A Benchmark for Sycophancy in Theorem Proving with LLMs , author=. arXiv preprint arXiv:2510.04721 , year=

work page arXiv

[14] [14]

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

PersistBench: When Should Long-Term Memories Be Forgotten by LLMs? , author=. arXiv preprint arXiv:2602.01146 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

arXiv preprint arXiv:2603.16557 , year=

BenchPreS: A benchmark for context-aware personalized preference selectivity of persistent-memory LLMs , author=. arXiv preprint arXiv:2603.16557 , year=

work page arXiv

[16] [16]

Frontiers of Computer Science , volume=

A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

2024

[17] [17]

ACM Transactions on Information Systems , volume=

A survey on the memory mechanism of large language model-based agents , author=. ACM Transactions on Information Systems , volume=. 2025 , publisher=

2025

[18] [18]

Proceedings of the AAAI conference on artificial intelligence , volume=

Memorybank: Enhancing large language models with long-term memory , author=. Proceedings of the AAAI conference on artificial intelligence , volume=

[19] [19]

, author=

MemGPT: towards LLMs as operating systems. , author=. 2023 , publisher=

2023

[20] [20]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0: Building production-ready ai agents with scalable long-term memory , author=. arXiv preprint arXiv:2504.19413 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[21] [21]

Advances in Neural Information Processing Systems , volume=

A-mem: Agentic memory for llm agents , author=. Advances in Neural Information Processing Systems , volume=

[22] [22]

LightMem: Lightweight and Efficient Memory-Augmented Generation

Lightmem: Lightweight and efficient memory-augmented generation , author=. arXiv preprint arXiv:2510.18866 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep: a temporal knowledge graph architecture for agent memory , author=. arXiv preprint arXiv:2501.13956 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

arXiv preprint arXiv:2510.07925 , year=

Enabling personalized long-term interactions in llm-based agents through persistent memory and user profiles , author=. arXiv preprint arXiv:2510.07925 , year=

work page arXiv

[25] [25]

Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

Large language model-based human-agent collaboration for complex task solving , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

2024

[26] [26]

Memory in the Age of AI Agents

Memory in the age of ai agents , author=. arXiv preprint arXiv:2512.13564 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

arXiv preprint arXiv:2309.14365 , year=

An in-depth survey of large language model-based artificial intelligence agents , author=. arXiv preprint arXiv:2309.14365 , year=

work page arXiv

[28] [28]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Memory os of ai agent , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025

[29] [29]

Advances in Neural Information Processing Systems , volume=

G-memory: Tracing hierarchical memory for multi-agent systems , author=. Advances in Neural Information Processing Systems , volume=

[30] [30]

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Mirix: Multi-agent memory system for llm-based agents , author=. arXiv preprint arXiv:2507.07957 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[31] [31]

Agent Workflow Memory

Agent workflow memory , author=. arXiv preprint arXiv:2409.07429 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

arXiv preprint arXiv:2602.16313 , year=

Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks , author=. arXiv preprint arXiv:2602.16313 , year=

work page arXiv

[33] [33]

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Longmemeval: Benchmarking chat assistants on long-term interactive memory , author=. arXiv preprint arXiv:2410.10813 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Evaluating very long-term conversational memory of llm agents , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[35] [35]

arXiv preprint arXiv:2512.06688 , year=

Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory , author=. arXiv preprint arXiv:2512.06688 , year=

work page arXiv

[36] [36]

arXiv preprint arXiv:2504.14225 , year=

Know me, respond to me: Benchmarking llms for dynamic user profiling and personalized responses at scale , author=. arXiv preprint arXiv:2504.14225 , year=

work page arXiv

[37] [37]

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid?

STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? , author=. arXiv preprint arXiv:2605.06527 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

arXiv preprint arXiv:2602.05665 , year=

Graph-based Agent Memory: Taxonomy, Techniques, and Applications , author=. arXiv preprint arXiv:2602.05665 , year=

work page arXiv

[39] [39]

Available at SSRN 6626878 , year=

A Systematic Survey of Self-Evolving Agents: From Model-Centric to Environment-Driven Co-Evolution , author=. Available at SSRN 6626878 , year=

[40] [40]

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents

TiMem: Temporal-Hierarchical Memory Consolidation for Long-Horizon Conversational Agents , author=. arXiv preprint arXiv:2601.02845 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Deepseek-v4: Towards highly efficient million-token context intelligence

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author=. arXiv preprint arXiv:2606.19348 , year=

work page arXiv

[42] [42]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report , author=. arXiv preprint arXiv:2412.19437 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

The Llama 3 Herd of Models

The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[46] [46]

Advances in neural information processing systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in neural information processing systems , volume=

[47] [47]

Supermemory: Memory and Context Engine for AI , year =

[48] [48]

Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

Interaction context often increases sycophancy in LLMs , author=. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems , pages=

2026

[49] [49]

Simple synthetic data reduces sycophancy in large language models

Simple synthetic data reduces sycophancy in large language models , author=. arXiv preprint arXiv:2308.03958 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

ELEPHANT: Measuring and understanding social sycophancy in LLMs

Social sycophancy: A broader understanding of llm sycophancy , author=. arXiv preprint arXiv:2505.13995 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Findings of the Association for Computational Linguistics: ACL 2026 , pages=

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment , author=. Findings of the Association for Computational Linguistics: ACL 2026 , pages=

2026

[52] [52]

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[53] [53]

Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

Sycophancy mitigation through reinforcement learning with uncertainty-aware adaptive reasoning trajectories , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

2025

[54] [54]

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

H-mem: Hybrid multi-dimensional memory management for long-context conversational agents , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[55] [55]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Crafting personalized agents through retrieval-augmented generation on editable memory graphs , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[56] [56]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[57] [57]

Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

[58] [58]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Faithfulrag: Fact-level conflict modeling for context-faithful retrieval-augmented generation , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[59] [59]

arXiv preprint arXiv:2506.05690 , year=

When to use graphs in rag: A comprehensive analysis for graph retrieval-augmented generation , author=. arXiv preprint arXiv:2506.05690 , year=

work page arXiv

[60] [60]

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation

MemGraphRAG: Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation , author=. arXiv preprint arXiv:2606.00610 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[61] [61]

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems , author=. arXiv preprint arXiv:2510.17281 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

arXiv preprint arXiv:2602.01313 , year=

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language ModelsEverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models , author=. arXiv preprint arXiv:2602.01313 , year=

work page arXiv

[63] [63]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Longbench v2: Towards deeper understanding and reasoning on realistic long-context multitasks , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[64] [64]

Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning , author=. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=