arxiv: 2606.02470 · v1 · pith:ZG2ENIL3new · submitted 2026-06-01 · 💻 cs.AI

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

Pith reviewed 2026-06-28 14:48 UTC · model grok-4.3

classification 💻 cs.AI
keywords MCP-Persona · LLM agents · personalized tools · benchmark · environment simulation · personal applications · Reddit · Slack

The pith

MCP-Persona shows that state-of-the-art LLM agents struggle significantly with personalized tool use in personal applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MCP-Persona as the first benchmark for evaluating LLM agents on real-world personalized MCP tools across applications such as Reddit, Xiaohongshu, Lark, and Slack. Existing benchmarks focus on generic information-seeking tools and overlook the distinct challenges of tools that interact with individual accounts or local databases. Experiments on various SOTA agents demonstrate their significant struggles with these personalized tools. The benchmark uses environment simulation to expose these limitations and guide future improvements in agent design for practical personal use.

Core claim

MCP-Persona is the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. It encompasses social media platforms like Reddit and Xiaohongshu along with enterprise suites such as Lark and Slack. Through environment simulation, the benchmark shows that state-of-the-art agents exhibit significant struggles with personalized tool use.

What carries the argument

MCP-Persona benchmark, which employs environment simulation to test LLM agents on personalized MCP tools that interact with individual accounts or local databases.
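
As a rough illustration of what environment simulation means for a personalized tool, the sketch below models a tool as a thin layer over per-account state, with one call that only reads context and one that modifies it. It is not the paper's implementation: the class name `SimulatedChatTool`, its methods, the context layout, and the `success`/`error`/`result` response envelope are all assumptions made for this example.

```python
from copy import deepcopy
from typing import Any, Dict


class SimulatedChatTool:
    """Hypothetical stand-in for a personalized messaging tool.

    All state lives in an in-memory context dict, so no real account is touched.
    """

    def __init__(self, context: Dict[str, Any]) -> None:
        # Copy the context so every evaluation run starts from the same state.
        self._context = deepcopy(context)

    def _fail(self, message: str) -> Dict[str, Any]:
        return {"success": False, "error": message, "result": None}

    def list_messages(self, chat_id: str) -> Dict[str, Any]:
        """Read-only call: retrieves context without changing it."""
        chat = self._context["chats"].get(chat_id)
        if chat is None:
            return self._fail(f"unknown chat_id: {chat_id}")
        return {"success": True, "error": None,
                "result": {"messages": list(chat["messages"])}}

    def send_message(self, chat_id: str, text: str) -> Dict[str, Any]:
        """State-modifying call: appends a message to the simulated chat."""
        chat = self._context["chats"].get(chat_id)
        if chat is None:
            return self._fail(f"unknown chat_id: {chat_id}")
        message = {
            "id": f"msg_{len(chat['messages']) + 1}",
            "sender": self._context["account"],
            "text": text,
        }
        chat["messages"].append(message)
        return {"success": True, "error": None, "result": message}
```

An agent under test would see only the tool interface. An evaluator could then inspect the simulated state after the run to check whether the expected changes were made, which is one plausible reason to prefer simulation over live accounts.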

If this is right

  • Agents must develop better mechanisms for managing account-specific permissions and data when using personalized tools.
  • Benchmarking efforts should expand to cover additional personal applications beyond generic information retrieval.
  • Progress on MCP-Persona can directly inform improvements in agent reliability for daily personal and collaborative workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents that underperform here are likely to face similar issues in other domains involving private user data such as email or personal calendars.
  • The simulation approach could be extended to train agents on privacy-preserving personalization before real deployment.
  • Wider adoption of such benchmarks may push developers to prioritize tool-use safety features when handling individual accounts.

Load-bearing premise

The environment simulations used in MCP-Persona accurately capture the real practical challenges of interacting with personalized tools in applications such as Reddit and Slack.

What would settle it

A direct test showing that state-of-the-art agents achieve high success rates on MCP-Persona tasks without the reported struggles, or a comparison where real user interactions diverge substantially from the simulated results.

Figures

Figures reproduced from arXiv: 2606.02470 by Gongyi Zou, Haoting Shi, Jingxing Wang, Jingyi Chai, Peizhi Niu, Shuo Tang, Siheng Chen, Wenhao Wang, Xianghe Pang, Xiyuan Yang, Yanfeng Wang, Yaxin Du.

Figure 1. System overview of MCP-Persona, which is built upon the interaction of Tools, Contexts, and Tasks. For each component, we introduce a dedicated method, described in detail as Tool-Traverse (§3.1), Context-Tree (§3.2), and Persona-Gen (§3.3).
Figure 2. Dataset and tool statistics of MCP-Persona. MCP-Persona comprises a total of 24 MCP servers including 12 personalized servers. It encompasses tasks for both single-server and cross-server (hybrid) scenarios, featuring diverse tool chain and personal context distributions to ensure comprehensive evaluation.
Figure 3. Analysis of efficiency and performance trade-offs across various models, based on average token count, cost, and step length.
Figure 4. A visualized illustration of how tools and contexts interact in the simulated environments of MCP-Persona. Left: a representative Lark-MCP task where the agent progresses through a series of tool calls that either retrieve contextual information or apply state-modifying operations. Right: the corresponding Lark-MCP context-tree, the structured data accessed and updated by corresponding tools throughout the…
original abstract

The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark's crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at https://github.com/wwh0411/MCP-Persona}{https://github.com/wwh0411/MCP-Persona.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces MCP-Persona, a benchmark for LLM agents on personalized MCP tools in real-world applications such as Reddit, Xiaohongshu (Rednote), Lark (Feishu), and Slack. It argues that prior benchmarks emphasize generic information-seeking tools and do not address challenges arising from interactions with individual accounts or local databases. The paper reports experiments showing that state-of-the-art agents struggle with personalized tool use in the simulated MCP-Persona environments and releases the benchmark at the provided GitHub repository.

Significance. If the simulated environments faithfully reproduce the practical challenges of real personalized MCP tool interactions, MCP-Persona would fill a clear gap by enabling targeted evaluation of agent limitations in personal and enterprise social applications. This could support more effective development of agents for user-specific tasks that current generic benchmarks overlook.

major comments (1)
  1. [Abstract and benchmark design] The claim that MCP-Persona captures 'practical challenges posed by personal social applications' and that observed agent struggles reflect real limitations is load-bearing for the central contribution, yet no fidelity validation is described (e.g., no comparison of simulated tool responses, state transitions, error rates, or rate-limit behaviors against live APIs for Reddit or Slack). Without such checks, failures may arise from simulation simplifications rather than agent deficiencies.
minor comments (1)
  1. [Abstract] The GitHub URL in the abstract contains a duplicated and malformed link: 'https://github.com/wwh0411/MCP-Persona}{https://github.com/wwh0411/MCP-Persona'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below and outline planned revisions.

point-by-point responses
  1. Referee: [Abstract and benchmark design] The claim that MCP-Persona captures 'practical challenges posed by personal social applications' and that observed agent struggles reflect real limitations is load-bearing for the central contribution, yet no fidelity validation is described (e.g., no comparison of simulated tool responses, state transitions, error rates, or rate-limit behaviors against live APIs for Reddit or Slack). Without such checks, failures may arise from simulation simplifications rather than agent deficiencies.

    Authors: We agree that explicit fidelity validation against live APIs would strengthen the central claim. The current manuscript describes the environments as being built on the MCP standard to reproduce personalized tool interactions (account-specific data, local databases, collaboration features) but does not report side-by-side comparisons of simulated versus live responses, state transitions, error rates, or rate-limit behavior. In the revised manuscript we will add a new subsection under benchmark design that details the simulation construction process, including the sources used (official MCP specifications, public API documentation for Reddit, Xiaohongshu, Lark, and Slack) and the modeling choices for state transitions and error conditions. We will also include an explicit limitations paragraph discussing the absence of live-API validation and the potential impact of any simulation simplifications on the reported agent performance gaps. revision: yes
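
To make the requested side-by-side comparison concrete, here is a minimal sketch of one possible check: replay the same call against a live tool and its simulation, then compare the structure of the two responses rather than their exact values. The function names `response_shape` and `structural_match_rate` are hypothetical, and this is not a procedure described by the referee or the authors. It covers only response structure, not state transitions, error rates, or rate-limit behavior.

```python
from typing import Any, Iterable, Tuple


def response_shape(value: Any) -> Any:
    """Reduce a JSON-like value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: response_shape(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        # Keep each distinct element shape once, in order of first appearance,
        # so list length does not affect the comparison.
        shapes = []
        for item in value:
            shape = response_shape(item)
            if shape not in shapes:
                shapes.append(shape)
        return shapes
    return type(value).__name__


def structural_match_rate(pairs: Iterable[Tuple[Any, Any]]) -> float:
    """Fraction of (live, simulated) response pairs with identical shape."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (live, simulated) pair")
    matches = sum(
        1 for live, simulated in pairs
        if response_shape(live) == response_shape(simulated)
    )
    return matches / len(pairs)
```

Comparing shapes instead of values tolerates differences that are expected between a live account and a simulated one (IDs, timestamps, counts) while still flagging missing fields or changed types. An empty list and a populated list count as a mismatch under this rule, which may or may not be desirable depending on the tool.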

Circularity Check

0 steps flagged

No circularity: new benchmark construction with independent experimental reporting

full rationale

The paper introduces MCP-Persona as a new benchmark for personalized MCP tools and reports experimental outcomes on SOTA agents. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The central claim rests on the benchmark design and observed agent performance rather than reducing to any self-defined quantities or prior author work by construction. This is a standard empirical benchmark paper with self-contained evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the new benchmark construction and the assumption that simulated environments reflect real personalized tool challenges; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Simulated environments can faithfully represent the practical challenges of personalized tool interactions in real applications.
    Invoked to justify using the benchmark for evaluation (abstract description of MCP-Persona design).

pith-pipeline@v0.9.1-grok · 5771 in / 1143 out tokens · 21803 ms · 2026-06-28T14:48:28.193970+00:00 · methodology

