arxiv: 2601.18700 · v2 · submitted 2026-01-26 · 💻 cs.AI

TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin

Pith reviewed 2026-05-16 11:03 UTC · model grok-4.3

keywords: tool-augmented agents, emotional support conversation, benchmark, large language models, hallucination reduction, TEA-Bench, dialogue agents, factual grounding

The pith

Tool augmentation improves emotional support agents but gains depend on model capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TEA-Bench, an interactive benchmark for tool-enhanced emotional support conversation agents that includes realistic scenarios, an MCP-style tool environment, and process-level metrics for both quality and factual grounding. Experiments across nine large language models show that adding tools generally raises support quality and cuts hallucinations, yet the improvement is strongly tied to model strength. Stronger models deploy tools selectively and effectively, while weaker models show only marginal gains. The work also releases the TEA-Dialog dataset of tool-enhanced dialogues and demonstrates that supervised fine-tuning boosts in-distribution performance but fails to generalize.

Core claim

TEA-Bench evaluates tool-augmented agents in emotional support conversations by supplying realistic emotional scenarios, an MCP-style tool environment for factual grounding, and process-level metrics that jointly measure support quality and hallucination. Experiments on nine LLMs establish that tool augmentation improves emotional support quality and reduces hallucination, with gains that are capacity-dependent: stronger models use tools more selectively and effectively while weaker models benefit only marginally. Supervised fine-tuning on the released TEA-Dialog dataset improves in-distribution results but generalizes poorly.

What carries the argument

TEA-Bench, an interactive benchmark that supplies realistic emotional scenarios, an MCP-style tool environment, and process-level metrics to assess both emotional support quality and factual grounding in tool-augmented agents.

If this is right

Tool use can reduce hallucinations while raising overall quality in multi-turn emotional support dialogues.
Stronger models leverage external tools more selectively and effectively than weaker models.
Supervised fine-tuning on tool-enhanced dialogues improves performance within the training distribution but transfers poorly to new scenarios.
Reliable emotional support agents require effective integration of external tools for factual grounding.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future benchmarks could test whether training weaker models explicitly on selective tool-use patterns narrows the capacity gap.
The same MCP-style tool environment might be applied to other dialogue domains that need both affective and instrumental support.
Process-level metrics could be adapted to measure how tool calls alter the balance between emotional expression and factual guidance across turns.

Load-bearing premise

The chosen tool environment and process-level metrics accurately reflect real-world factual grounding and emotional support quality without introducing their own biases or unrealistic constraints.

What would settle it

If re-running the nine-LLM experiments with a different set of tools or alternative process metrics produced no capacity-dependent pattern, or reversed the observed gains, the central claim would be falsified.

Original abstract

Emotional Support Conversation requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents. Our code and data can be found in https://github.com/XingYuSSS/TEA-Bench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TEA-Bench, the first interactive benchmark for tool-augmented emotional support dialogue agents. It includes realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess emotional support quality and factual grounding. Experiments on nine LLMs show that tool augmentation generally improves support quality and reduces hallucination, with gains strongly capacity-dependent: stronger models use tools more selectively while weaker models benefit only marginally. The authors release the TEA-Dialog dataset and find that supervised fine-tuning improves in-distribution performance but generalizes poorly.

Significance. If the process-level metrics are validated against human judgments, the benchmark would offer a useful framework for developing reliable tool-enhanced emotional support agents, with the capacity-dependent findings providing practical guidance on model selection. The open release of code and data supports reproducibility.

major comments (2)

[§4 (Experiments)] The central claims of improved emotional support quality and hallucination reduction are presented without any reported statistical significance tests, confidence intervals, or controls for potential confounds such as prompt variations or tool output formatting, weakening the robustness of the capacity-dependent conclusions.
[§3.2 (Process-level Metrics)] The factual grounding and hallucination metrics are load-bearing for the main results, yet the manuscript reports no correlation with human judgments of support quality, no inter-annotator agreement, and no component ablations; this leaves open whether observed gains reflect genuine capability improvements or artifacts of the MCP-style environment's provided facts.

minor comments (2)

[Abstract] The acronym 'MCP' appears without expansion on first use; please define it explicitly.
[§5 (Dataset)] The description of TEA-Dialog lacks basic statistics such as number of dialogues, average turns, or tool-call frequency, which would aid interpretation of the SFT generalization results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, agreeing where the manuscript can be strengthened through revision.

Referee: [§4 (Experiments)] The central claims of improved emotional support quality and hallucination reduction are presented without any reported statistical significance tests, confidence intervals, or controls for potential confounds such as prompt variations or tool output formatting, weakening the robustness of the capacity-dependent conclusions.

Authors: We acknowledge that the current manuscript does not include statistical significance tests or confidence intervals. In the revised version we will add paired t-tests (with Bonferroni correction) comparing tool-augmented vs. baseline conditions for each model, report 95% confidence intervals on all aggregate metrics, and include sensitivity analyses using three distinct prompt templates plus standardized tool-output formatting. These additions will directly support the capacity-dependent claims with quantitative robustness checks.

revision: yes
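
The per-model comparison proposed in this response can be sketched roughly as follows. This is an illustration, not the authors' analysis: it swaps the paired t-test for an exact sign-flip permutation test so that it runs on the Python standard library alone, and every function name and score below is a hypothetical placeholder rather than a value from the paper.

```python
from itertools import product
from statistics import mean


def paired_permutation_p(baseline, tooled):
    """Exact two-sided sign-flip permutation test on paired differences.

    Stand-in for a paired t-test; only practical for small samples because
    it enumerates all 2**n sign assignments.
    """
    if len(baseline) != len(tooled):
        raise ValueError("paired samples must have equal length")
    if len(baseline) > 20:
        raise ValueError("too many pairs for exact enumeration")
    diffs = [t - b for b, t in zip(baseline, tooled)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to be positive or negative, so every sign pattern is equally likely.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean(s * d for s, d in zip(signs, diffs))
        if abs(flipped) >= observed - 1e-12:
            hits += 1
    return hits / total


def bonferroni(p_values, alpha=0.05):
    """Return (adjusted p-value, reject?) for each of m comparisons."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p, p <= alpha) for p in adjusted]


# Hypothetical per-scenario quality scores for two made-up models.
scores = {
    "model_a": ([3.1, 2.8, 3.4, 3.0, 2.9, 3.2], [3.9, 3.5, 4.1, 3.6, 3.8, 3.7]),
    "model_b": ([2.5, 2.7, 2.4, 2.6, 2.8, 2.5], [2.6, 2.5, 2.6, 2.7, 2.7, 2.6]),
}
raw = [paired_permutation_p(b, t) for b, t in scores.values()]
for name, (p_adj, reject) in zip(scores, bonferroni(raw)):
    print(f"{name}: adjusted p = {p_adj:.4f}, significant = {reject}")
```

Multiplying each p-value by the number of comparisons is the simplest way to keep the family-wise error rate at the chosen alpha when one test is run per model.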

Referee: [§3.2 (Process-level Metrics)] The factual grounding and hallucination metrics are load-bearing for the main results, yet the manuscript reports no correlation with human judgments of support quality, no inter-annotator agreement, and no component ablations; this leaves open whether observed gains reflect genuine capability improvements or artifacts of the MCP-style environment's provided facts.

Authors: We agree that explicit validation of the process-level metrics against human judgments is required. In revision we will run a human study on 200 sampled dialogues in which three annotators rate emotional support quality and factual accuracy; we will report Pearson/Spearman correlations with our automatic metrics and inter-annotator agreement (Fleiss' kappa). We will also add component ablations that disable tool access while keeping the same environment, demonstrating that performance gains disappear without tool use and are therefore not artifacts of the provided facts. The MCP design supplies verifiable external knowledge precisely to enable objective hallucination detection, which our metrics exploit.

revision: yes
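
For readers unfamiliar with the agreement statistic named in this response, a minimal sketch of Fleiss' kappa follows. It assumes each dialogue is rated by the same number of annotators and that ratings are tallied as per-category counts; the function name, the three-category scale, and the example counts are hypothetical and do not come from the paper.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every row must sum to the same number of annotators.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of annotators")
    n_cats = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items        # mean observed agreement
    p_exp = sum(p * p for p in p_cat)    # agreement expected by chance
    if p_exp == 1.0:
        return float("nan")              # kappa is undefined in this case
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical tallies: 5 dialogues, 3 annotators, categories (low, mid, high).
example = [
    [0, 0, 3],
    [0, 1, 2],
    [3, 0, 0],
    [1, 2, 0],
    [0, 3, 0],
]
print(f"Fleiss' kappa: {fleiss_kappa(example):.3f}")
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.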

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent experimental results

This is an empirical benchmark paper that introduces TEA-Bench, an MCP-style tool environment, process-level metrics, and TEA-Dialog dataset, then reports experimental outcomes across nine LLMs. The central claims rest on observed performance differences (tool augmentation benefits being capacity-dependent) rather than any derivation, equation, or prediction that reduces to its own inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The paper releases code and data, making results externally verifiable. The absence of mathematical derivations or uniqueness theorems means none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, free parameters, or invented theoretical entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship
cs.CL 2026-04 unverdicted novelty 6.0

ComPASS creates tool-augmented LLM agents for substantive social support, releases the first personalized benchmark ComPASS-Bench, and fine-tunes ComPASS-Qwen to outperform its base model while matching larger LLMs.