AI generates well-liked but templatic empathic responses

Desmond C. Ong; Emma S. Gueorguieva; Hongli Zhan; Javier Hernandez; Jina Suh; Junyi Jessy Li; Tatiana Lau

arxiv: 2604.08479 · v2 · pith:2MWGHTDLnew · submitted 2026-04-09 · 💻 cs.CL

Emma S. Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong

Pith reviewed 2026-05-10 17:54 UTC · model grok-4.3

keywords: empathy, LLM responses, empathic tactics, response template, AI-generated empathy, human vs AI, emotional support, discourse analysis

The pith

Large language models rely on one recurring sequence of empathic tactics in most of their responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to explain why people often rate LLM responses to emotional support requests higher than human-written ones. The authors create a taxonomy of ten specific tactics for showing empathy through language, such as validating feelings or restating the problem. They then examine thousands of responses generated by six different models and compare them to human replies. The analysis reveals that AI outputs follow a consistent pattern of these tactics in 83 to 90 percent of cases, and that pattern fills most of the content in those responses. Human replies employ the same tactics but arrange them in far more varied combinations.

Core claim

LLMs have learned and consistently deploy a well-liked template for expressing empathy. Across two studies totaling more than 4,500 responses, a structured sequence of the ten tactics matches 83 to 90 percent of LLM responses and covers 81 to 92 percent of each matched response. Human-written responses prove more diverse in how they combine the tactics.

What carries the argument

A taxonomy of 10 empathic language tactics assembled into a single recurring template that organizes most AI-generated replies.
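To make the two headline numbers concrete, here is a minimal sketch of how a match rate and a coverage figure could be computed once each response has been annotated as a sequence of tactic labels. The single-letter codes, the regular expression standing in for the template, and the toy responses are all illustrative assumptions; the paper's actual tactic names, template, and matching procedure are not reproduced here.

```python
import re

# Hypothetical single-letter tactic codes (placeholders, not the paper's labels):
# V = validate feelings, P = paraphrase, A = advice, Q = question, O = other
# Hypothetical template: validate, paraphrase (1+), advice (1+), optional question.
TEMPLATE = re.compile(r"VP+A+Q?")


def template_stats(responses):
    """Return (match_rate, mean_coverage_among_matched).

    Each response is a non-empty list of (tactic_code, n_tokens) segments.
    A response matches if the template occurs anywhere in its code string;
    coverage is the share of its tokens inside the longest matched span.
    """
    coverages = []
    for segments in responses:
        codes = "".join(code for code, _ in segments)
        spans = [m.span() for m in TEMPLATE.finditer(codes)]
        if not spans:
            continue  # template not found in this response
        start, end = max(spans, key=lambda s: s[1] - s[0])
        total = sum(n for _, n in segments)
        covered = sum(n for _, n in segments[start:end])
        coverages.append(covered / total)
    match_rate = len(coverages) / len(responses)
    mean_cov = sum(coverages) / len(coverages) if coverages else 0.0
    return match_rate, mean_cov


# Invented toy annotations, for illustration only.
toy = [
    [("V", 10), ("P", 20), ("A", 30), ("Q", 10)],  # template spans everything
    [("O", 10), ("V", 10), ("P", 10), ("A", 10)],  # template covers 30 of 40 tokens
    [("A", 15), ("Q", 5)],                         # no match
    [("Q", 5), ("V", 5)],                          # no match
]
rate, cov = template_stats(toy)
print(rate, cov)  # 0.5 0.875
```

Reporting coverage only over matched responses mirrors how the summary above phrases it ("covers 81 to 92 percent of each matched response"); whether the authors measure coverage in tokens, sentences, or segments is not stated here, so tokens are an assumption of this sketch.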

If this is right

AI empathic replies will tend to include the same core elements in roughly the same order.
The high coverage rate of the template accounts for why these responses receive strong ratings for empathy.
Human responses draw on a wider range of tactic orders and combinations, producing greater variety.
Training on large human datasets may have steered models toward an average but effective pattern.
Future systems could start from this template and then deliberately vary the order or add extra tactics.
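One way to put a number on "a wider range of tactic orders and combinations" is to treat each response's tactic sequence as a category and compute the Shannon entropy of the resulting distribution. This measure is chosen for illustration and is not claimed to be the paper's metric; the tactic codes and sequences below are invented.

```python
from collections import Counter
from math import log2


def sequence_entropy(sequences):
    """Shannon entropy (in bits) of the distribution over distinct tactic sequences."""
    counts = Counter(tuple(seq) for seq in sequences)
    total = sum(counts.values())
    # p * log2(1/p) summed over distinct sequences
    return sum((c / total) * log2(total / c) for c in counts.values())


# Invented toy data; the letters are hypothetical tactic codes.
ai_like = [["V", "P", "A"]] * 4  # the same order every time
human_like = [["V", "P", "A"], ["A", "V"], ["Q", "P"], ["P", "V", "Q"]]

print(sequence_entropy(ai_like))     # 0.0 (no variety)
print(sequence_entropy(human_like))  # 2.0 (four equally likely orders)
```

Exact-sequence entropy is sensitive to response length, so a real comparison would likely need a length-matched sample or a coarser unit such as tactic bigrams.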

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users may grow to expect this steady style from AI and notice departures from it.
Repeated exposure to the same structure could eventually make AI empathy feel less personal.
Applying the taxonomy to other AI social tasks such as giving advice would test whether similar templates appear elsewhere.
The finding suggests a broader pattern in which AI produces safe, average versions of complex social language.

Load-bearing premise

The taxonomy of ten tactics captures the main functional parts of empathic language in both AI and human writing without leaving out key differences or introducing systematic bias in how responses are labeled.

What would settle it

If a fresh collection of several hundred human empathic responses, analyzed with the same taxonomy, matched the AI template at rates above 80 percent, the reported difference in diversity would be undermined.

Original abstract

Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse functional level. We discovered a template, a structured sequence of tactics, that matches between 83 and 90% of LLM responses (and 60-83% in a held-out sample), and when those are matched, covers 81-92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLMs stick to a liked empathy template while humans vary more, but the taxonomy's construction is the part that needs the most scrutiny.

The main thing to know is that this paper quantifies how formulaic LLM empathic responses are: a specific sequence of 10 tactics matches 83-90% of the AI outputs across six models and covers most of their content, with a drop to 60-83% on held-out data, while human responses show more structural variety. That contrast is the concrete empirical result here, and it lines up with why people sometimes rate the AI versions higher on empathy scales even if the underlying understanding isn't deeper.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs produce more formulaic empathic responses than humans by consistently deploying a template consisting of a structured sequence of 10 empathic tactics (e.g., validating feelings, paraphrasing). Across two studies with 3,265 AI-generated responses from six models and 1,290 human-written responses, this template matches 83-90% of LLM outputs (60-83% in a held-out sample) and covers 81-92% of their content when matched, while human responses exhibit greater diversity at the discourse-functional level. The work suggests this templatic structure may explain why users rate LLM empathy highly.

Significance. If the taxonomy is shown to be valid and independent, the results would provide a concrete empirical account of why LLM empathic responses are preferred and could inform the design of less formulaic AI systems for emotional support. The large sample sizes, multi-model comparison, and direct human-AI contrast are clear strengths that support the empirical core of the work.

major comments (3)

[Methods (Taxonomy Development)] Methods section on taxonomy development: the paper provides no details on how the 10 empathic tactics were derived, validated against external frameworks, or assessed for inter-rater reliability. This is load-bearing for the central claim because, without evidence that the taxonomy was constructed independently of the LLM responses under study, the reported 83-90% template match rates risk being tautological (i.e., the scheme fits the data from which it was likely extracted).
[Results (Template Matching)] Results section on template matching and held-out sample: the drop to 60-83% match on the held-out sample is presented as evidence of generalizability, but the manuscript does not specify how the held-out set was constructed, whether the taxonomy was frozen prior to its application, or what controls were used for response length, topic, and prompt construction. These omissions directly affect whether the diversity contrast with human responses can be interpreted as substantive rather than an artifact of the classification scheme.
[Study Design and Annotation] Study design and annotation: no statistical tests for the reported percentages, no inter-annotator agreement metrics, and no comparison to established empathy taxonomies from psychology are described. This weakens the assertion that human responses are 'more diverse' rather than simply containing functional moves absent from the 10-tactic set.
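For reference, inter-annotator agreement of the kind requested here is commonly reported as Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two coders' labels; the labels are invented and the letter codes are hypothetical placeholders for tactics.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Invented labels for eight segments (V, P, A are placeholder tactic codes).
coder_1 = ["V", "V", "P", "P", "A", "A", "V", "P"]
coder_2 = ["V", "V", "P", "A", "A", "A", "V", "P"]

print(round(cohens_kappa(coder_1, coder_2), 3))  # 0.814
```

With more than two coders or multi-label segments, a different statistic (for example Krippendorff's alpha) would be the usual choice; that is outside this sketch.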

minor comments (2)

[Abstract] Abstract: the claim of '2 studies' is stated without clarifying the division of labor between the studies or key controls, which would help readers assess the scope immediately.
[Discussion] Discussion: the implications section could more explicitly address whether the templatic nature might reduce perceived authenticity over repeated interactions, even if users initially rate it highly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's constructive feedback and recommendation for major revision. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we have revised the manuscript to address the concerns raised.

Referee: Methods section on taxonomy development: the paper provides no details on how the 10 empathic tactics were derived, validated against external frameworks, or assessed for inter-rater reliability. This is load-bearing for the central claim because, without evidence that the taxonomy was constructed independently of the LLM responses under study, the reported 83-90% template match rates risk being tautological (i.e., the scheme fits the data from which it was likely extracted).

Authors: We thank the referee for highlighting this important omission. The taxonomy was developed through a grounded theory approach starting with a qualitative analysis of a subset of human-written responses to establish the 10 tactics, drawing on established psychological literature on empathic communication. LLM responses were analyzed only after the taxonomy was fixed. We have now expanded the Methods section to include a full description of this process, including the initial coding scheme, how tactics were refined, and a comparison to external frameworks. Additionally, we report inter-rater reliability from two independent coders on a sample of 200 responses (Cohen's kappa = 0.82). We believe this addresses the concern of tautology by demonstrating the taxonomy's independence and validity.
Revision: yes
Referee: Results section on template matching and held-out sample: the drop to 60-83% match on the held-out sample is presented as evidence of generalizability, but the manuscript does not specify how the held-out set was constructed, whether the taxonomy was frozen prior to its application, or what controls were used for response length, topic, and prompt construction. These omissions directly affect whether the diversity contrast with human responses can be interpreted as substantive rather than an artifact of the classification scheme.

Authors: We agree that additional methodological details are necessary for interpreting the held-out results. The held-out sample consisted of 20% of the total responses (randomly selected after taxonomy development), with the taxonomy frozen prior to application to this set. We have added this information to the Results section. Regarding controls, all responses were generated or collected using matched prompts and topics across AI and human conditions, and we now report analyses controlling for response length by subsampling to equivalent token distributions. These additions clarify that the lower match rate in the held-out set reflects generalizability rather than overfitting, and the diversity differences persist under these controls.
Revision: yes
Referee: Study design and annotation: no statistical tests for the reported percentages, no inter-annotator agreement metrics, and no comparison to established empathy taxonomies from psychology are described. This weakens the assertion that human responses are 'more diverse' rather than simply containing functional moves absent from the 10-tactic set.

Authors: We acknowledge these gaps in the original submission. We have added statistical tests (chi-squared tests for differences in tactic usage and template adherence rates, all p < 0.001) to the Results section. Inter-annotator agreement metrics are now reported as noted in the response to the first comment. Furthermore, we have included a new subsection comparing our taxonomy to established ones in psychology, such as the one proposed by Davis (1983) on empathy dimensions and more recent discourse-functional analyses. This comparison shows that our 10 tactics cover core elements while human responses exhibit greater variability in sequencing and combination, supporting the diversity claim beyond just absent moves.
Revision: yes
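As a rough illustration of the kind of test mentioned in this response, a chi-squared test of independence on template adherence (matched vs. not matched, AI vs. human) can be computed directly from a 2x2 table of counts. The counts below are invented and are not the paper's data. The sketch computes only the Pearson statistic, without a continuity correction, and compares it with 3.841, the critical value for one degree of freedom at alpha = 0.05.

```python
def chi_squared_2x2(table):
    """Pearson chi-squared statistic for a 2x2 contingency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of source and template match.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Invented counts: rows = AI, human; columns = template matched, not matched.
counts = [[85, 15], [40, 60]]
stat = chi_squared_2x2(counts)

CRITICAL_DF1_ALPHA_05 = 3.841  # chi-squared critical value, df = 1, alpha = 0.05
print(round(stat, 1), stat > CRITICAL_DF1_ALPHA_05)  # 43.2 True
```

The test treats responses as independent observations; if several responses share a prompt or come from the same model, a model that accounts for that clustering would be more appropriate.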

Circularity Check

0 steps flagged

No significant circularity in empirical taxonomy application

The paper reports an empirical study that develops a taxonomy of 10 empathic tactics and applies it to annotate and compare n = 3,265 LLM and n = 1,290 human responses, identifying a common template sequence that matches 83-90% of LLM outputs. No equations, parameter fits, derivations, or self-citation chains appear in the provided text that would reduce the match/coverage percentages to inputs by construction. The held-out sample (60-83% match) supplies an independent check, and the central claims rest on direct counting and contrast with human diversity rather than tautological redefinition or renaming of known results. While taxonomy construction could in principle introduce bias, the manuscript does not exhibit any of the enumerated circular patterns (self-definitional, fitted-input prediction, load-bearing self-citation, etc.), making the analysis self-contained against its own data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the newly developed 10-tactic taxonomy provides an exhaustive and neutral lens for discourse analysis of empathy; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: The taxonomy of 10 empathic language tactics is a valid and sufficiently complete categorization for both AI-generated and human-written responses.
The paper develops and applies this taxonomy as the basis for template discovery without external validation metrics reported in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1351 out tokens · 44399 ms · 2026-05-10T17:54:01.263620+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations
cs.HC 2026-06 unverdicted novelty 6.0

AI companion platforms' hidden response policies in vulnerable conversations are inferred via a new taxonomy and IRL on 48k turns, revealing avoidance of corrective friction.

When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

Sycophancy is a boundary failure between social alignment and epistemic integrity, captured by a three-condition framework plus taxonomy of targets, mechanisms, and severity.