Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data

2025 · cs.CL · arXiv 2503.20749

8 Pith papers cite this work.

abstract

Recent research shows that LLM agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents focuses only on qualitative believability (whether human raters think they are accurate), leaving open the question of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also show that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
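The two metrics named in the abstract, action generation accuracy and F1 on final purchase prediction, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not say how actions are encoded or matched, so the sketch uses exact string match per step, and the function names and toy session data are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def action_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of steps where the predicted action exactly equals the logged one.

    Exact string match is an assumption; the paper may use a different criterion.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def purchase_f1(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Binary F1 over sessions, where True means the session ended in a purchase."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one four-step session and four session outcomes.
pred_actions = ["click:item_3", "search:shoes", "click:add_to_cart", "terminate"]
true_actions = ["click:item_3", "search:boots", "click:item_7", "terminate"]
pred_purchase = [True, False, True, False]
true_purchase = [True, True, False, False]

acc = action_accuracy(pred_actions, true_actions)
f1 = purchase_f1(pred_purchase, true_purchase)

# Gap between the two accuracy figures quoted in the abstract, in percentage points.
accuracy_gap = round(17.26 - 11.86, 2)

print(acc, f1, accuracy_gap)
```

The last computation shows that the 5.4% improvement quoted for action accuracy equals the absolute difference between the reported 17.26% and 11.86%.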

representative citing papers

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

LLMs match condition-level patterns in a noodle purchase survey but fail to replicate distributional structure, with no model beating a pooled human baseline for purchase quantities.

ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

cs.AI · 2026-05-15 · conditional · novelty 7.0

ShopGym introduces ShopArena to convert live storefronts into self-contained sandbox shops and ShopGuru to synthesize 224 benchmark tasks, with validation showing structural preservation and positive correlation of agent performance between synthetic and live shops.

Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce

cs.AI · 2026-06-11 · unverdicted · novelty 6.0

A modular two-agent simulation framework enables controlled comparison of conversational e-commerce responders, showing rolling-window memory outperforms intent extraction and targeted fixes reduce failures by 62%.

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

Recon scores reasoning traces via action reconstruction fidelity, achieving 54.7% win rate over post-hoc baselines and up to 70% when used to train synthesis models across four domains.

Simulating Human Memory with Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

Language models outperform humans on memory tasks from psychology experiments, but prompting and compaction can make their forgetting more human-like, yielding better user simulators.

SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

cs.AI · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

SimPersona induces a discrete buyer-type space from clickstreams via VQ-VAE, maps types to LLM persona tokens, fine-tunes agents on traces, and samples from merchant distributions to achieve 78% conversion-rate alignment on 42 held-out storefronts.

Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

In studies with real human subjects, AI transparency affects imperfectly cooperative interactions far more than personality traits do, unlike in simulations, where both are comparably influential.

Model-Free Assessment of Simulator Fidelity via Quantile Curves

stat.ME · 2025-12-04 · unverdicted · novelty 5.0

A model-free method builds confidence sets for latent parameters to proxy sim-to-real discrepancies and estimates the quantile function of that proxy to produce a distribution-level fidelity profile for simulators.

