Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management

Mingyang Zhao; Runze Zhang; Xiaowei Zhang

Large language models (LLMs) are increasingly used to simulate human behavior in business, economics, and the social sciences, offering a low-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how well LLMs replicate human behavior in operations management. Using nine published behavioral-operations experiments, we assess LLM performance along two dimensions: whether LLM-generated data reproduce the original hypothesis-test outcomes, and whether their full response distributions align with human data, measured by Wasserstein distance. We find that LLMs often replicate hypothesis-level effects, suggesting that they can capture salient decision biases and behavioral regularities. However, their response distributions frequently diverge from human data, even for strong proprietary models, with dispersion mismatch playing an important role. We also examine two lightweight mitigation strategies: chain-of-thought prompting and hyperparameter tuning. Both can reduce distributional misalignment, and appropriate tuning can sometimes allow smaller or open-source models to match or outperform larger proprietary systems.
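As a rough illustration of the distributional comparison described above, the sketch below computes the one-dimensional Wasserstein-1 distance between two empirical samples by integrating the absolute difference of their empirical CDFs. The function name `wasserstein_1d` and the toy "human" and "LLM" samples are hypothetical and are not taken from the paper; the paper's actual estimator and data may differ.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D empirical samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and
    F_y are the empirical CDFs of the two samples.
    """
    xs, ys = sorted(xs), sorted(ys)
    # Both CDFs are step functions that only change at observed values.
    points = sorted(set(xs) | set(ys))
    dist = 0.0
    for left, right in zip(points, points[1:]):
        fx = bisect_right(xs, left) / len(xs)  # share of xs <= left
        fy = bisect_right(ys, left) / len(ys)  # share of ys <= left
        dist += abs(fx - fy) * (right - left)
    return dist


# Hypothetical toy data: same mean, different dispersion.
human = [40, 50, 60]
llm = [50, 50, 50]
print(wasserstein_1d(human, llm))
```

In this toy case both samples have the same mean, so a test on means would see no difference, yet the distance is nonzero because the simulated responses are less dispersed than the human ones. This is one way to picture the gap between hypothesis-level replication and distributional alignment.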
