Predicting Effects, Missing Distributions: Evaluating LLMs as Human Behavior Simulators in Operations Management
Large language models (LLMs) are increasingly used to simulate human behavior in business, economics, and the social sciences, offering a low-cost complement to laboratory experiments, field studies, and surveys. This paper evaluates how well LLMs replicate human behavior in operations management. Using nine published behavioral-operations experiments, we assess LLM performance along two dimensions: whether LLM-generated data reproduce the original hypothesis-test outcomes, and whether their full response distributions align with human data, measured by Wasserstein distance. We find that LLMs often replicate hypothesis-level effects, suggesting that they can capture salient decision biases and behavioral regularities. However, their response distributions frequently diverge from human data, even for strong proprietary models, with dispersion mismatch playing an important role. We also examine two lightweight mitigation strategies: chain-of-thought prompting and hyperparameter tuning. Both can reduce distributional misalignment, and appropriate tuning can sometimes allow smaller or open-source models to match or outperform larger proprietary systems.
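As an illustration of the distributional comparison described above, the following is a minimal sketch of an empirical order-1 Wasserstein distance between two samples of scalar responses. It assumes one-dimensional responses and the order-1 variant; the abstract does not say which variant or implementation the paper uses. The function name `wasserstein_1d` and the sample values are hypothetical and are not data from the study.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Empirical order-1 Wasserstein distance between two 1-D samples.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs. Sample sizes may differ.
    """
    xs, ys = sorted(xs), sorted(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")

    # Both empirical CDFs are constant between consecutive observed values.
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        fx = bisect_right(xs, left) / len(xs)  # share of xs <= left
        fy = bisect_right(ys, left) / len(ys)  # share of ys <= left
        total += abs(fx - fy) * (right - left)
    return total


# Hypothetical samples (NOT from the paper): same mean, different spread.
human_like = [40, 50, 60]
llm_like = [50, 50, 50]
distance = wasserstein_1d(human_like, llm_like)
```

In this toy case the two samples share a mean of 50, so a test on means would see no difference, yet the distance is positive because the second sample has no spread. That is one way a dispersion mismatch can coexist with a replicated hypothesis-level effect.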