RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Jingtong Wu; Jun Wang; Linghua Zhang; Zhisong Zhang

arXiv: 2603.16453 · v2 · pith:3ZHM62E7new · submitted 2026-03-17 · cs.AI




keywords agents · long-horizon · retailbench · policy · decision · decision-making · environments · evaluating


Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play
cs.CL · 2026-06 · unverdicted · novelty 7.0

MAFP applies fictitious play to LLM multi-agent systems to resolve stance entanglement in competitive decision-making, outperforming single-round and multi-round baselines on tournament strength and robustness.