arxiv: 2604.18800 · v1 · submitted 2026-04-20 · 💻 cs.SI · cs.GT · cs.LG


Optimal Exploration of New Products under Assortment Decisions

Jackie Baek, Atanas Dinev, Thodoris Lykouris


Pith reviewed 2026-05-10 02:50 UTC · model grok-4.3

classification 💻 cs.SI · cs.GT · cs.LG

keywords assortment optimization · online learning · social learning · new product exploration · regret minimization · bandit algorithms · platform operations · exploration-exploitation


The pith

It is always optimal to pair new products with top incumbent products in assortments, and the number explored simultaneously follows a threshold on their potential independent of individual purchase probabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how platforms learn the quality of newly listed products through assortment choices. Quality becomes known only when customers buy and review the new item, but these items sell at lower rates than established ones, so exploration reduces immediate revenue. The analysis shows that despite this sales penalty, the regret-minimizing policy always places new products alongside the best incumbent ones rather than offering them in isolation. When several new products are available, the number to explore simultaneously follows a threshold rule: it rises with the new products' "potential" and does not depend on how likely each one is to be bought on its own. Standard bandit methods fail here for opposite reasons: UCB over-explores, while Thompson Sampling under-explores.

Core claim

In a setting where quality information arrives only through purchases that generate reviews, the optimal policy for minimizing long-run regret always includes the highest-revenue incumbent products with each new product being explored. For multiple new products the optimal batch size follows a simple threshold that increases with the new products' overall potential and is independent of their separate purchase probabilities. UCB over-explores while Thompson Sampling under-explores, so neither yields the optimal assortment sequence.

What carries the argument

The social-learning process in which a purchase of a new product produces a review that reveals its quality to the platform and all future customers, used inside a capacity-constrained assortment decision that minimizes cumulative regret.

If this is right

Pairing each new product with the top incumbent products is strictly better for regret than offering the new product alone or with weaker incumbents.
The number of new products to explore together can be computed from their potential alone, without needing their separate purchase probabilities.
Neither UCB nor Thompson Sampling produces the optimal sequence of assortments, so platforms require a tailored policy.
The threshold structure gives a simple, computable rule for deciding how many new items to feature at once.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms could implement this policy by maintaining a running estimate of each new product's potential and adjusting the assortment batch size accordingly.
The independence from individual purchase probabilities may simplify data requirements but assumes the platform can accurately assess potential from limited early signals.
If reviews are noisy or only partially informative, the optimality of pairing might change and would need separate analysis.
The same logic could be tested in other constrained-choice settings such as dynamic pricing or recommendation where information arrives only through costly actions.

Load-bearing premise

Reviews after a purchase fully reveal the new product's true quality to the platform and every future customer, and new products always have lower demand than incumbent ones.

What would settle it

Running the platform's assortment problem in simulation or on historical data and finding that the regret-minimizing policy ever offers a new product without the top incumbents, or that the chosen number of simultaneous new products changes with their individual purchase probabilities, would falsify the claims.

Original abstract

We study online learning for new products on a platform that makes capacity-constrained assortment decisions on which products to offer. For a newly listed product, its quality is initially unknown, and quality information propagates through social learning: when a customer purchases a new product and leaves a review, its quality is revealed to both the platform and future customers. Since reviews require purchases, the platform must feature new products in the assortment ("explore") to generate reviews to learn about new products. Such exploration is costly because customer demand for new products is lower than for incumbent products. We characterize the optimal assortments for exploration to minimize regret, addressing two questions. (1) Should the platform offer a new product alone or alongside incumbent products? The former maximizes the purchase probability of the new product but yields lower short-term revenue. Despite the lower purchase probability, we show it is always optimal to pair the new product with the top incumbent products. (2) With multiple new products, should the platform explore them simultaneously or one at a time? We show that the optimal number of new products to explore simultaneously has a simple threshold structure: it increases with the "potential" of the new products and, surprisingly, does not depend on their individual purchase probabilities. We also show that two canonical bandit algorithms, UCB and Thompson Sampling, both fail in this setting for opposite reasons: UCB over-explores while Thompson Sampling under-explores. Our results provide structural insights on how platforms should learn about new products through assortment decisions.
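Question (1) in the abstract can be illustrated with a small numerical sketch. The demand model below is a multinomial logit (MNL) with a fixed outside option, which is an assumption taken from the simulated rebuttal rather than from the abstract; the product names, attraction weights, and revenues are hypothetical values chosen only for illustration. The sketch shows the tension the abstract describes: offering the new product alone raises its purchase probability, while pairing it with incumbents raises expected per-customer revenue. It does not compute regret, so it does not by itself establish which assortment is optimal.

```python
def mnl_choice_probs(weights, outside_weight=1.0):
    """MNL purchase probability of each offered product.

    weights: dict mapping product name -> attraction weight (hypothetical).
    outside_weight: attraction of the no-purchase option (assumed fixed).
    """
    denom = outside_weight + sum(weights.values())
    return {name: w / denom for name, w in weights.items()}


def expected_revenue(weights, revenues, outside_weight=1.0):
    """Expected revenue per customer for the offered assortment."""
    probs = mnl_choice_probs(weights, outside_weight)
    return sum(probs[name] * revenues[name] for name in weights)


# Hypothetical attraction weights and per-sale revenues (illustrative only).
incumbents = {"inc_a": 2.0, "inc_b": 1.5}
new_product = {"new": 0.5}
revenues = {"inc_a": 10.0, "inc_b": 8.0, "new": 9.0}

# Assortment 1: the new product on its own.
alone = dict(new_product)
# Assortment 2: the new product alongside the incumbents.
paired = {**incumbents, **new_product}

p_new_alone = mnl_choice_probs(alone)["new"]
p_new_paired = mnl_choice_probs(paired)["new"]
rev_alone = expected_revenue(alone, revenues)
rev_paired = expected_revenue(paired, revenues)

print(f"P(buy new | alone)  = {p_new_alone:.3f}, revenue = {rev_alone:.2f}")
print(f"P(buy new | paired) = {p_new_paired:.3f}, revenue = {rev_paired:.2f}")
```

With these hypothetical numbers, the new product's purchase probability drops from about 0.33 to 0.10 when paired, while expected revenue per customer rises from 3.00 to 7.30.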

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript studies online learning for new products on a platform making capacity-constrained assortment decisions. New product quality is initially unknown and revealed to the platform and future customers via reviews triggered by purchases (social learning). Exploration is costly due to lower demand for new products versus incumbents. The authors characterize optimal assortments to minimize regret and address two questions: (1) it is always optimal to pair a new product with top incumbent products rather than offering it alone; (2) the optimal number of new products to explore simultaneously has a threshold structure that increases with their 'potential' and, surprisingly, does not depend on individual purchase probabilities. They further show that UCB over-explores while Thompson Sampling under-explores.

Significance. If the structural results hold under the stated model, the paper offers useful insights for e-commerce platforms on balancing short-term revenue losses against long-term learning gains via assortment decisions. The independence of the exploration threshold from purchase probabilities is a non-obvious finding that could simplify practical implementation. The demonstration that canonical bandit algorithms fail for opposite reasons underscores the need for problem-specific policies when capacity constraints and review-based learning are present. The work is strengthened by its focus on a realistic social-learning mechanism and capacity limits.

major comments (2)

[Abstract and §3] Abstract and main characterization of pairing (likely §3): the claim that it is always optimal to pair the new product with top incumbents despite lower purchase probability rests on the revenue-regret tradeoff separating cleanly. The stress-test concern is valid here—the separation may fail at boundary parameters (very low new-product potential or tight capacity). The manuscript must state the precise conditions and provide the key steps in the proof showing why the tradeoff remains separable.
[§4 and DP formulation] Threshold structure for simultaneous exploration (likely §4 and main DP formulation): the result that the optimal number depends only on 'potential' and is independent of individual purchase probabilities is load-bearing. Purchase probability governs both immediate revenue loss and the rate of quality revelation. The derivation must be shown to rely on perfect one-shot revelation and separable choice probabilities (e.g., independent or logit with fixed outside option) so that the value-of-information term factors linearly and cancels in the threshold condition. Robustness to noisy reviews or non-separable choice models should be discussed, as the skeptic notes this independence may not hold generally.

minor comments (2)

[Model section] The term 'potential' of the new products is used in the threshold result but is not defined in the abstract; it should be introduced with a precise mathematical definition in the model section.
[Numerical results] Any numerical examples or figures illustrating the threshold structure would benefit from explicit sensitivity checks varying purchase probabilities to confirm the claimed independence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the need to clarify assumptions and proof details in our structural results. We address each major comment below and will incorporate clarifications in the revised manuscript.


Referee: [Abstract and §3] Abstract and main characterization of pairing (likely §3): the claim that it is always optimal to pair the new product with top incumbents despite lower purchase probability rests on the revenue-regret tradeoff separating cleanly. The stress-test concern is valid here—the separation may fail at boundary parameters (very low new-product potential or tight capacity). The manuscript must state the precise conditions and provide the key steps in the proof showing why the tradeoff remains separable.

Authors: We agree that the separability of the short-term revenue gain from incumbents and the long-term regret reduction from learning must be made explicit, particularly near boundaries. The proof in §3 establishes that pairing is optimal whenever the new product's potential exceeds the threshold at which exploration has positive value (derived from the capacity constraint and the incumbent quality gap); below this threshold, no exploration occurs. The tradeoff separates because the immediate revenue loss from displacing an incumbent is independent of the new product's purchase probability in the regret calculation, while the information gain scales with it. We will add the precise condition (potential above the minimum exploration threshold) to the abstract and §3, and include the key algebraic steps of the separability argument in an expanded proof appendix. revision: yes

Referee: [§4 and DP formulation] Threshold structure for simultaneous exploration (likely §4 and main DP formulation): the result that the optimal number depends only on 'potential' and is independent of individual purchase probabilities is load-bearing. Purchase probability governs both immediate revenue loss and the rate of quality revelation. The derivation must be shown to rely on perfect one-shot revelation and separable choice probabilities (e.g., independent or logit with fixed outside option) so that the value-of-information term factors linearly and cancels in the threshold condition. Robustness to noisy reviews or non-separable choice models should be discussed, as the skeptic notes this independence may not hold generally.

Authors: The threshold result relies on perfect one-shot revelation (quality revealed fully upon first purchase) and a separable choice model (MNL with fixed outside option), which makes the value-of-information term linear in purchase probability p; this linearity causes p to cancel when comparing the net value of exploring k versus k+1 new products, leaving only the potential parameter. We will expand the DP formulation and derivation in §4 to explicitly state these assumptions and show the cancellation step. We acknowledge that the independence does not necessarily extend to noisy reviews or non-separable models with p-dependent substitution; we will add a limitations paragraph discussing these cases and noting them as directions for future work. revision: partial
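The role of purchase probability as the rate of quality revelation can be made concrete in a toy setting. The sketch below assumes a fixed assortment, independent customers, and one-shot revelation at the first purchase, so the number of customers served before quality is revealed is geometric with mean 1/p. The function names, trial count, seed, and values of p are hypothetical, and this is not the paper's DP formulation or its cancellation argument; it only shows how exploration length scales with p.

```python
import random


def periods_until_first_purchase(p, rng):
    """Number of customers served up to and including the first purchase.

    p: per-customer purchase probability of the new product (hypothetical).
    """
    periods = 1
    # Each customer buys the new product with probability p.
    while rng.random() >= p:
        periods += 1
    return periods


def mean_exploration_length(p, trials=20000, seed=0):
    """Monte Carlo estimate of the expected time until quality is revealed."""
    rng = random.Random(seed)
    total = sum(periods_until_first_purchase(p, rng) for _ in range(trials))
    return total / trials


for p in (0.1, 0.3):
    simulated = mean_exploration_length(p)
    print(f"p={p}: simulated mean {simulated:.2f}, geometric mean 1/p = {1 / p:.2f}")
```

Under these assumptions, a constant per-customer revenue loss during exploration would accumulate in proportion to 1/p, which is one way to see why the referee expects purchase probability to matter and why the claimed independence calls for an explicit derivation.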

Circularity Check

0 steps flagged

No significant circularity; structural results derived from model optimization


The paper sets up a regret-minimization problem for assortment decisions under social learning with unknown new-product qualities. The claimed results (always pair new products with top incumbents; threshold structure for simultaneous exploration independent of individual purchase probabilities) are obtained by solving the resulting dynamic program or characterizing the optimal policy. These are mathematical consequences of the stated demand model, review revelation process, and capacity constraints rather than reductions to fitted inputs, self-definitions, or self-citation chains. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results are present in the abstract or described claims. The derivation is self-contained against the model's primitives.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim rests on the social-learning model via reviews and the assumption of strictly lower demand for new products; no free parameters or invented entities are visible in the abstract.

axioms (2)

domain assumption: Quality of a new product is fully revealed to the platform and future customers upon a single purchase and review.
Stated directly in the abstract as the propagation mechanism.
domain assumption: Customer demand for new products is lower than for incumbent products.
Used to establish the exploration cost.

pith-pipeline@v0.9.0 · 5577 in / 1275 out tokens · 49702 ms · 2026-05-10T02:50:00.897714+00:00 · methodology

