Optimal Exploration of New Products under Assortment Decisions
Pith reviewed 2026-05-10 02:50 UTC · model grok-4.3
The pith
It is always optimal to pair new products with top incumbent products in assortments, and the number explored simultaneously follows a threshold on their potential independent of individual purchase probabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a setting where quality information arrives only through purchases that generate reviews, the optimal policy for minimizing long-run regret always includes the highest-revenue incumbent products with each new product being explored. For multiple new products the optimal batch size follows a simple threshold that increases with the new products' overall potential and is independent of their separate purchase probabilities. UCB over-explores while Thompson Sampling under-explores, so neither yields the optimal assortment sequence.
What carries the argument
The social-learning process in which a purchase of a new product produces a review that reveals its quality to the platform and all future customers, used inside a capacity-constrained assortment decision that minimizes cumulative regret.
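To make the underlying tradeoff concrete, the following is a minimal sketch of purchase probabilities under a multinomial logit (MNL) choice model with a fixed outside option, which the simulated rebuttal further down names as the assumed choice model. The attraction weights, revenues, and the helper `mnl_probs` are hypothetical and chosen only for illustration; none of them is taken from the paper.

```python
from typing import List, Sequence


def mnl_probs(weights: Sequence[float], outside: float = 1.0) -> List[float]:
    """Purchase probability of each offered product under MNL.

    `outside` is the attraction weight of the no-purchase option.
    """
    denom = outside + sum(weights)
    return [w / denom for w in weights]


# Hypothetical inputs (assumptions, not values from the paper).
new_weight = 0.5                 # attraction weight of the new product
new_revenue = 6.0
incumbent_weights = [2.0, 1.5]   # the top incumbents
incumbent_revenues = [10.0, 8.0]

# Option A: offer the new product alone.
p_alone = mnl_probs([new_weight])[0]
rev_alone = p_alone * new_revenue

# Option B: offer the new product together with the top incumbents.
probs = mnl_probs([new_weight] + incumbent_weights)
p_paired = probs[0]
rev_paired = probs[0] * new_revenue + sum(
    p * r for p, r in zip(probs[1:], incumbent_revenues)
)

print(f"alone : P(buy new) = {p_alone:.3f}, expected revenue = {rev_alone:.2f}")
print(f"paired: P(buy new) = {p_paired:.3f}, expected revenue = {rev_paired:.2f}")
```

With these illustrative numbers, pairing lowers the new product's purchase probability from 1/3 to 1/10 while raising expected per-period revenue from 2.0 to 7.0. That is the tension the pairing result resolves; the sketch itself says nothing about which option minimizes regret.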
If this is right
- Pairing each new product with the top incumbent products is regret-optimal: offering the new product alone or with weaker incumbents never does better.
- The number of new products to explore together can be computed from their potential alone, without needing their separate purchase probabilities.
- Neither UCB nor Thompson Sampling produces the optimal sequence of assortments, so platforms require a tailored policy.
- The threshold structure gives a simple, computable rule for deciding how many new items to feature at once.
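For reference, the decision rules behind the two canonical algorithms named above can be sketched in their textbook Bernoulli-bandit forms. This is not the assortment-level variant analyzed in the paper; the function names and the uniform Beta(1, 1) prior are assumptions made for illustration.

```python
import math
import random


def ucb1_index(successes: int, pulls: int, t: int) -> float:
    """UCB1 index: empirical mean plus an optimism bonus that shrinks with pulls."""
    if pulls == 0:
        return math.inf  # an untried arm is always selected first
    return successes / pulls + math.sqrt(2.0 * math.log(t) / pulls)


def thompson_draw(successes: int, pulls: int, rng: random.Random) -> float:
    """One posterior sample for a Bernoulli arm under a uniform Beta(1, 1) prior."""
    failures = pulls - successes
    return rng.betavariate(1 + successes, 1 + failures)


rng = random.Random(0)

# Two arms with the same empirical mean (0.5) but different amounts of data.
rarely_tried = ucb1_index(successes=1, pulls=2, t=100)
well_known = ucb1_index(successes=50, pulls=100, t=100)
print(f"UCB1 index, 2 pulls:   {rarely_tried:.3f}")
print(f"UCB1 index, 100 pulls: {well_known:.3f}")

# Thompson Sampling picks the arm with the largest random posterior draw.
sample = thompson_draw(successes=1, pulls=2, rng=rng)
print(f"Thompson draw for the rarely tried arm: {sample:.3f}")
```

UCB1 adds a deterministic bonus to arms with little data, whereas Thompson Sampling explores only as often as its random posterior draws favor an arm. The sketch shows the mechanics only and does not reproduce the paper's over- and under-exploration results.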
Where Pith is reading between the lines
- Platforms could implement this policy by maintaining a running estimate of each new product's potential and adjusting the assortment batch size accordingly.
- The independence from individual purchase probabilities may simplify data requirements but assumes the platform can accurately assess potential from limited early signals.
- If reviews are noisy or only partially informative, the optimality of pairing might change and would need separate analysis.
- The same logic could be tested in other constrained-choice settings such as dynamic pricing or recommendation where information arrives only through costly actions.
Load-bearing premise
Reviews after a purchase fully reveal the new product's true quality to the platform and every future customer, and new products always have lower demand than incumbent ones.
What would settle it
Running the platform's assortment problem in simulation or on historical data and finding that the regret-minimizing policy ever offers a new product without the top incumbents, or that the chosen number of simultaneous new products changes with their individual purchase probabilities, would falsify the claims.
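One building block for such a simulation can be sketched as follows, assuming one customer per period, MNL choice with a fixed outside option, and perfect one-shot revelation (the exploration phase ends at the first purchase of the new product). The functions `explore_until_review` and `average_outcome`, and all weights and revenues, are hypothetical; the sketch does not implement the paper's optimal policy or its regret benchmark.

```python
import random


def explore_until_review(weights, revenues, rng, outside=1.0, max_periods=100_000):
    """Offer a fixed assortment until product 0 (the new product) is bought.

    weights[0] and revenues[0] belong to the new product; the rest are incumbents.
    Returns (periods elapsed, revenue collected during exploration).
    """
    total = outside + sum(weights)
    periods, revenue = 0, 0.0
    while periods < max_periods:
        periods += 1
        u = rng.random() * total
        cum = 0.0
        for i, w in enumerate(weights):
            cum += w
            if u < cum:
                revenue += revenues[i]
                if i == 0:
                    return periods, revenue  # review posted, quality revealed
                break
        # If no product matched, the customer took the outside option.
    return periods, revenue


def average_outcome(weights, revenues, runs=20_000, seed=0):
    """Monte Carlo averages of exploration length and revenue."""
    rng = random.Random(seed)
    results = [explore_until_review(weights, revenues, rng) for _ in range(runs)]
    mean_periods = sum(p for p, _ in results) / runs
    mean_revenue = sum(r for _, r in results) / runs
    return mean_periods, mean_revenue


# Hypothetical inputs: new product alone versus paired with two incumbents.
alone = average_outcome([0.5], [6.0])
paired = average_outcome([0.5, 2.0, 1.5], [6.0, 10.0, 8.0])
print("alone :", alone)
print("paired:", paired)
```

Under these assumptions the number of periods until revelation is geometric, so its mean is the reciprocal of the new product's per-period purchase probability. A full test of the claims would additionally need the regret benchmark and a search over assortment policies, neither of which is specified here.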
Original abstract
We study online learning for new products on a platform that makes capacity-constrained assortment decisions on which products to offer. For a newly listed product, its quality is initially unknown, and quality information propagates through social learning: when a customer purchases a new product and leaves a review, its quality is revealed to both the platform and future customers. Since reviews require purchases, the platform must feature new products in the assortment ("explore") to generate reviews to learn about new products. Such exploration is costly because customer demand for new products is lower than for incumbent products. We characterize the optimal assortments for exploration to minimize regret, addressing two questions. (1) Should the platform offer a new product alone or alongside incumbent products? The former maximizes the purchase probability of the new product but yields lower short-term revenue. Despite the lower purchase probability, we show it is always optimal to pair the new product with the top incumbent products. (2) With multiple new products, should the platform explore them simultaneously or one at a time? We show that the optimal number of new products to explore simultaneously has a simple threshold structure: it increases with the "potential" of the new products and, surprisingly, does not depend on their individual purchase probabilities. We also show that two canonical bandit algorithms, UCB and Thompson Sampling, both fail in this setting for opposite reasons: UCB over-explores while Thompson Sampling under-explores. Our results provide structural insights on how platforms should learn about new products through assortment decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies online learning for new products on a platform making capacity-constrained assortment decisions. New product quality is initially unknown and revealed to the platform and future customers via reviews triggered by purchases (social learning). Exploration is costly due to lower demand for new products versus incumbents. The authors characterize optimal assortments to minimize regret and address two questions: (1) it is always optimal to pair a new product with top incumbent products rather than offering it alone; (2) the optimal number of new products to explore simultaneously has a threshold structure that increases with their 'potential' and, surprisingly, does not depend on individual purchase probabilities. They further show that UCB over-explores while Thompson Sampling under-explores.
Significance. If the structural results hold under the stated model, the paper offers useful insights for e-commerce platforms on balancing short-term revenue losses against long-term learning gains via assortment decisions. The independence of the exploration threshold from purchase probabilities is a non-obvious finding that could simplify practical implementation. The demonstration that canonical bandit algorithms fail for opposite reasons underscores the need for problem-specific policies when capacity constraints and review-based learning are present. The work is strengthened by its focus on a realistic social-learning mechanism and capacity limits.
major comments (2)
- [Abstract and §3] Main characterization of pairing (likely §3): the claim that it is always optimal to pair the new product with top incumbents, despite the lower purchase probability, rests on the revenue-regret tradeoff separating cleanly. The stress-test concern is valid here: the separation may fail at boundary parameters (very low new-product potential or tight capacity). The manuscript must state the precise conditions and provide the key proof steps showing why the tradeoff remains separable.
- [§4 and DP formulation] Threshold structure for simultaneous exploration (likely §4 and the main dynamic-programming (DP) formulation): the result that the optimal number depends only on 'potential' and is independent of individual purchase probabilities is load-bearing. Purchase probability governs both the immediate revenue loss and the rate of quality revelation. The derivation must be shown to rely on perfect one-shot revelation and separable choice probabilities (e.g., independent or logit with fixed outside option) so that the value-of-information term factors linearly and cancels in the threshold condition. Robustness to noisy reviews or non-separable choice models should be discussed because, as the skeptic notes, this independence may not hold generally.
minor comments (2)
- [Model section] The term 'potential' of the new products is used in the threshold result but is not defined in the abstract; it should be introduced with a precise mathematical definition in the model section.
- [Numerical results] Any numerical examples or figures illustrating the threshold structure would benefit from explicit sensitivity checks varying purchase probabilities to confirm the claimed independence.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the need to clarify assumptions and proof details in our structural results. We address each major comment below and will incorporate clarifications in the revised manuscript.
- Referee: [Abstract and §3] Main characterization of pairing (likely §3): the claim that it is always optimal to pair the new product with top incumbents, despite the lower purchase probability, rests on the revenue-regret tradeoff separating cleanly. The stress-test concern is valid here: the separation may fail at boundary parameters (very low new-product potential or tight capacity). The manuscript must state the precise conditions and provide the key proof steps showing why the tradeoff remains separable.
Authors: We agree that the separability of the short-term revenue gain from incumbents and the long-term regret reduction from learning must be made explicit, particularly near boundaries. The proof in §3 establishes that pairing is optimal whenever the new product's potential exceeds the threshold at which exploration has positive value (derived from the capacity constraint and the incumbent quality gap); below this threshold, no exploration occurs. The tradeoff separates because the immediate revenue loss from displacing an incumbent is independent of the new product's purchase probability in the regret calculation, while the information gain scales with it. We will add the precise condition (potential above the minimum exploration threshold) to the abstract and §3, and include the key algebraic steps of the separability argument in an expanded proof appendix.
Revision: yes
- Referee: [§4 and DP formulation] Threshold structure for simultaneous exploration (likely §4 and the main dynamic-programming (DP) formulation): the result that the optimal number depends only on 'potential' and is independent of individual purchase probabilities is load-bearing. Purchase probability governs both the immediate revenue loss and the rate of quality revelation. The derivation must be shown to rely on perfect one-shot revelation and separable choice probabilities (e.g., independent or logit with fixed outside option) so that the value-of-information term factors linearly and cancels in the threshold condition. Robustness to noisy reviews or non-separable choice models should be discussed because, as the skeptic notes, this independence may not hold generally.
Authors: The threshold result relies on perfect one-shot revelation (quality revealed fully upon first purchase) and a separable choice model (MNL with fixed outside option), which makes the value-of-information term linear in the purchase probability p. This linearity causes p to cancel when comparing the net value of exploring k versus k+1 new products, leaving only the potential parameter. We will expand the DP formulation and derivation in §4 to state these assumptions explicitly and show the cancellation step. We acknowledge that the independence does not necessarily extend to noisy reviews or to non-separable models with p-dependent substitution; we will add a limitations paragraph discussing these cases and noting them as directions for future work.
Revision: partial
Circularity Check
No significant circularity; structural results derived from model optimization
The paper sets up a regret-minimization problem for assortment decisions under social learning with unknown new-product qualities. The claimed results (always pair new products with top incumbents; threshold structure for simultaneous exploration independent of individual purchase probabilities) are obtained by solving the resulting dynamic program or characterizing the optimal policy. These are mathematical consequences of the stated demand model, review revelation process, and capacity constraints rather than reductions to fitted inputs, self-definitions, or self-citation chains. No load-bearing self-citations, ansatzes smuggled via prior work, or renaming of known results are present in the abstract or described claims. The derivation is self-contained against the model's primitives.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: quality of a new product is fully revealed to the platform and future customers upon a single purchase and review.
- Domain assumption: customer demand for new products is lower than for incumbent products.