pith. sign in

arxiv: 2606.07392 · v1 · pith:642MKCFKnew · submitted 2026-06-05 · 💻 cs.AI · cs.LG· econ.EM· stat.ML

Online Pandora's Box for Contextual LLM Cascading

Pith reviewed 2026-06-27 21:57 UTC · model grok-4.3

classification 💻 cs.AI cs.LGecon.EMstat.ML
keywords online learningPandora's boxLLM cascadingreservation indexGMM estimationUCB boundsregret boundsAPI selection
0
0 comments X

The pith

Parametric modeling of reservation indices lets an online policy select LLM APIs with dimension-dependent sqrt(T) regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an online contextual Pandora's Box model for LLM cascading in which a decision maker observes a context, sequentially queries APIs that reveal outputs at output-dependent costs, and finally deploys one output while seeing only its downstream reward. Instead of estimating full conditional output and cost distributions, the approach imposes a parametric structure on the contextual reservation index functions induced by Weitzman's policy and estimates them with GMM-type methods paired with UCB-style confidence bounds. Under regularity conditions the resulting policy achieves dimension-dependent tilde O(sqrt(T)) cumulative regret over T periods. A sympathetic reader would care because the method supplies a practical route to cost-aware adaptive querying without requiring complete distributional knowledge of each API.

Core claim

By imposing a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy and combining GMM-type estimation of these indices with UCB-style confidence bounds for both the indices and the shared output-level reward evaluator, the policy achieves dimension-dependent tilde O(sqrt(T)) cumulative regret over a horizon of T periods in the online contextual Pandora's Box problem with output-mediated feedback.

What carries the argument

Parametric contextual reservation index functions induced by Weitzman's policy, estimated via GMM-type methods with UCB bounds.

If this is right

  • The policy avoids estimating full conditional output and cost distributions for each API.
  • It handles the output-mediated feedback structure where only the deployed output's reward is observed.
  • It produces dimension-dependent regret bounds that scale with context dimension rather than the full distribution complexity.
  • It enables adaptive sequential querying and final selection in LLM cascading settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parametric reservation-index approach could be tested on other costly information-acquisition problems that share the Pandora structure.
  • If the parametric form fits real LLM output distributions, the method would reduce the data needed to route requests across multiple models.
  • Empirical runs on production LLM traffic could check whether the observed regret tracks the predicted sqrt(T) scaling.

Load-bearing premise

The contextual reservation index functions admit a parametric structure amenable to GMM-type estimation without requiring full conditional output and cost distributions.

What would settle it

Deploy the learned policy on a sequence of T requests drawn from the same context distribution and measure whether cumulative regret grows like O(sqrt(T)) rather than faster.

read the original abstract

Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period a context is observed; APIs are queried sequentially (revealing outputs and incurring output-dependent costs); one output is then selected and only its downstream reward is observed. Rather than estimating full conditional output/cost distributions, the authors impose a parametric structure on the contextual reservation-index functions induced by Weitzman's policy, estimate them via GMM, combine the estimates with UCB-style confidence bounds on both the indices and a shared output-level reward evaluator, and claim that the resulting policy achieves dimension-dependent ilde O(\sqrt T) cumulative regret over T periods under regularity conditions.

Significance. If the regret bound and the supporting estimation arguments hold, the work supplies a dimension-dependent regret guarantee for a practically relevant feedback structure (output-mediated selection) that differs from classical Pandora's Box models. The direct parametric modeling of reservation indices rather than full distributions is a potentially useful reduction for LLM cascading applications.

major comments (2)
  1. [Abstract] Abstract: the claim that the policy achieves ilde O(\sqrt T) regret is stated without any derivation, proof sketch, or verification that the parametric assumption plus GMM-UCB combination actually delivers the rate; the central claim therefore cannot be assessed from the provided text.
  2. [Abstract] Abstract (and the GMM estimation step): the observed reward is available only for the ultimately selected output, whose selection depends on the realized outputs and the (estimated) reservation indices. This induces endogenous selection into the reward observations. Standard GMM moment conditions derived from the joint (output, cost, reward) distribution will generally be biased unless the paper explicitly constructs selection-corrected moments; no such correction is mentioned, which would render the estimator inconsistent and invalidate the subsequent UCB-based regret analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the structure of our results and estimation procedure.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the policy achieves ilde O(√T) regret is stated without any derivation, proof sketch, or verification that the parametric assumption plus GMM-UCB combination actually delivers the rate; the central claim therefore cannot be assessed from the provided text.

    Authors: The abstract summarizes the main result. The full proof of the dimension-dependent ilde O(√T) regret bound, including verification that the parametric reservation-index model combined with GMM estimation and UCB bounds delivers the rate under the stated regularity conditions, appears in Sections 4 and 5 of the manuscript. We can add a concise proof sketch to the introduction for improved accessibility. revision: partial

  2. Referee: [Abstract] Abstract (and the GMM estimation step): the observed reward is available only for the ultimately selected output, whose selection depends on the realized outputs and the (estimated) reservation indices. This induces endogenous selection into the reward observations. Standard GMM moment conditions derived from the joint (output, cost, reward) distribution will generally be biased unless the paper explicitly constructs selection-corrected moments; no such correction is mentioned, which would render the estimator inconsistent and invalidate the subsequent UCB-based regret analysis.

    Authors: The GMM step estimates parameters of the contextual reservation-index functions using moment conditions constructed from observed outputs and costs in the query phase (revealed unconditionally). The reward, observed only for the finally selected output, is used exclusively in a separate UCB component for the output-level reward evaluator. Consequently the GMM moments do not involve the reward and are unaffected by the endogenous selection induced by the output-mediated feedback. The regret analysis in Section 5 explicitly accounts for this feedback structure. revision: no

Circularity Check

0 steps flagged

No significant circularity; standard GMM/UCB applied to new contextual model

full rationale

The derivation imposes a parametric structure on reservation indices induced by classical Weitzman's policy, then applies GMM-type estimation and UCB bounds to obtain a dimension-dependent regret bound under regularity conditions. No step reduces a claimed prediction to a fitted quantity by the paper's own equations, no self-citation chain bears the central result, and the model is self-contained against external benchmarks (Weitzman policy and standard GMM/UCB). The selection-bias concern raised by the skeptic is a potential consistency issue, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate free parameters or invented entities; the parametric structure on reservation indices is a modeling choice whose details are not supplied.

axioms (1)
  • domain assumption Regularity conditions sufficient for the \tilde O(\sqrt T) regret bound to hold
    Invoked in the abstract to support the main theorem but not specified.

pith-pipeline@v0.9.1-grok · 5758 in / 1230 out tokens · 23135 ms · 2026-06-27T21:57:19.936932+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

125 extracted references · 9 linked inside Pith

  1. [1]

    Extended abstract in the Proceedings of the 17th ACM Conference on Electronic Commerce (EC’16) , year=

    Descending price coordinates approximately efficient search , author=. Extended abstract in the Proceedings of the 17th ACM Conference on Electronic Commerce (EC’16) , year=

  2. [2]

    the Annals of Probability , pages=

    On tail probabilities for martingales , author=. the Annals of Probability , pages=. 1975 , publisher=

  3. [3]

    International Conference on Machine Learning , pages=

    Provably optimal algorithms for generalized linear contextual bandits , author=. International Conference on Machine Learning , pages=. 2017 , organization=

  4. [4]

    Advances in Neural Information Processing Systems , volume=

    Learning in generalized linear contextual bandits with stochastic delays , author=. Advances in Neural Information Processing Systems , volume=

  5. [5]

    User-friendly tail bounds for matrix martingales , author=

  6. [6]

    The Annals of Statistics , volume=

    Optimal aggregation of classifiers in statistical learning , author=. The Annals of Statistics , volume=. 2004 , publisher=

  7. [7]

    Stochastic Systems , volume=

    A linear response bandit problem , author=. Stochastic Systems , volume=. 2013 , publisher=

  8. [8]

    Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=

    Contextual bandits with linear payoff functions , author=. Proceedings of the fourteenth international conference on artificial intelligence and statistics , pages=. 2011 , organization=

  9. [9]

    Econometrica , volume=

    Who should be treated? empirical welfare maximization methods for treatment choice , author=. Econometrica , volume=. 2018 , publisher=

  10. [10]

    Self-normalized processes: exponential inequalities, moment bounds and iterated logarithm laws , author=

  11. [11]

    Handbook of econometrics , volume=

    Large sample estimation and hypothesis testing , author=. Handbook of econometrics , volume=. 1994 , publisher=

  12. [12]

    Foundations of computational mathematics , volume=

    User-friendly tail bounds for sums of random matrices , author=. Foundations of computational mathematics , volume=. 2012 , publisher=

  13. [13]

    Management Science , volume=

    Mostly exploration-free algorithms for contextual bandits , author=. Management Science , volume=. 2021 , publisher=

  14. [14]

    Econometrica , volume=

    Generalized method of moments with many weak moment conditions , author=. Econometrica , volume=. 2009 , publisher=

  15. [15]

    Advances in neural information processing systems , volume=

    Parametric bandits: The generalized linear case , author=. Advances in neural information processing systems , volume=

  16. [16]

    Advances in Neural Information Processing Systems , volume=

    A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits , author=. Advances in Neural Information Processing Systems , volume=

  17. [17]

    Mathematics of Operations Research , volume=

    Linearly parameterized bandits , author=. Mathematics of Operations Research , volume=. 2010 , publisher=

  18. [18]

    Electronic Communications in Probability , number =

    Joel Tropp , title =. Electronic Communications in Probability , number =. 2011 , doi =

  19. [19]

    Advances in Neural Information Processing Systems , volume=

    Scalable generalized linear bandits: Online computation and hashing , author=. Advances in Neural Information Processing Systems , volume=

  20. [20]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Contextual pandora’s box , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  21. [21]

    arXiv preprint arXiv:2506.00886 , year=

    Toward a Theory of Agents as Tool-Use Decision-Makers , author=. arXiv preprint arXiv:2506.00886 , year=

  22. [22]

    arXiv preprint arXiv:2507.21046 , year=

    A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence , author=. arXiv preprint arXiv:2507.21046 , year=

  23. [23]

    Vector quantile regression: an optimal transport approach , author=

  24. [24]

    The Eleventh International Conference on Learning Representations , year=

    Fast Nonlinear Vector Quantile Regression , author=. The Eleventh International Conference on Learning Representations , year=

  25. [25]

    2011 , publisher=

    Super learner based conditional density estimation with application to marginal structural models , author=. 2011 , publisher=

  26. [26]

    Electronic Journal of Statistics , volume=

    Minimax optimal conditional density estimation under total variation smoothness , author=. Electronic Journal of Statistics , volume=. 2022 , publisher=

  27. [27]

    arXiv preprint arXiv:2507.17951 , year=

    Are LLM Belief Updates Consistent with Bayes' Theorem? , author=. arXiv preprint arXiv:2507.17951 , year=

  28. [28]

    SIAM journal on computing , volume=

    The nonstochastic multiarmed bandit problem , author=. SIAM journal on computing , volume=. 2002 , publisher=

  29. [29]

    arXiv preprint arXiv:2506.16658 , year=

    Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards , author=. arXiv preprint arXiv:2506.16658 , year=

  30. [30]

    Science , volume=

    Prediction-powered inference , author=. Science , volume=. 2023 , publisher=

  31. [31]

    arXiv preprint arXiv:2311.01453 , year=

    Ppi++: Efficient prediction-powered inference , author=. arXiv preprint arXiv:2311.01453 , year=

  32. [32]

    arXiv preprint arXiv:2501.09731 , year=

    Predictions as surrogates: Revisiting surrogate outcomes in the age of ai , author=. arXiv preprint arXiv:2501.09731 , year=

  33. [33]

    Advances in Neural Information Processing Systems , volume=

    Contextual combinatorial multi-armed bandits with volatile arms and submodular reward , author=. Advances in Neural Information Processing Systems , volume=

  34. [34]

    Advances in Neural Information Processing Systems , volume=

    Nearly tight bounds for the continuum-armed bandit problem , author=. Advances in Neural Information Processing Systems , volume=

  35. [35]

    International Conference on Artificial Intelligence and Statistics , pages=

    Contextual combinatorial volatile multi-armed bandit with adaptive discretization , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2020 , organization=

  36. [36]

    arXiv preprint arXiv:2110.02248 , year=

    Contextual combinatorial bandits with changing action sets via gaussian processes , author=. arXiv preprint arXiv:2110.02248 , year=

  37. [37]

    IEEE transactions on information theory , volume=

    Information-theoretic regret bounds for gaussian process optimization in the bandit setting , author=. IEEE transactions on information theory , volume=. 2012 , publisher=

  38. [38]

    Machine learning , volume=

    Regret bounds for sleeping experts and bandits , author=. Machine learning , volume=. 2010 , publisher=

  39. [39]

    Artificial Intelligence and Statistics , pages=

    Sleeping experts and bandits with stochastic action availability and adversarial rewards , author=. Artificial Intelligence and Statistics , pages=. 2009 , organization=

  40. [40]

    International Conference on Computational Learning Theory , pages=

    Online geometric optimization in the bandit setting against an adaptive adversary , author=. International Conference on Computational Learning Theory , pages=. 2004 , organization=

  41. [41]

    International Conference on Machine Learning , pages=

    Improved sleeping bandits with stochastic action sets and adversarial rewards , author=. International Conference on Machine Learning , pages=. 2020 , organization=

  42. [42]

    Algorithmic Learning Theory , pages=

    Adversarial online learning with changing action sets: Efficient algorithms with approximate regret bounds , author=. Algorithmic Learning Theory , pages=. 2021 , organization=

  43. [43]

    Advances in Neural Information Processing Systems , volume=

    Model selection in contextual stochastic bandit problems , author=. Advances in Neural Information Processing Systems , volume=

  44. [44]

    Advances in Neural Information Processing Systems , volume=

    Adapting to misspecification in contextual bandits , author=. Advances in Neural Information Processing Systems , volume=

  45. [45]

    arXiv preprint arXiv:2409.18909 , year=

    Best Arm Identification with Minimal Regret , author=. arXiv preprint arXiv:2409.18909 , year=

  46. [46]

    Advances in Neural Information Processing Systems , volume=

    Fast and regret optimal best arm identification: Fundamental limits and low-complexity algorithms , author=. Advances in Neural Information Processing Systems , volume=

  47. [47]

    arXiv preprint arXiv:2402.10592 , year=

    Optimizing adaptive experiments: A unified approach to regret minimization and best-arm identification , author=. arXiv preprint arXiv:2402.10592 , year=

  48. [48]

    Advances in Neural Information Processing Systems , volume=

    Proportional response: Contextual bandits for simple and cumulative regret minimization , author=. Advances in Neural Information Processing Systems , volume=

  49. [49]

    arXiv preprint arXiv:2211.12004 , year=

    Contextual bandits in a survey experiment on charitable giving: Within-experiment outcomes versus policy learning , author=. arXiv preprint arXiv:2211.12004 , year=

  50. [50]

    2020 , publisher=

    Bandit algorithms , author=. 2020 , publisher=

  51. [51]

    Advances in neural information processing systems , volume=

    Linear contextual bandits with knapsacks , author=. Advances in neural information processing systems , volume=

  52. [52]

    21st Annual Conference on Learning Theory , number=

    Stochastic linear optimization under bandit feedback , author=. 21st Annual Conference on Learning Theory , number=

  53. [53]

    International Conference on Artificial Intelligence and Statistics , pages=

    Contextual bandits with stochastic experts , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2018 , organization=

  54. [54]

    arXiv preprint arXiv:2505.18828 , year=

    Improved Regret and Contextual Linear Extension for Pandora's Box and Prophet Inequality , author=. arXiv preprint arXiv:2505.18828 , year=

  55. [55]

    2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS) , pages=

    Semi-bandit learning for monotone stochastic optimization , author=. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS) , pages=. 2024 , organization=

  56. [56]

    Advances in neural information processing systems , volume=

    Improved algorithms for linear stochastic bandits , author=. Advances in neural information processing systems , volume=

  57. [57]

    , author=

    OPTIMAL SEARCH FOR THE BEST ALTERNATIVE. , author=. Econometrica , volume=

  58. [58]

    Conference on Learning Theory , pages=

    Generalizing complex hypotheses on product distributions: Auctions, prophet inequalities, and pandora’s problem , author=. Conference on Learning Theory , pages=. 2021 , organization=

  59. [59]

    Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages=

    Bandit algorithms for prophet inequality and pandora's box , author=. Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) , pages=. 2024 , organization=

  60. [60]

    Nature medicine , volume=

    Evaluation and mitigation of the limitations of large language models in clinical decision-making , author=. Nature medicine , volume=. 2024 , publisher=

  61. [61]

    NPJ Digital Medicine , volume=

    Large language model integrations in cancer decision-making: a systematic review and meta-analysis , author=. NPJ Digital Medicine , volume=. 2025 , publisher=

  62. [62]

    Nature medicine , volume=

    Large language models in medicine , author=. Nature medicine , volume=. 2023 , publisher=

  63. [63]

    Manufacturing & Service Operations Management , volume=

    A manager and an AI walk into a bar: does ChatGPT make biased decisions like we do? , author=. Manufacturing & Service Operations Management , volume=. 2025 , publisher=

  64. [64]

    Management Science , volume=

    Large language model in creative work: The role of collaboration modality and user expertise , author=. Management Science , volume=. 2024 , publisher=

  65. [65]

    AI in Supply Chains: Perspectives from Global Thought Leaders , pages=

    Large language models for supply chain decisions , author=. AI in Supply Chains: Perspectives from Global Thought Leaders , pages=. 2026 , publisher=

  66. [66]

    Proceedings of the 31st ACM International Conference on Multimedia , pages=

    Against opacity: Explainable ai and large language models for effective digital advertising , author=. Proceedings of the 31st ACM International Conference on Multimedia , pages=

  67. [67]

    Marketing Science , year=

    Applying large language models to sponsored search advertising , author=. Marketing Science , year=

  68. [68]

    Transactions on Machine Learning Research , year=

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance , author=. Transactions on Machine Learning Research , year=

  69. [69]

    arXiv preprint arXiv:2403.12031 , year=

    Routerbench: A benchmark for multi-llm routing system , author=. arXiv preprint arXiv:2403.12031 , year=

  70. [70]

    arXiv preprint arXiv:2306.02561 , year=

    Llm-blender: Ensembling large language models with pairwise ranking and generative fusion , author=. arXiv preprint arXiv:2306.02561 , year=

  71. [71]

    arXiv preprint arXiv:2402.04513 , year=

    Online cascade learning for efficient inference over streams , author=. arXiv preprint arXiv:2402.04513 , year=

  72. [72]

    ACM SIGKDD Explorations Newsletter , volume=

    Omnirouter: Budget and performance controllable multi-llm routing , author=. ACM SIGKDD Explorations Newsletter , volume=. 2025 , publisher=

  73. [73]

    Advances in Neural Information Processing Systems , volume=

    Cascade speculative drafting for even faster llm inference , author=. Advances in Neural Information Processing Systems , volume=

  74. [74]

    Advances in Neural Information Processing Systems , volume=

    Efficient contextual llm cascades through budget-constrained policy learning , author=. Advances in Neural Information Processing Systems , volume=

  75. [75]

    Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

    Llm-ensemble: Optimal large language model ensemble method for e-commerce product attribute value extraction , author=. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages=

  76. [76]

    Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI , pages=

    Efficient dynamic ensembling for multiple LLM experts , author=. Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI , pages=

  77. [77]

    The Twelfth International Conference on Learning Representations , year=

    Large Language Model Cascades with Mixture of Thought Representations for Cost-Efficient Reasoning , author=. The Twelfth International Conference on Learning Representations , year=

  78. [78]

    Advances in Neural Information Processing Systems , volume=

    AutoMix: Automatically mixing language models , author=. Advances in Neural Information Processing Systems , volume=

  79. [79]

    Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

    Orchestrallm: Efficient orchestration of language models for dialogue state tracking , author=. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) , pages=

  80. [80]

    Proceedings of the Eighteenth European Conference on Computer Systems , pages=

    Tabi: An efficient multi-level inference system for large language models , author=. Proceedings of the Eighteenth European Conference on Computer Systems , pages=

Showing first 80 references.