CASP: Support-Aware Offline Policy Selection for Two-Stage Recommender Systems
Pith reviewed 2026-05-08 10:03 UTC · model grok-4.3
The pith
CASP selects two-stage recommender policies by penalizing those that rely on weakly supported generator-item pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CASP combines doubly robust value estimation with a support-burden penalty to select from a finite library of two-stage policies. The paper shows that stagewise rules ignoring downstream continuation value can be arbitrarily suboptimal and derives population, finite-class, and reconstructed-propensity guarantees for conservative selection.
What carries the argument
The support-burden penalty in CASP, which quantifies a policy's reliance on generator-item pairs with low reconstructed propensity and is subtracted from the value estimate to enforce conservatism.
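Concretely, the selection rule this implies might look like the following minimal sketch. All names here (dr_value, support_burden, the policy interface, lam, tau) are illustrative assumptions, not the paper's notation or API.

```python
# Minimal sketch of support-aware selection in the spirit of CASP.
# Every name (dr_value, support_burden, the policy interface, lam, tau)
# is illustrative, not the paper's notation or API.

def dr_value(policy, logs):
    """Doubly robust value estimate from logged tuples
    (context x, shown item a, reward r, logging propensity p0)."""
    total = 0.0
    for x, a, r, p0 in logs:
        q_hat = policy.q_model(x, a)            # outcome-model prediction
        w = policy.prob(x, a) / max(p0, 1e-12)  # importance weight
        # direct term E_{a'~policy}[q(x, a')] plus a weighted residual correction
        total += policy.expected_q(x) + w * (r - q_hat)
    return total / len(logs)

def support_burden(policy, logs, tau=0.01):
    """One plausible burden measure: average mass the candidate policy
    places on logged pairs whose reconstructed propensity falls below tau."""
    burden = sum(policy.prob(x, a) for x, a, _, p0 in logs if p0 < tau)
    return burden / len(logs)

def casp_select(library, logs, lam=1.0):
    """Return the policy maximizing the penalized (pessimistic) estimate."""
    return max(library,
               key=lambda pi: dr_value(pi, logs) - lam * support_burden(pi, logs))
```

The key design choice is that the penalty is policy-specific: two policies with identical value estimates are separated by how much of their mass rests on weakly supported pairs.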
If this is right
- Selected policies will tend to have lower support burden when estimated value and data credibility are in tension.
- Two-stage systems require joint consideration of generator and ranker rather than optimizing stages independently.
- The approach provides finite-sample guarantees even when using reconstructed propensities instead of true ones.
- Simulations and MovieLens 1M experiments demonstrate selection of lower-burden policies when estimated value and support credibility are in tension.
Where Pith is reading between the lines
- Applying similar support penalties could improve policy selection in other sequential decision systems with partial observability.
- If support estimation is noisy, the method might need additional regularization to remain effective.
- Extending CASP to infinite policy classes would require new approximation techniques for the burden term.
Load-bearing premise
Support can be reliably measured using reconstructed propensities from a finite, pre-specified library of policies.
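One way to make this premise concrete, as a hedged sketch: assume the logged two-stage policy factorizes into a candidate-set stage and a conditional ranking stage, so an item's propensity can be reconstructed by marginalizing over candidate sets. The factorization and the generator/ranker interface below are our assumptions, not the paper's definitions.

```python
# Hedged sketch: reconstruct the logging propensity of an item under a
# two-stage policy by marginalizing over the generator's candidate sets.
# Assumed factorization: P(shown) = P(in candidate set) * P(ranked | set).

def reconstructed_propensity(item, context, generator, ranker, n_samples=1000):
    """Monte Carlo estimate of P(item shown | context) under the logged policy."""
    hits = 0.0
    for _ in range(n_samples):
        candidates = generator.sample_candidates(context)    # stage one
        if item in candidates:
            hits += ranker.prob(item, context, candidates)   # stage two
    return hits / n_samples
```

Under this reading, the premise amounts to requiring that both stages of the logging policy are known or well-estimated for every policy in the library.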
What would settle it
An experiment showing that a policy with high estimated value but high support burden outperforms the CASP-selected policy in a live deployment would indicate the penalty is overly conservative or miscalibrated.
Original abstract
Two-stage recommender systems first choose a candidate generator and then rank items within the generated set. Because the generator decides which items are available to the ranker, changing the generator changes both the policy value and the data support used to estimate that value. This creates an offline selection problem that standard single-stage objectives do not capture: a policy may look good under a retrieval score or a raw off-policy value estimate, but still be unreliable if it depends on weakly supported generator-item pairs. We propose CASP (Coupled Action-Set Pessimism), a support-aware offline selector for finite libraries of two-stage recommender policies. CASP combines doubly robust value estimation with a support-burden penalty. We show that stagewise rules that ignore downstream continuation value can be arbitrarily suboptimal, and we derive population, finite-class, and reconstructed-propensity guarantees for conservative selection. In simulations and a reconstructed MovieLens 1M application, CASP selects lower-burden policies when estimated value and support credibility are in tension.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CASP (Coupled Action-Set Pessimism), a method for offline selection among a finite library of two-stage recommender policies. It combines doubly robust value estimation with a support-burden penalty that accounts for the generator's effect on item support available to the ranker. The central claims are that stagewise selection rules ignoring continuation value can be arbitrarily suboptimal, and that CASP provides population, finite-class, and reconstructed-propensity guarantees enabling conservative selection when value estimates conflict with support credibility. Empirical results on simulations and a reconstructed MovieLens 1M dataset show CASP preferring lower-burden policies.
Significance. If the guarantees are valid, the work fills a gap in offline policy evaluation for two-stage systems by explicitly coupling generator-induced support with ranker value estimation. The pessimism mechanism could prevent selection of policies whose apparent performance relies on sparsely observed generator-item pairs. The finite-library assumption and reconstructed-propensity setting are realistic for many production recommenders, but the result's impact hinges on whether the support penalty reliably bounds true risk under realistic misspecification.
major comments (3)
- [§4] §4 (theoretical guarantees): the manuscript states population, finite-class, and reconstructed-propensity bounds but supplies no derivation steps, concentration inequalities, or proof sketches for how the support-burden penalty is subtracted from the doubly robust estimator to produce a conservative upper bound on true risk. Without these steps it is impossible to verify that the penalty protects against policies inflated by weakly supported generator-item pairs.
- [§3.2] §3.2 (reconstructed-propensity support estimation): the support penalty is defined using propensities reconstructed from the generator; however, no sensitivity analysis or error propagation bound is given showing how generator misspecification, data sparsity, or estimation variance in the reconstruction affects the pessimism term. If reconstruction error is moderate, the penalty may under-penalize high-value but low-support policies, violating the conservative guarantee.
- [§2] §2 (suboptimality of stagewise rules): the claim that stagewise rules ignoring downstream continuation value can be arbitrarily suboptimal is illustrated but lacks a general construction or worst-case example that applies beyond the specific two-stage setup; it is therefore unclear whether the suboptimality result is load-bearing for the need for CASP or merely motivational.
minor comments (2)
- [§3] Notation for the support-burden penalty and the reconstructed propensity should be introduced with explicit definitions and distinguished from standard propensity scores to avoid confusion with single-stage off-policy estimators.
- [§5] The MovieLens experiment description should specify how the generator and ranker policies were constructed from the data and how the finite library was enumerated, including any hyper-parameter choices for the reconstruction step.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We respond to each major comment below, indicating planned revisions to the manuscript where appropriate.
Point-by-point responses
-
Referee: [§4] §4 (theoretical guarantees): the manuscript states population, finite-class, and reconstructed-propensity bounds but supplies no derivation steps, concentration inequalities, or proof sketches for how the support-burden penalty is subtracted from the doubly robust estimator to produce a conservative upper bound on true risk. Without these steps it is impossible to verify that the penalty protects against policies inflated by weakly supported generator-item pairs.
Authors: The complete proofs for the population, finite-class, and reconstructed-propensity guarantees appear in the appendix. To improve accessibility, we will add concise proof sketches to Section 4 of the main text. These sketches will outline the decomposition of the doubly robust estimator, the subtraction of the support-burden penalty, and the concentration inequalities that establish the conservative upper bound on true risk. revision: yes
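For orientation, the guarantee such a sketch would formalize has roughly this shape (the notation is ours, not the paper's):

```latex
% Illustrative shape of a finite-class conservative guarantee (our notation):
% with probability at least 1 - \delta, simultaneously for every policy \pi
% in the finite library \Pi,
\[
  V(\pi) \;\ge\; \widehat{V}_{\mathrm{DR}}(\pi)
         \;-\; \lambda\,\widehat{B}(\pi)
         \;-\; C \sqrt{\frac{\log(|\Pi| / \delta)}{n}},
\]
% so the penalized estimate is a valid lower confidence bound on true value,
% and maximizing it over \Pi selects conservatively.
```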
-
Referee: [§3.2] §3.2 (reconstructed-propensity support estimation): the support penalty is defined using propensities reconstructed from the generator; however, no sensitivity analysis or error propagation bound is given showing how generator misspecification, data sparsity, or estimation variance in the reconstruction affects the pessimism term. If reconstruction error is moderate, the penalty may under-penalize high-value but low-support policies, violating the conservative guarantee.
Authors: This concern is well-taken. The current reconstructed-propensity guarantee assumes accurate reconstruction. In the revision we will add a sensitivity analysis to Section 3.2 that derives bounds on the effect of reconstruction error, data sparsity, and estimation variance on the pessimism term, under standard bounded-misspecification assumptions. This will clarify the conditions under which the conservative guarantee continues to hold. revision: yes
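As one illustration of the error-propagation step such an analysis would need (our notation, under assumed conditions, not a result from the paper):

```latex
% Illustrative propagation of reconstruction error (our notation): if the
% reconstructed propensity satisfies |\hat{\pi}_0 - \pi_0| \le \epsilon
% pointwise and both \pi_0, \hat{\pi}_0 \ge \tau > 0, then each
% inverse-propensity weight shifts by at most
\[
  \left| \frac{1}{\hat{\pi}_0} - \frac{1}{\pi_0} \right|
  = \frac{|\pi_0 - \hat{\pi}_0|}{\pi_0 \hat{\pi}_0}
  \le \frac{\epsilon}{\tau^2},
\]
% so both the DR term and the burden term are perturbed by
% O(\epsilon / \tau^2); the pessimism weight \lambda must absorb this slack
% for the conservative guarantee to survive moderate reconstruction error.
```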
-
Referee: [§2] §2 (suboptimality of stagewise rules): the claim that stagewise rules ignoring downstream continuation value can be arbitrarily suboptimal is illustrated but lacks a general construction or worst-case example that applies beyond the specific two-stage setup; it is therefore unclear whether the suboptimality result is load-bearing for the need for CASP or merely motivational.
Authors: The example in Section 2 is constructed to exhibit arbitrary suboptimality arising from the generator–ranker interaction that is central to two-stage systems. We view it as load-bearing for motivating CASP. We will revise the section to present the construction more explicitly as a worst-case within the two-stage policy class and add a short discussion clarifying its scope and relevance to the paper’s setting. revision: partial
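A toy instance of the kind of construction at issue (the numbers are ours, for illustration only, not the paper's example):

```python
# Toy worst case, illustrative only: a stagewise rule picks the generator by
# retrieval score alone and ignores what the ranker can extract from the
# resulting candidate set; the gap scales with M.

M = 1000.0  # suboptimality gap; can be made arbitrarily large

generators = {
    # name: (retrieval_score, values of the items in its candidate set)
    "A": (0.9, [1.0, 1.0]),  # best retrieval score, only low-value items
    "B": (0.8, [M, 0.0]),    # slightly worse score, one high-value item
}

# Stagewise rule: pick the generator with the best retrieval score.
stagewise = max(generators, key=lambda g: generators[g][0])
# Joint rule: pick the generator whose candidate set yields the best item.
joint = max(generators, key=lambda g: max(generators[g][1]))

print(max(generators[stagewise][1]))  # 1.0    (stagewise value)
print(max(generators[joint][1]))      # 1000.0 (joint value; gap grows with M)
```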
Circularity Check
No circularity: CASP's guarantees are derived from standard off-policy techniques and do not presuppose the quantities they certify
Full rationale
The paper derives population, finite-class, and reconstructed-propensity guarantees for conservative selection by combining doubly robust value estimation with an externally defined support-burden penalty. No equations reduce these guarantees to fitted inputs by construction, and the support penalty is not defined in terms of the value estimate itself. The finite-library assumption is stated explicitly as a precondition rather than smuggled in via self-citation or ansatz. The central claims rest on first-principles derivations for two-stage systems rather than renaming known results or importing uniqueness theorems from the authors' prior work. The analysis is a standard adaptation of offline RL techniques and remains self-contained, with no dependence on external benchmarks.