Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI

Lihua Lei; Tijana Zrnic; Wenlong Ji

arXiv: 2501.09731 · v2 · pith:ZKAQEUDXnew · submitted 2025-01-16 · stat.ML · cs.LG

keywords: outcomes, existing, inference, learning, loss, machine, predictions, proposals

We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal "imputed loss" through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
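To make the idea of recalibration concrete, here is a minimal sketch for the simplest target, a population mean. It is an illustration under stated assumptions, not the authors' algorithm: the function names `ppi_mean` and `recalibrated_ppi_mean`, the polynomial recalibration map, the two-fold cross-fitting scheme, and the simulated data are all hypothetical choices made here. The first function is the familiar PPI-style estimate (mean of predictions on unlabeled data plus a bias correction from labeled data); the second first learns a map from predictions to outcomes on part of the labeled data and then uses the recalibrated predictions as the surrogate.

```python
import numpy as np


def ppi_mean(y_lab, f_lab, f_unlab):
    """PPI-style mean estimate: unlabeled prediction mean plus a
    bias correction computed on the labeled sample."""
    return float(f_unlab.mean() + (y_lab - f_lab).mean())


def recalibrated_ppi_mean(y_lab, f_lab, f_unlab, degree=2, n_folds=2, seed=0):
    """Hypothetical recalibrated variant: fit g(f) ~ E[Y | f] on held-in
    labeled folds, then use g(f) in place of the raw prediction f.
    Cross-fitting keeps g independent of the fold used for correction."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_lab)), n_folds)
    estimates, weights = [], []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Recalibration step (assumed form): polynomial regression of Y on f.
        coefs = np.polyfit(f_lab[train_idx], y_lab[train_idx], deg=degree)
        g_unlab = np.polyval(coefs, f_unlab)
        g_test = np.polyval(coefs, f_lab[test_idx])
        estimates.append(g_unlab.mean() + (y_lab[test_idx] - g_test).mean())
        weights.append(len(test_idx))
    # Combine fold-wise estimates, weighting by held-out fold size.
    return float(np.average(estimates, weights=weights))


# Simulated example (assumed data-generating process): the prediction f is
# informative but systematically mis-scaled and shifted relative to Y.
rng = np.random.default_rng(123)
f_lab = rng.normal(size=500)
y_lab = 1.0 + 2.0 * f_lab + 0.5 * rng.normal(size=500)
f_unlab = rng.normal(size=20000)

print("labeled-only mean:     ", float(y_lab.mean()))
print("PPI mean:              ", ppi_mean(y_lab, f_lab, f_unlab))
print("recalibrated PPI mean: ", recalibrated_ppi_mean(y_lab, f_lab, f_unlab))
```

In this toy setup the raw residual `Y - f` still carries most of the signal in `f`, so the bias-correction term is noisy; after recalibration the residual is close to pure noise, which is the intuition behind the efficiency gain described in the abstract. The sketch does not reproduce the paper's guarantee of never doing worse than the labeled-only estimator.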

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Pandora's Box for Contextual LLM Cascading
cs.AI · 2026-06 · unverdicted · novelty 7.0

Introduces a parametric reservation-index policy with GMM estimation and UCB exploration for contextual LLM cascading under output-mediated feedback, claiming dimension-dependent square-root regret.

Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research
stat.ML · 2026-05 · unverdicted · novelty 7.0

Multi-task PPI framework uses cross-task recalibration to improve inference power across related tasks, with a proof that gains require nonlinear proxy-ground-truth structure, shown on synthetic data and a 2024 electi...

Calibeating Prediction-Powered Inference
stat.ML · 2026-04 · unverdicted · novelty 7.0

Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.

Allocating Human Oversight in AI-Enabled Analytics
cs.LG · 2026-04 · unverdicted · novelty 7.0

An adaptive budget allocation algorithm for LLM-augmented surveys learns question-level LLM reliability on the fly from human labels and reduces labeling waste from 10-12% to 2-6% compared to uniform allocation.

Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards
math.ST · 2025-06 · unverdicted · novelty 7.0

The MLA-UCB algorithm uses ML-generated surrogate rewards from auxiliary data to provably lower cumulative regret in multi-armed bandits, achieving asymptotic optimality under joint Gaussian assumptions without requir...

Valid Inference with Synthetic Data via Task Exchangeability
stat.ME · 2026-06 · unverdicted · novelty 6.0

Proposes task exchangeability as a condition for valid inference when using synthetic data in scientific research, with methods and extensions demonstrated on surveys and AI evaluations.

On prediction-powered inference for quantile regression via convolution smoothing
stat.ME · 2026-06 · unverdicted · novelty 6.0

Introduces convolution smoothing of the check-loss for prediction-powered quantile regression, derives asymptotics under misspecification, and proposes an ensemble estimator.

Optimized Labeling Resource Allocation for Prediction-Assisted Inference via OPAL
stat.ME · 2026-06 · unverdicted · novelty 6.0

OPAL learns optimal smooth labeling policies from ML uncertainty scores to enable low-variance prediction-assisted inference with finite-sample coverage guarantees.

Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts
stat.ME · 2026-05 · unverdicted · novelty 5.0

A framework models proxy-primary outcome discrepancies as random effects at the parameter level, estimated from aggregated historical observations to calibrate inferences under distribution shifts.

Active Hypothesis Testing under Computational Budgets with Applications to GWAS and LLM
stat.ME · 2025-12 · unverdicted · novelty 5.0

Active hypothesis testing framework uses auxiliary statistics for data-adaptive budget allocation to produce valid p-values or e-values with optimality under independence and admissibility under dependence.

Semiparametric semi-supervised learning for general targets under distribution shift and decaying overlap
math.ST · 2025-05 · unverdicted · novelty 5.0

Introduces D2S3 semiparametric framework that extends AIPW estimators to semi-supervised settings with MAR labeling, distribution shift, and decaying overlap, supplying corrected asymptotic rates instead of root-n con...

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation
cs.AI · 2026-05 · unverdicted · novelty 3.0

GLIDE is a Python library that packages multiple PPI estimators and samplers for reliable GenAI evaluation and reports annotation savings in an agentic case study.