Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. arXiv preprint arXiv:2501.09731

Wenlong Ji, Lihua Lei, Tijana Zrnic · 2025 · stat.ML · arXiv 2501.09731

8 Pith papers cite this work. Polarity classification is still indexing.


abstract

We establish a formal connection between the decades-old surrogate outcome model in biostatistics and economics and the emerging field of prediction-powered inference (PPI). The connection treats predictions from pre-trained models, prevalent in the age of AI, as cost-effective surrogates for expensive outcomes. Building on the surrogate outcomes literature, we develop recalibrated prediction-powered inference, a more efficient approach to statistical inference than existing PPI proposals. Our method departs from the existing proposals by using flexible machine learning techniques to learn the optimal "imputed loss" through a step we call recalibration. Importantly, the method always improves upon the estimator that relies solely on the data with available true outcomes, even when the optimal imputed loss is estimated imperfectly, and it achieves the smallest asymptotic variance among PPI estimators if the estimate is consistent. Computationally, our optimization objective is convex whenever the loss function that defines the target parameter is convex. We further analyze the benefits of recalibration, both theoretically and numerically, in several common scenarios where machine learning predictions systematically deviate from the outcome of interest. We demonstrate significant gains in effective sample size over existing PPI proposals via three applications leveraging state-of-the-art machine learning/AI models.
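To make the idea of recalibration concrete, here is a minimal sketch for the simplest target, a population mean. It is not the authors' implementation: the function names, the cubic polynomial used as the recalibration model, the two-fold cross-fitting scheme, and the synthetic data are all assumptions chosen for illustration, and the sketch omits any weighting the paper uses to guarantee an improvement over the labeled-only estimator. It compares a plain PPI-style estimate (raw predictions plus a bias correction) with one that first learns a map from predictions to outcomes on the labeled data.

```python
import numpy as np

def ppi_mean(y_lab, f_lab, f_unlab):
    """Plain PPI-style mean: average prediction on unlabeled data
    plus the average prediction error measured on labeled data."""
    return f_unlab.mean() + (y_lab - f_lab).mean()

def recalibrated_ppi_mean(y_lab, f_lab, f_unlab, n_folds=2, degree=3, seed=0):
    """Illustrative recalibration (hypothetical helper, not the paper's code).

    For each fold, fit a polynomial g(f) ~ E[Y | f] on the other folds,
    then use g(f) in place of the raw prediction f. Fitting g on data
    disjoint from the fold used for the bias correction keeps each
    fold-level estimate unbiased.
    """
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y_lab)) % n_folds  # balanced fold labels
    fold_estimates = []
    for k in range(n_folds):
        train, test = folds != k, folds == k
        coefs = np.polyfit(f_lab[train], y_lab[train], degree)
        g_unlab = np.polyval(coefs, f_unlab)       # recalibrated imputations
        g_test = np.polyval(coefs, f_lab[test])
        correction = (y_lab[test] - g_test).mean() # bias correction on held-out labels
        fold_estimates.append(g_unlab.mean() + correction)
    return float(np.mean(fold_estimates))

# Synthetic example (assumed setup): predictions f deviate systematically
# from the outcome, since Y = f**2 + noise. The true mean is E[f**2] = 4/3.
rng = np.random.default_rng(0)
n_lab, n_unlab = 500, 50_000
f_lab = rng.uniform(0.0, 2.0, n_lab)
y_lab = f_lab**2 + rng.normal(0.0, 0.1, n_lab)
f_unlab = rng.uniform(0.0, 2.0, n_unlab)

labeled_only = float(y_lab.mean())
ppi_est = float(ppi_mean(y_lab, f_lab, f_unlab))
recal_est = recalibrated_ppi_mean(y_lab, f_lab, f_unlab)

print("labeled only     :", labeled_only)
print("plain PPI        :", ppi_est)
print("recalibrated PPI :", recal_est)
```

In this setup the raw predictions are badly miscalibrated, so the bias-correction term in plain PPI stays noisy, whereas the recalibrated imputations track the outcome closely and leave only a small residual to correct. A flexible learner could replace the polynomial; the cubic is used here only to keep the sketch dependency-free.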

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

stat.ML · 2026-05-28 · unverdicted · novelty 7.0

Multi-task PPI framework uses cross-task recalibration to improve inference power across related tasks, with a proof that gains require nonlinear proxy-ground-truth structure, shown on synthetic data and a 2024 election LM audit case study.

Calibeating Prediction-Powered Inference

stat.ML · 2026-04-23 · unverdicted · novelty 7.0

Post-hoc calibration of miscalibrated black-box predictions on a labeled sample improves efficiency of prediction-powered inference for semisupervised mean estimation.

Multi-Armed Bandits With Machine Learning-Generated Surrogate Rewards

math.ST · 2025-06-20 · unverdicted · novelty 7.0

The MLA-UCB algorithm uses ML-generated surrogate rewards from auxiliary data to provably lower cumulative regret in multi-armed bandits, achieving asymptotic optimality under joint Gaussian assumptions without requiring knowledge of the true-surrogate covariance.

Estimate Level Adjustment For Inference With Proxies Under Random Distribution Shifts

stat.ME · 2026-05-07 · unverdicted · novelty 5.0

A framework models proxy-primary outcome discrepancies as random effects at the parameter level, estimated from aggregated historical observations to calibrate inferences under distribution shifts.

Active Hypothesis Testing under Computational Budgets with Applications to GWAS and LLM

stat.ME · 2025-12-01 · unverdicted · novelty 5.0

Active hypothesis testing framework uses auxiliary statistics for data-adaptive budget allocation to produce valid p-values or e-values with optimality under independence and admissibility under dependence.

Semiparametric semi-supervised learning for general targets under distribution shift and decaying overlap

math.ST · 2025-05-09 · unverdicted · novelty 5.0

Introduces D2S3 semiparametric framework that extends AIPW estimators to semi-supervised settings with MAR labeling, distribution shift, and decaying overlap, supplying corrected asymptotic rates instead of root-n convergence.

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

cs.AI · 2026-05-29 · unverdicted · novelty 3.0

GLIDE is a Python library that packages multiple PPI estimators and samplers for reliable GenAI evaluation and reports annotation savings in an agentic case study.

Allocating Human Oversight in AI-Enabled Analytics

cs.LG · 2026-04-14

citing papers explorer

Showing 1 of 1 citing paper after filters.

Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation

cs.AI · 2026-05-29 · unverdicted · none · ref 8 · internal anchor

GLIDE is a Python library that packages multiple PPI estimators and samplers for reliable GenAI evaluation and reports annotation savings in an agentic case study.
