pith. sign in

arxiv: 2605.29249 · v1 · pith:73ZPQZQGnew · submitted 2026-05-28 · 📊 stat.ML · cs.LG

Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

Pith reviewed 2026-06-29 05:58 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords prediction-powered inferencemulti-task inferenceproxy measurementsconfidence intervalsAI evaluationstatistical powerrecalibrationnonlinear structure
0
0 comments X

The pith

Multi-task prediction-powered inference uses cross-task recalibration to improve power from scarce labels across related tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to perform valid statistical inference on many related tasks when only a few ground-truth labels are available for each. It borrows information from related tasks by recalibrating the relationship between cheap proxies and expensive labels, while keeping estimates specific to each task. This approach is shown to work only when the proxy-label relationship has nonlinear shared patterns across tasks, as linear adjustments offer no extra benefit over using the proxy alone. The result matters because it allows tighter confidence intervals in settings like evaluating AI models on many prompts or surveying opinions on many related questions.

Core claim

We introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy.

What carries the argument

Cross-task recalibration of proxies using labels from related tasks to exploit shared nonlinear structure while preserving task-specific validity.

If this is right

  • Scarce labels per task can still yield precise task-specific estimates by using data from related tasks.
  • No asymptotic efficiency gain from cross-task methods if the relationship is linear.
  • The framework applies to auditing language models on multiple election-related topics using human annotations.
  • Experiments on synthetic data confirm the theoretical distinction between linear and nonlinear cases.
  • Point estimates and intervals remain valid for each individual task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could test the method on tasks with varying degrees of relatedness to map when the nonlinear structure appears.
  • This approach might reduce the total labeling budget needed for comprehensive AI safety evaluations across many behaviors.
  • In social science, it could allow finer-grained analysis of survey responses across demographics without additional data collection.

Load-bearing premise

The proxy and ground-truth labels share nonlinear structure across tasks that cross-task recalibration can use while keeping each task's inference valid.

What would settle it

Finding no reduction in confidence interval width from the multi-task method compared to single-task power-tuned PPI in a setting where the proxy-ground-truth map is known to be linear across tasks.

Figures

Figures reproduced from arXiv: 2605.29249 by Chara Podimata, Ellery Stahler, Nicolas Emmenegger.

Figure 1
Figure 1. Figure 1: Effects of using our methods (GREPPI , AREPPI ) on a real dataset with human annotations. The plots here are shown without power-tuning (see discussion in Section 4). We treat gpt-5-mini predictions as surrogate outcomes, and human annotations as ground truth. Our methods achieve sharper CIs thanks to cross-task sharing of information. systematically biased. Outside of the machine learning and AI communiti… view at source ↗
Figure 2
Figure 2. Figure 2: Semi-synthetic data experiment, with gpt-5-mini as ground truth and gpt-4o-mini as the annotator. For plots (a)-(b), we treat PPI++ as the reference, and plot all other methods in a relative sense. We evaluate CLASSICAL , PPI++ , REPPI , GREPPI and AREPPI with local power tuning. Lemma 3.1 (Affine invariance). Let s ∈ Haffine, that is sˆ(Yˆ ) = aYˆ + b and let id : Yˆ 7→ Yˆ . Then, V ⋆ (ˆs) = V ⋆ (id). Pro… view at source ↗
Figure 3
Figure 3. Figure 3: Same setup as Figure 2. We illustrate the effects of power tuning’s underestimation of [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of auxiliary-task heterogeneity for the three broadest synthetic task families. Top [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of auxiliary-task heterogeneity for the two most homogeneous synthetic task families. [PITH_FULL_IMAGE:figures/full_fig_p024_5.png] view at source ↗
read the original abstract

Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited, ground-truth labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces a multi-task extension of prediction-powered inference (PPI) that leverages labeled data from related tasks via cross-task recalibration to improve power for task-specific inference while preserving validity through within-task rectification and power tuning. It proves that efficiency gains beyond power-tuned PPI require nonlinear structure in the proxy-ground-truth relationship and that affine cross-task recalibrations are asymptotically equivalent to the original proxy. Theoretical results are supported by synthetic and semi-synthetic experiments plus a case study auditing language models on 2024 election-related information using human annotations.

Significance. If the central claims hold, the work is significant for AI evaluation and social-science applications that require valid inference across many related tasks with scarce labels. The proof that gains are possible only under nonlinear structure, together with the negative result on affine recalibrations, supplies a clear boundary condition that is useful for practitioners. The retention of within-task rectification directly supports task-specific validity. The real-world case study adds practical value. The combination of theory with reproducible-style experiments on semi-synthetic data is a strength.

minor comments (3)
  1. Abstract: the phrase 'power-tuned PPI' is used without a one-sentence reminder of its definition or a citation to the base method, which reduces accessibility for readers outside the immediate PPI literature.
  2. §5 (experiments): the description of how the semi-synthetic data were constructed to satisfy the nonlinear-structure condition could be expanded with one additional sentence or a small table listing the functional forms used.
  3. Figure 4 (case study): axis labels on the confidence-interval-width plot are slightly compressed; increasing font size or adding a short caption note would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, including the recognition of its significance for AI evaluation and social-science applications, the value of the theoretical boundary conditions, and the practical utility of the case study. We appreciate the recommendation for minor revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper extends PPI with a multi-task recalibration framework and supplies an asymptotic proof that efficiency gains require nonlinear proxy-ground-truth structure while affine recalibrations are equivalent to the base proxy. These results are derived from standard statistical arguments (asymptotics, task-specific rectification) that remain independent of the paper's own fitted quantities or self-citations. No step reduces by construction to a parameter fit or prior self-citation; the derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that related tasks share exploitable structure in the proxy-ground-truth mapping; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Related tasks share a common structure in the proxy-ground-truth relationship that cross-task recalibration can exploit.
    Invoked to justify the multi-task framework and the claim of efficiency gains.

pith-pipeline@v0.9.1-grok · 5798 in / 1198 out tokens · 30861 ms · 2026-06-29T05:58:35.540551+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Valid Inference with Synthetic Data via Task Exchangeability

    stat.ME 2026-06 unverdicted novelty 6.0

    Proposes task exchangeability as a condition for valid inference when using synthetic data in scientific research, with methods and extensions demonstrated on surveys and AI evaluations.

Reference graph

Works this paper leans on

18 extracted references · 8 canonical work pages · cited by 1 Pith paper · 3 internal anchors

  1. [1]

    PPI++: Efficient Prediction-Powered Inference

    Anastasios N Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I Jordan, and Tijana Zrnic. Prediction-powered inference.Science, 382(6671):669–674, 2023a. Anastasios N Angelopoulos, John C Duchi, and Tijana Zrnic. PPI++: Efficient prediction-powered inference.arXiv preprint arXiv:2311.01453, 2023b. Julia Angwin, Alondra Nelson, and Rina Palta. Seeking...

  2. [2]

    Foundational challenges in assuring alignment and safety of large language models.arXiv preprint arXiv:2404.09932,

    Usman Anwar, Abulhair Saparov, Javier Rando, Daniel Paleka, Miles Turpin, Peter Hase, Ekdeep Singh Lubana, Erik Jenner, Stephen Casper, Oliver Sourbut, et al. Foundational challenges in assuring alignment and safety of large language models.arXiv preprint arXiv:2404.09932,

  3. [3]

    Large-scale, longitudinal study of large language models during the 2024 US election season

    Sarah H Cen, Andrew Ilyas, Hedi Driss, Charlotte Park, Aspen Hopkins, Chara Podimata, et al. Large-scale, longitudinal study of large language models during the 2024 US election season. arXiv preprint arXiv:2509.18446,

  4. [4]

    Precise model benchmarking with only a few observations

    Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, and Mathew Monfort. Precise model benchmarking with only a few observations. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9563–9575,

  5. [5]

    Another look at statistical inference with machine learning-imputed data.arXiv preprint arXiv:2411.19908,

    Jessica Gronsbell, Jianhui Gao, Zachary R McCaw, Yaqi Shi, and David Cheng. Another look at statistical inference with machine learning-imputed data.arXiv preprint arXiv:2411.19908,

  6. [6]

    Predictions as Surrogates: Revisiting Surrogate Outcomes in the Age of AI

    Wenlong Ji, Lihua Lei, and Tijana Zrnic. Predictions as surrogates: Revisiting surrogate outcomes in the age of AI.arXiv preprint arXiv:2501.09731,

  7. [7]

    PPI is the difference estimator: Recognizing the survey sampling roots of prediction- powered inference.arXiv preprint arXiv:2603.19160,

    Reagan Mozer. PPI is the difference estimator: Recognizing the survey sampling roots of prediction- powered inference.arXiv preprint arXiv:2603.19160,

  8. [8]

    Discovering language model behaviors with model-written evaluations

    Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics: ACL 2023, pages 13387–13434,

  9. [9]

    Demystifying prediction powered inference

    Yilin Song, Dan M Kluger, Harsh Parikh, and Tian Gu. Demystifying prediction powered inference. arXiv preprint arXiv:2601.20819,

  10. [10]

    Calibeating Prediction-Powered Inference

    Lars van der Laan and Mark van der Laan. Calibeating prediction-powered inference.arXiv preprint arXiv:2604.21260,

  11. [11]

    Meta-theorizing framing in communication research (1992–2022): Toward academic silos or professionalized specialization?Journal of Communication, 74(2): 101–116,

    Dror Walter and Yotam Ophir. Meta-theorizing framing in communication research (1992–2022): Toward academic silos or professionalized specialization?Journal of Communication, 74(2): 101–116,

  12. [12]

    Can LLMs ex- press their uncertainty? An empirical evaluation of confidence elicitation in LLMs

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs ex- press their uncertainty? An empirical evaluation of confidence elicitation in LLMs. InInternational Conference on Learning Representations, volume 2024, pages 23650–23678,

  13. [13]

    Fix a recalibration map s and write the shorthands i :=s(X i)

    15 A Missing Derivations and Proofs A.1 Derivation of Oracle VarianceV(s, λ) We derive the variance functional used in Section 2.4. Fix a recalibration map s and write the shorthands i :=s(X i). Letλ∈Rbe the scalar power-tuning coefficient. Using the shorthands ¯YL := 1 n X i∈L Yi,¯s L := 1 n X i∈L si,¯s N := 1 N X i∈[N] si, the power-tuned estimator from...

  14. [14]

    This can also be rewritten as Var[ˆZ] = 1 n 1− n N S2 Z whereS 2 Z = 1 N−1 X i∈[N] (Zi − ¯Z)2 is the sample variance

    we can express Var[¯ZL] = 1 n 1− n−1 N−1 σ2 Z, whereσ 2 Z = 1 N P i∈[N](Zi − ¯Z)2. This can also be rewritten as Var[ˆZ] = 1 n 1− n N S2 Z whereS 2 Z = 1 N−1 X i∈[N] (Zi − ¯Z)2 is the sample variance. We will approximate the latter with the plug-in estimate ˆS2 Z = 1 n−1 X i∈L (Zi − ¯ZL)2, that satisfies S2 Z =E[ ˆS2 Z] under simple random sampling (e.g.,...

  15. [15]

    Substituting back and writing ρ2 N(Y, s) := Cov N(Y, s)2/(VarN(Y) VarN(s)) for the finite- population squared correlation, we get V ⋆(s) = 1 n 1− n N S2 Y 1−ρ 2 N(Y, s)

    The minimizer is therefore λ⋆(s) = CovN(Y, s) VarN(s) . Substituting back and writing ρ2 N(Y, s) := Cov N(Y, s)2/(VarN(Y) VarN(s)) for the finite- population squared correlation, we get V ⋆(s) = 1 n 1− n N S2 Y 1−ρ 2 N(Y, s) . 16 A.2 Derivations for the Superpopulation Case The variance expression in Appendix A.1 differs from the standard superpopulation ...

  16. [16]

    [2025], finite-population, mean-estimation adaptation) Require:Task datasetD (t) = ( ˆY (t) i , Y (t) i , O(t) i )N i=1, recalibration classH

    Algorithm 3REPPI (Ji et al. [2025], finite-population, mean-estimation adaptation) Require:Task datasetD (t) = ( ˆY (t) i , Y (t) i , O(t) i )N i=1, recalibration classH. 1:Randomly splitL (t) evenly into two foldsAandB. 2:For eachF∈ {A, B}, fitˆs F ∈ Hon{( ˆY (t) i , Y (t) i )}i∈F so thatˆsF ( ˆY (t))≈E N[Y (t) | ˆY (t)].▷ collapses Steps 2–3 of Ji et al...

  17. [17]

    Algorithm 2)

    • AREPPI :r (t) i =Y (t) i − ˆλoof L(t) u(t) i (c.f. Algorithm 2). A few remarks. The SE plug-ins above treat ˆλ and ˆsas fixed, resulting in bias in small samples, and we empirically diagnose its small-n effect on coverage in Fig. 3 (a)–(d) when power tuning is used. We note in the experiments that power tuning is often not necessary if tasks are suffici...

  18. [18]

    endogenous

    Thus, for any given model, there is one fixed response for a given query. To focus our analysis on the most relevant factors, we systematically prune the models, base questions, and prompt variations from the original dataset. Models (M):We restrict our attention to the offline versions of three prominent models: GPT-4o, Claude-3.5-Sonnet, and Gemini-1.0-...