pith. machine review for the scientific record.

arxiv: 2604.10770 · v1 · submitted 2026-04-12 · 💰 econ.EM

Recognition: unknown

Econometric Inference with Machine-Learned Proxies: Partial Identification via Data Combination

Lixiong Li

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💰 econ.EM
keywords machine learning proxies · partial identification · optimal transport · data combination · econometric inference · moment models · validation samples

The pith

Econometric models using machine-learned proxies can deliver sharp partial identification and valid inference by linking a main sample to an auxiliary validation sample through unconditional optimal transport.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a method for partial identification in general moment models when researchers employ upstream machine learning to create proxies for latent variables from complex data. It treats the proxy as a linking variable that connects a downstream sample containing observed covariates and the proxy with an auxiliary validation sample that jointly observes the proxy and the true target variable. The approach derives sharp bounds on the model parameters via an unconditional optimal transport characterization without imposing consistency or rate assumptions on the machine learning procedure. An inference procedure then uses analytical critical values to produce confidence sets that control asymptotic size without resampling. If correct, this framework lets applied researchers incorporate unstructured data sources into econometric analysis while avoiding the bias and invalid inference that arise from naive plug-in use of proxies.
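The data-combination idea can be made concrete with a stylized discrete sketch (our illustration, not the paper's estimator): bound a downstream moment E[f(X, Y*)] over all couplings consistent with the marginal of X (from the downstream sample) and the marginal of Y* (from the validation sample). The supports and probabilities below are hypothetical.

```python
def coupling_bound(p_x, x_vals, p_y, y_vals, f, anti=False):
    """North-west-corner coupling of two sorted discrete marginals.

    For supermodular f (e.g. f(x, y) = x * y), the comonotone matching
    attains the upper bound on E[f]; reversing one marginal (anti=True)
    attains the lower bound. A general f would need the full transport LP.
    """
    xs = sorted(zip(x_vals, p_x))
    ys = sorted(zip(y_vals, p_y), reverse=anti)
    i = j = 0
    px, py = xs[0][1], ys[0][1]
    total = 0.0
    while i < len(xs) and j < len(ys):
        m = min(px, py)          # move the largest feasible mass
        total += m * f(xs[i][0], ys[j][0])
        px -= m
        py -= m
        if px <= 1e-12:
            i += 1
            if i < len(xs):
                px = xs[i][1]
        if py <= 1e-12:
            j += 1
            if j < len(ys):
                py = ys[j][1]
    return total

# Hypothetical marginals: X from the downstream sample, Y* from validation.
p_x, x_vals = [0.5, 0.3, 0.2], [0.0, 1.0, 2.0]
p_y, y_vals = [0.6, 0.4], [0.0, 1.0]
f = lambda x, y: x * y
lo = coupling_bound(p_x, x_vals, p_y, y_vals, f, anti=True)
hi = coupling_bound(p_x, x_vals, p_y, y_vals, f)
print(f"E[X*Y*] partially identified in [{lo:.2f}, {hi:.2f}]")  # [0.00, 0.60]
```

The interval is the identified set for this single moment: every value inside it is attained by some coupling of the two marginals, and nothing outside is.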

Core claim

The paper claims that treating the machine-learned proxy as a linking variable between the downstream sample and the auxiliary validation sample yields a sharp identified set for general moment models via an unconditional optimal transport characterization, together with an inference procedure that achieves correct asymptotic size control using analytical critical values without any resampling.

What carries the argument

Unconditional optimal transport characterization of the joint distribution of the proxy and target variable that produces the tightest possible bounds on the downstream moments.

If this is right

  • Applied researchers obtain informative confidence sets for parameters in moment models even when the machine learning proxy has unknown predictive accuracy.
  • The method extends to any general moment model without requiring the machine learning procedure to be consistent or to have a known convergence rate.
  • Inference requires only analytical critical values and avoids the computational cost of bootstrap or other resampling methods.
  • Monte Carlo experiments confirm that the procedure maintains reliable size control and produces informative sets across a range of proxy accuracies.
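The "analytical critical values, no resampling" point amounts to confidence sets by test inversion: for each candidate parameter on a grid, compute a studentized statistic and compare it to a closed-form quantile. A generic one-dimensional sketch (not the paper's statistic or model) of that workflow:

```python
import math
import random

# Generic test-inversion sketch: E[y - theta] = 0, and the 95% confidence
# set collects every grid point theta with |t_n(theta)| <= z_{0.975}.
random.seed(0)
y = [random.gauss(1.0, 2.0) for _ in range(400)]
n = len(y)
mean = sum(y) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in y) / (n - 1))
z = 1.959964  # 97.5% standard-normal quantile: analytical, no bootstrap

grid = [i / 100 for i in range(-100, 301)]
cs = [t for t in grid
      if abs((mean - t) / (sd / math.sqrt(n))) <= z]
print(f"95% CS: [{min(cs):.2f}, {max(cs):.2f}]")
```

The paper's procedure inverts a test over a grid of (θ1, θ2) values in the same spirit (as in its Figure 1), with the critical value again available in closed form.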

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be extended to settings with multiple proxies or multiple validation samples to tighten bounds further.
  • Researchers might adapt the linking idea to panel or time-series data where the proxy appears in different periods.
  • This approach suggests a general template for combining any two datasets that share a common observed variable when direct merging is impossible.

Load-bearing premise

The proxy functions as a valid linking variable between the two samples and the unconditional optimal transport problem delivers sharp partial identification bounds without any further restrictions on the upstream machine learning procedure.

What would settle it

An empirical application or Monte Carlo design in which the true parameter value lies outside the reported confidence sets even though the proxy correctly links the samples and the data-generating process satisfies the paper's assumptions.

Figures

Figures reproduced from arXiv: 2604.10770 by Lixiong Li.

Figure 1
Figure 1: 95% Confidence Set across Different Sample Sizes. Notes: Each panel reports the empirical rejection frequency of the proposed test at the 5% level over a grid of 10,000 candidate values of (θ1, θ2), based on 500 Monte Carlo replications. The sieve function is defined in (30) and c_n = n^{2/5}. The implied 95% confidence set in each replication is the set of grid points not rejected by the test. Dark purple cor… view at source ↗
Figure 2
Figure 2: 95% Confidence Set across Prediction Noises. Notes: Each panel reports the empirical rejection frequency of the proposed test at the 5% level under a different prediction-noise design. The heatmaps are constructed in the same manner as in Figure 1. view at source ↗
Figure 3
Figure 3: Effect of Stratification on Confidence Sets. Notes: Each panel reports the empirical rejection frequency of the proposed test at the 5% level across homoskedastic and heteroskedastic prediction-noise designs, with and without stratification. The heatmaps are constructed in the same manner as in Figure 1. view at source ↗
Figure 4
Figure 4: 95% Confidence Sets with Binary and Continuous Proxies: (nd, nv) = (1000, 1000). Notes: Each panel reports the empirical rejection frequency of the proposed test at the 5% level over a grid of 10,000 candidate values of (θ1, θ2), based on 500 Monte Carlo replications. The upper-left panel corresponds to the binary proxy defined in (29), using the sieve basis in (30). The remaining panels correspond to the c… view at source ↗
Figure 5
Figure 5: 95% Confidence Sets with Binary and Continuous Proxies: (nd, nv) = (5000, 5000). Notes: Each panel reports the empirical rejection frequency of the proposed test at the 5% level over a grid of 10,000 candidate values of (θ1, θ2), based on 500 Monte Carlo replications. The upper-left panel corresponds to the binary proxy defined in (29), using the sieve basis in (30). The remaining panels correspond to the c… view at source ↗
Original abstract

Empirical researchers increasingly use upstream machine-learning (ML) methods to construct proxies for latent target variables from complex, unstructured data. A naive plug-in use of such proxies in downstream econometric models, however, can lead to biased estimation and invalid inference. This paper develops a framework for partial identification and inference in general moment models with ML-generated proxies. Our approach does not require restrictive assumptions on the upstream ML procedure, such as consistency or known convergence rates, nor does it require a complete validation sample containing all variables used in the downstream analysis. Instead, we assume access to two datasets: a downstream sample containing observed covariates and the proxy, and an auxiliary validation sample containing joint observations on the proxy and its target variable. We treat the proxy as a linking variable between these two samples, rather than as a literal noisy substitute for the latent target variable. Building on this idea, we develop a sharp identification strategy based on an unconditional optimal transport characterization and an inference procedure that controls asymptotic size using analytical critical values without resampling. Monte Carlo simulations show reliable size control and informative confidence sets across a range of predictive-accuracy scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper develops a partial identification and inference framework for general moment models E[m(X, Y*, Z; θ)] = 0 that incorporate machine-learned proxies Z for a latent target Y*. It combines a downstream sample (containing covariates X and proxy Z) with an auxiliary validation sample (containing Z and Y*) by treating the proxy as a linking variable. Identification proceeds via an unconditional optimal transport characterization of the joint distribution of (Z, Y*), yielding sharp bounds on θ without requiring consistency or known rates for the upstream ML procedure. Inference uses analytical critical values that control asymptotic size without resampling. Monte Carlo simulations illustrate reliable size control and informative confidence sets across varying proxy accuracy levels.

Significance. If the central claims hold, the paper would make a substantial contribution to econometric practice with ML-generated proxies by enabling valid inference in moment models under minimal assumptions on the upstream learner and without requiring a complete validation sample. The data-combination approach combined with analytical (non-resampling) critical values is a notable strength that reduces computational demands while maintaining size control, and the Monte Carlo evidence supports practical applicability across predictive-accuracy regimes.

major comments (1)
  1. [Identification section / abstract] The unconditional optimal transport characterization (abstract and identification section): the paper asserts that this delivers sharp partial identification bounds for general moment conditions involving covariates X. However, the unconditional coupling between the marginals of Z and Y* does not account for dependence between X and Y*. For non-separable m(X, Y*, Z; θ), the extremal joints from unconditional OT will generally produce strictly conservative bounds relative to the sharp identified set that respects the joint (X, Y*) distribution. This directly undermines the sharpness claim that is central to the contribution.
minor comments (2)
  1. [Introduction] The notation for the moment function m(·) and the precise statement of the data-combination assumption (downstream vs. validation samples) could be introduced earlier and with greater formality to improve readability for readers unfamiliar with the transport approach.
  2. [Monte Carlo section] Monte Carlo design: the reported scenarios vary predictive accuracy but do not include cases with strong dependence between X and Y*; adding such designs would better illustrate whether the reported size control persists under the conditions where the identification concern is most relevant.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We are grateful to the referee for providing a detailed and insightful report on our manuscript. The major comment raises an important point regarding the sharpness of the identified set, which we address below by committing to a revision that strengthens the identification strategy.

Point-by-point responses
  1. Referee: [Identification section / abstract] The unconditional optimal transport characterization (abstract and identification section): the paper asserts that this delivers sharp partial identification bounds for general moment conditions involving covariates X. However, the unconditional coupling between the marginals of Z and Y* does not account for dependence between X and Y*. For non-separable m(X, Y*, Z; θ), the extremal joints from unconditional OT will generally produce strictly conservative bounds relative to the sharp identified set that respects the joint (X, Y*) distribution. This directly undermines the sharpness claim that is central to the contribution.

    Authors: We thank the referee for this careful observation. Upon reflection, we agree that using an unconditional optimal transport coupling between the marginal distributions of Z and Y* would indeed fail to fully account for the dependence structure between X and Y* induced by the common Z, leading to conservative bounds for non-separable moment functions m. To achieve sharp partial identification, the optimal transport must be conducted conditionally on Z, coupling the conditional distributions P(X|Z) and P(Y*|Z) for each value of the linking variable. We will revise the identification section to explicitly characterize the sharp identified set using conditional optimal transport given Z. This revision will also update the abstract to reflect the conditional nature of the transport. The Monte Carlo simulations and inference procedure will remain applicable, as they do not rely on the unconditional aspect. We believe this change will clarify and strengthen the contribution without altering the core data-combination approach. revision: yes
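The referee's point, and the rebuttal's proposed fix, can be illustrated with a toy binary example (our construction, not from the paper): coupling conditionally on the linking variable Z gives weakly tighter bounds on E[X·Y*] than coupling the pooled marginals, and strictly tighter ones when X and Y* both depend on Z.

```python
def frechet_bounds(px1, py1):
    # Sharp bounds on E[X * Y*] for binary X, Y* with given P(X=1), P(Y*=1).
    return max(0.0, px1 + py1 - 1.0), min(px1, py1)

p_z = {0: 0.5, 1: 0.5}            # hypothetical distribution of the proxy Z
p_x1_given_z = {0: 0.8, 1: 0.2}   # P(X = 1 | Z), from the downstream sample
p_y1_given_z = {0: 0.2, 1: 0.8}   # P(Y* = 1 | Z), from the validation sample

# Unconditional: couple the pooled marginals of X and Y*, ignoring Z.
px1 = sum(p_z[z] * p_x1_given_z[z] for z in p_z)
py1 = sum(p_z[z] * p_y1_given_z[z] for z in p_z)
u_lo, u_hi = frechet_bounds(px1, py1)

# Conditional: couple within each Z-cell, then average over Z.
c_lo = sum(p_z[z] * frechet_bounds(p_x1_given_z[z], p_y1_given_z[z])[0]
           for z in p_z)
c_hi = sum(p_z[z] * frechet_bounds(p_x1_given_z[z], p_y1_given_z[z])[1]
           for z in p_z)

print(f"unconditional: [{u_lo:.2f}, {u_hi:.2f}]")  # [0.00, 0.50]
print(f"conditional:   [{c_lo:.2f}, {c_hi:.2f}]")  # [0.00, 0.20]
```

Here the conditional bounds [0.00, 0.20] are a strict subset of the unconditional ones [0.00, 0.50], matching the rebuttal's concession that sharpness requires transport conditional on Z.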

Circularity Check

0 steps flagged

No circularity; identification derived from data-combination structure and transport theory

full rationale

The paper's central claim rests on treating the proxy as a linking variable between downstream and validation samples, then applying an unconditional optimal transport map to characterize the joint distribution for partial identification in general moment models. This draws directly from the data-combination setup and standard optimal transport results rather than redefining quantities in terms of themselves or renaming fitted parameters as predictions. No load-bearing self-citations, ansatz smuggling, or uniqueness theorems imported from the authors' prior work appear in the derivation chain. The approach is self-contained against external benchmarks from transport theory and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard econometric assumptions for moment models and data combination plus the novel transport characterization. No free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • domain assumption The proxy variable is observed in both the downstream sample and the auxiliary validation sample and serves as a linking variable.
    Central to the data-combination strategy described in the abstract.
  • domain assumption Unconditional optimal transport between the linked samples yields sharp identification bounds.
    Basis for the partial identification result.

pith-pipeline@v0.9.0 · 5488 in / 1201 out tokens · 76137 ms · 2026-05-10T15:08:44.974277+00:00 · methodology


Reference graph

Works this paper leans on

4 extracted references

  1. [1]

    Machine learning and prediction errors in causal inference

Allon, Gad, Daniel Chen, Zhenling Jiang, and Dennis Zhang (2023). "Machine learning and prediction errors in causal inference". The Wharton Scho…

  2. [2]

    Prediction-powered inference

Angelopoulos, Anastasios N., Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic (2023). "Prediction-powered inference". Science 382.6671, pp. 669–674.

  3. [3]

    A Unified Approach to Regression Analysis Under Double-Sampling Designs

Chen, Yi-Hau and Hung Chen (2000). "A Unified Approach to Regression Analysis Under Double-Sampling Designs". Journal of the Royal Statistical Society Series B: Statistical Methodology 62.3, pp. 449–460.

  4. [4]

    Debiasing Machine-Learning- or AI-Generated Regressors in Partial Linear Models

Zhang, Jingwen, Wendao Xue, Yifan Yu, and Yong Tan (2023). "Debiasing Machine-Learning- or AI-Generated Regressors in Partial Linear Models". SSRN Electronic Journal.