Pith: machine review for the scientific record

arXiv: 2604.20416 · v1 · submitted 2026-04-22 · stat.AP


SHARELIFE Imputations

Giuseppe De Luca, Paolo Li Donni

Pith reviewed 2026-05-09 23:04 UTC · model grok-4.3

keywords: multiple imputation, SHARELIFE, life course data, item nonresponse, fully conditional specification, retrospective survey, SHARE

The pith

SHARELIFE life-course data receive multiple imputations via fully conditional specification that align with observed responses and external benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create usable complete datasets from SHARELIFE Waves 3 and 7 by filling in gaps left by item nonresponse, especially in monetary and biographical variables that span many years and currencies. It walks through the recoding needed to harmonize the retrospective questions on partnerships, fertility, jobs, and residences, then applies an iterative imputation procedure that builds a model for each variable from all the others. A sympathetic reader would care because life-history analyses on employment, family formation, or migration would otherwise lose cases or carry bias from the missing answers. The authors report that the filled values reproduce the patterns seen in the actually observed data, match results from inverse-propensity weighting, and line up with complete records from the ordinary SHARE waves. This supplies researchers with ready-to-use imputed files whose statistical properties can be checked against multiple reference points.

Core claim

The central claim is that an imputation model based on fully conditional specification, after appropriate data harmonization that includes currency conversions across time periods, generates completed SHARELIFE records whose distributions are consistent both internally with the observed cases and externally with alternative nonresponse corrections and with data from the regular SHARE waves.

What carries the argument

The fully conditional specification imputation procedure, which draws each incomplete variable in turn from a conditional model given all other variables and repeats the cycle to produce multiple completed datasets.
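The cycle described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes every variable is continuous and uses a normal linear regression for each conditional model, whereas the paper's models for categorical, count, and monetary variables are not reproduced on this page. The function name `fcs_impute` and its parameters are hypothetical.

```python
import numpy as np


def fcs_impute(data, n_imputations=5, n_cycles=10, seed=0):
    """Return a list of completed copies of `data` (NaN marks a missing cell).

    Assumption: each conditional model is a normal linear regression.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    n_rows, n_cols = data.shape
    completed_sets = []
    for _ in range(n_imputations):
        filled = data.copy()
        # Starting values: resample each column's observed values into its gaps.
        for j in range(n_cols):
            observed = data[~missing[:, j], j]
            filled[missing[:, j], j] = rng.choice(observed, size=missing[:, j].sum())
        for _ in range(n_cycles):
            for j in range(n_cols):
                miss_j = missing[:, j]
                if not miss_j.any():
                    continue
                # Predictors: intercept plus all other variables as currently completed.
                others = np.delete(filled, j, axis=1)
                X = np.column_stack([np.ones(n_rows), others])
                X_obs, y_obs = X[~miss_j], filled[~miss_j, j]
                xtx_inv = np.linalg.pinv(X_obs.T @ X_obs)
                beta_hat = xtx_inv @ X_obs.T @ y_obs
                resid = y_obs - X_obs @ beta_hat
                dof = max(len(y_obs) - X.shape[1], 1)
                # Draw the regression parameters first, then the missing values,
                # so that parameter uncertainty carries into the imputations.
                sigma2 = resid @ resid / rng.chisquare(dof)
                cov = sigma2 * xtx_inv
                beta = rng.multivariate_normal(beta_hat, (cov + cov.T) / 2)
                noise = rng.normal(0.0, np.sqrt(sigma2), miss_j.sum())
                filled[miss_j, j] = X[miss_j] @ beta + noise
        completed_sets.append(filled)
    return completed_sets
```

Each pass over the columns is one cycle; restarting from fresh starting values for every completed dataset keeps the imputations independent across datasets.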

If this is right

  • Analyses of life-course transitions can retain the full sample rather than dropping respondents who have any missing retrospective items.
  • Results obtained from the imputed files can be checked for sensitivity by comparing them with estimates that use inverse-propensity weighting on the same incomplete data.
  • The completed datasets support pooled analyses that combine SHARELIFE information with the longitudinal measures collected in the standard SHARE waves.
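The inverse-propensity-weighting comparison mentioned in the second point can be illustrated with a weighting-class estimator. This is a simplified sketch under a stated assumption: response propensities are estimated as the response rate within discrete strata, while the paper's actual propensity model is not described on this page. The function name `ipw_mean` and the stratum labels are hypothetical.

```python
from collections import defaultdict


def ipw_mean(values, strata):
    """Inverse-propensity-weighted mean of the observed values.

    `None` marks item nonresponse. The response propensity of each stratum
    is estimated as its share of respondents (a weighting-class adjustment).
    """
    total = defaultdict(int)
    responded = defaultdict(int)
    for value, stratum in zip(values, strata):
        total[stratum] += 1
        if value is not None:
            responded[stratum] += 1

    weighted_sum = 0.0
    weight_sum = 0.0
    for value, stratum in zip(values, strata):
        if value is None:
            continue
        # Weight = 1 / estimated response propensity of the stratum.
        weight = total[stratum] / responded[stratum]
        weighted_sum += weight * value
        weight_sum += weight
    return weighted_sum / weight_sum


# Stratum A responds half the time, stratum B always.
values = [10, 10, None, None, 20, 20, 20, 20]
strata = ["A"] * 4 + ["B"] * 4
print(ipw_mean(values, strata))  # 15.0, against a complete-case mean of about 16.67
```

In the toy data the complete-case mean is pulled toward the stratum that responds more often; the weights restore each stratum to its share of the full sample.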

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Researchers studying retirement or health trajectories can now include individuals who would otherwise be excluded because of incomplete life-history modules.
  • The same preparation and imputation workflow could be applied to other retrospective surveys that collect biographical data across multiple domains and currencies.
  • If later waves add new observations on the same individuals, the imputed baseline values can serve as starting points for dynamic models of change.

Load-bearing premise

The missing values are missing at random given the other variables that are included in the imputation models.

What would settle it

A direct comparison in which the distribution of imputed pension amounts or employment durations deviates systematically from the distribution seen in the non-missing cases or from the corresponding distribution in the regular SHARE waves for the same birth cohorts.
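One way to put a number on such a comparison is the two-sample Kolmogorov-Smirnov statistic, the largest vertical gap between two empirical distribution functions. The sketch below is an illustration only: the paper, as summarized here, relies on kernel density plots, and the choice of this statistic and the name `ks_statistic` are assumptions of this note.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
```

A value near 0 means the imputed and observed distributions are hard to tell apart; a value near 1 means they barely overlap. Under MAR the two marginal distributions may legitimately differ, so a large gap is a prompt for closer inspection rather than proof of failure.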

Figures

Figures reproduced from arXiv: 2604.20416 by Giuseppe De Luca, Paolo Li Donni.

Figure 1: German Marks (DEM/GDR) conversion coverage.
Figure 2: German Marks (DEM/GDR) conversion coverage.
Figure 3: Kernel densities of monthly maternity benefits in t ...
Figure 4: Kernel densities of first monthly wages in the obser ...
Figure 5: Kernel densities of first monthly incomes from self- ...
Figure 6: Kernel densities of pension benefits when retired i ...
Figure 7: Kernel densities of current monthly wages in the ob ...
Figure 8: Kernel densities of current monthly incomes from s ...
Figure 9: Kernel densities of monthly wages at the end of the m ...
Figure 10: Kernel densities of monthly incomes from self-em ...
Figure 11: Kernel densities of maternity benefits and monthl ...
Figure 12: Scatterplot between monthly maternity benefits a ...
Figure 13: Scatterplot of first monthly wages over time
Figure 14: Scatterplot of first monthly incomes from self-em ...
Figure 15: Scatterplot of monthly pension benefits and year o ...
Figure 16: Scatterplot between current monthly wages and in ...
Figure 17: Scatterplot between current monthly incomes fro ...
Figure 18: Scatterplot of monthly wages at the end of the main ...
Figure 19: Scatterplot of monthly incomes from self-employ ...
Figure 20: Kernel densities of first monthly wages and monthl ...
Figure 21: Kernel densities of current monthly wages and cur ...
Figure 22: Kernel densities of monthly wages and monthly inc ...

Original abstract

This report describes the SHARELIFE-MI project, which aims to generate multiple imputations for missing values in the life-course data collected in SHARELIFE Waves 3 and 7. The SHARELIFE study reconstructs individual life histories through retrospective questions covering key biographical domains such as partnerships, fertility, employment, and residence. As in the regular SHARE waves, item nonresponse represents an important source of nonsampling error - particularly for monetary variables, which require conversions across multiple currencies and long time periods. We document the preliminary data recoding and harmonization steps, as well as the design, specification, and implementation of an imputation model based on the fully conditional specification approach. Finally, we assess the internal and external validity of the resulting imputations through comparisons with the observed data, alternative nonresponse adjustments based on inverse propensity weighting, and external benchmarks from the regular SHARE waves.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the SHARELIFE-MI project, which generates multiple imputations for item nonresponse in retrospective life-course data from SHARELIFE Waves 3 and 7. It details data recoding and harmonization, specifies an imputation model via fully conditional specification (FCS), and evaluates the resulting imputations for internal and external validity through comparisons with observed data, inverse propensity weighting adjustments, and external benchmarks from regular SHARE waves.

Significance. If the validity assessments hold, the work supplies a practical resource that can reduce bias from nonresponse in analyses of partnerships, fertility, employment, and residence histories, thereby increasing the usability of SHARELIFE for life-course research.

major comments (2)
  1. [Imputation model specification and internal validity assessment] The FCS imputation design imputes each variable from its own conditional model without built-in rejection sampling or post-processing to enforce temporal and logical constraints across interdependent events (e.g., child birth year must follow parent birth year + 15 and precede the survey year; partnership end dates must follow start dates). The internal validity checks compare only univariate or low-order moments with observed cases or IPW-adjusted estimates and therefore cannot detect joint violations. If the fraction of inconsistent imputed trajectories exceeds the near-zero rate observed in complete cases, the claim that the imputations match the true conditional distribution under the MAR assumption fails regardless of marginal agreement. (Imputation model specification and internal validity assessment sections.)
  2. [Validation approaches] No model equations, convergence diagnostics, fraction of missing information, or quantitative comparison metrics (e.g., standardized differences, overlap statistics) are reported, leaving the central claim of 'good internal and external validity' only partially supported by the described validation approaches. (Validation approaches section.)
minor comments (2)
  1. [Abstract] The abstract outlines the three validation approaches but does not summarize any numerical findings from them.
  2. [References] Add explicit references to standard FCS literature (van Buuren & Groothuis-Oudshoorn) and to SHARELIFE documentation for the harmonization steps.
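The joint-consistency check requested in the first major comment can be sketched as a per-record test plus a violation rate. The record layout (`LifeHistory`) and the function names are hypothetical, and the inequalities are one reading of the referee's examples: a child is born no earlier than the parent's birth year plus 15 and no later than the survey year, and a partnership does not end before it starts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LifeHistory:
    """Hypothetical record layout, not the SHARELIFE file structure."""
    birth_year: int
    survey_year: int
    child_birth_years: List[int]
    # (start_year, end_year); end_year is None for an ongoing partnership.
    partnerships: List[Tuple[int, Optional[int]]]


def is_consistent(record: LifeHistory, min_parent_age: int = 15) -> bool:
    """True if the record satisfies both illustrative constraints."""
    for child_year in record.child_birth_years:
        too_early = child_year < record.birth_year + min_parent_age
        too_late = child_year > record.survey_year
        if too_early or too_late:
            return False
    for start, end in record.partnerships:
        if end is not None and end < start:
            return False
    return True


def violation_rate(records: List[LifeHistory]) -> float:
    """Share of records that break at least one constraint."""
    if not records:
        return 0.0
    return sum(not is_consistent(r) for r in records) / len(records)
```

Computing `violation_rate` separately for complete cases and for each completed dataset gives the comparison the referee asks for: a rate among imputed trajectories well above the complete-case rate would indicate joint inconsistencies that marginal checks cannot reveal.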

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and indicate the revisions that will be incorporated in the next version.

Point-by-point responses
  1. Referee: [Imputation model specification and internal validity assessment] The FCS imputation design imputes each variable from its own conditional model without built-in rejection sampling or post-processing to enforce temporal and logical constraints across interdependent events (e.g., child birth year must follow parent birth year + 15 and precede the survey year; partnership end dates must follow start dates). The internal validity checks compare only univariate or low-order moments with observed cases or IPW-adjusted estimates and therefore cannot detect joint violations. If the fraction of inconsistent imputed trajectories exceeds the near-zero rate observed in complete cases, the claim that the imputations match the true conditional distribution under the MAR assumption fails regardless of marginal agreement. (Imputation model specification and internal validity assessment sections.)

    Authors: We agree that the absence of explicit rejection sampling or post-processing for all temporal and logical constraints represents a limitation of the current FCS implementation. While the imputation sequence and predictor choices incorporated some basic ordering constraints (e.g., parent birth years before child birth years), comprehensive enforcement across all interdependent events was not applied. We will revise the internal validity assessment section to report the proportion of imputed trajectories that violate key logical constraints (such as birth order and date sequencing) and compare this rate directly to the near-zero rate observed in complete cases. This addition will allow readers to evaluate whether joint inconsistencies remain negligible.

    Revision: yes

  2. Referee: [Validation approaches] No model equations, convergence diagnostics, fraction of missing information, or quantitative comparison metrics (e.g., standardized differences, overlap statistics) are reported, leaving the central claim of 'good internal and external validity' only partially supported by the described validation approaches. (Validation approaches section.)

    Authors: The referee is correct that the manuscript currently provides only a high-level description of the validation approaches without the requested quantitative details. In the revised version we will add the conditional model specifications (including predictor lists and link functions for key variables), convergence diagnostics from the FCS iterations, fraction of missing information values, and quantitative metrics such as standardized mean differences and propensity score overlap statistics for the IPW comparisons. These elements will be placed in the validation approaches section to provide stronger, more transparent support for the validity claims.

    Revision: yes
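Two of the quantities promised in this response have standard textbook forms, sketched here for orientation. The first function pools an estimate across completed datasets with Rubin's rules and returns the proportion of total variance attributable to missing data, a large-sample stand-in for the fraction of missing information (the degrees-of-freedom correction is omitted). The second computes a standardized mean difference. The function names are hypothetical, and nothing here reproduces the authors' planned computations.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one scalar estimate across m completed datasets (Rubin's rules).

    Returns (pooled estimate, total variance, lambda), where lambda is the
    share of total variance due to missing data.
    """
    m = len(estimates)
    q_bar = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    lam = (1 + 1 / m) * between / total
    return q_bar, total, lam


def standardized_mean_difference(sample_a, sample_b):
    """Mean difference divided by the root of the average sample variance."""
    pooled_sd = ((variance(sample_a) + variance(sample_b)) / 2) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / pooled_sd
```

For example, `pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])` gives a pooled estimate of 2.0, a total variance of 7/3, and a missing-data share of 4/7.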

Circularity Check

0 steps flagged

Standard FCS imputation procedure is self-contained with no circular reductions

Full rationale

The paper applies fully conditional specification to impute missing SHARELIFE life-course variables under the MAR assumption, then validates via direct comparisons to observed cases, IPW adjustments, and external SHARE benchmarks. No equations or steps reduce outputs to fitted inputs by construction, no self-citations carry the central validity claim, and no ansatz or uniqueness theorems are smuggled in. The derivation chain rests on established imputation methods and empirical checks that remain falsifiable against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the missing-at-random assumption for the imputation model and on the adequacy of the preliminary data harmonization steps; no free parameters or invented entities are explicitly introduced beyond standard regression coefficients fitted within the FCS procedure.

axioms (1)
  • domain assumption: Missing data are missing at random (MAR) conditional on observed covariates.
    This assumption is required for the FCS imputations to be unbiased and is standard in multiple-imputation applications to survey data.
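The MAR assumption can be written compactly in the usual missing-data notation. The symbols are an assumption of this note, since the page does not reproduce the paper's notation: $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$ splits the data into observed and missing parts, $R$ is the matrix of response indicators, and $\psi$ collects the parameters of the response mechanism.

```latex
\Pr\left(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \psi\right)
  = \Pr\left(R \mid Y_{\mathrm{obs}}, \psi\right)
```

In words, once the observed values are taken into account, whether an item is missing carries no further information about the value that would have been reported.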

pith-pipeline@v0.9.0 · 5437 in / 1137 out tokens · 27556 ms · 2026-05-09T23:04:15.360816+00:00 · methodology

