
arxiv: 2605.20943 · v1 · pith:7OFRGZ7Cnew · submitted 2026-05-20 · 📊 stat.ME

Missing data and cluster graphs: cluster-level missingness vs variable-level missingness

Pith reviewed 2026-05-21 02:15 UTC · model grok-4.3

classification 📊 stat.ME
keywords: missing data · cluster graphs · recoverability · causal effects · graphical models · compatibility · macro causal inference

The pith

Graphical conditions on cluster missingness graphs recover joint distributions and macro causal effects

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In many applications, such as surveys, variables are grouped into clusters and only the cluster-level structure of the missingness is known. This paper introduces two abstract representations of missingness at that level: the m-C-DMG and the cm-C-DMG. It formalizes compatibility between these graphs and more detailed variable-level missingness models, and it provides graphical criteria for when the joint distribution can be recovered and when macro causal effects can be recovered from the observed data. The results clarify when cluster-level information alone supports valid inference.

Core claim

Assuming compatibility between cluster-level missingness graphs and the underlying variable-level models, the paper shows that graphical conditions in the m-C-DMG and the cm-C-DMG determine whether the joint distribution and macro causal effects are recoverable. This gives a way to assess whether coarse structural information about missingness is sufficient for probabilistic and causal queries.

What carries the argument

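The argument rests on the notion of compatibility between the abstract cluster missingness graphs (the m-C-DMG, which retains variable-specific indicators, and the cm-C-DMG, which aggregates missingness at the cluster level) and variable-level models. Compatibility is what allows graphical recoverability criteria stated on the cluster graph to apply to the underlying variable-level model.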
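To make the idea of several variable-level graphs sharing one cluster graph concrete (compare the caption of Figure 1), here is a minimal Python sketch. It is not the paper's definition of compatibility, which this page does not reproduce. It assumes a simplified convention: a cluster edge A -> B exists whenever some variable in A points to some variable in B, intra-cluster edges are dropped, and bidirected edges and missingness indicators are ignored. The function name, variable names, and cluster names are all hypothetical.

```python
def collapse_to_cluster_graph(edges, cluster_of):
    """Map variable-level directed edges to cluster-level directed edges.

    Assumed convention (not taken from the paper): keep A -> B when some
    variable in cluster A has an edge into some variable in cluster B,
    and drop edges whose endpoints lie in the same cluster.
    """
    cluster_edges = set()
    for src, dst in edges:
        a, b = cluster_of[src], cluster_of[dst]
        if a != b:
            cluster_edges.add((a, b))
    return cluster_edges


# Hypothetical clustering of four variables into two clusters.
cluster_of = {"X1": "CX", "X2": "CX", "Y1": "CY", "Y2": "CY"}

# Two different variable-level graphs over the same variables.
graph_a = {("X1", "Y1"), ("X1", "X2")}
graph_b = {("X2", "Y2"), ("X2", "Y1"), ("Y1", "Y2")}

cluster_a = collapse_to_cluster_graph(graph_a, cluster_of)
cluster_b = collapse_to_cluster_graph(graph_b, cluster_of)
print(cluster_a == cluster_b)  # both reduce to the single edge CX -> CY
```

Under this simplified convention the two graphs are indistinguishable at the cluster level, which is why any criterion stated on the cluster graph has to hold for every variable-level graph that maps to it.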

Load-bearing premise

Compatibility between the abstract cluster graphs and the actual variable-level missingness mechanisms must hold, otherwise the graphical conditions may not guarantee recoverability.

What would settle it

A simulation study or theoretical counterexample showing a case where the cluster graph meets the graphical criteria under compatibility but the joint distribution is not actually recoverable from the data.
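As a rough illustration of what such a simulation harness could look like, the sketch below works at the variable level only and does not implement the paper's cluster-level criteria. It assumes a toy model chosen for this example: binary X is always observed, binary Y is missing with a probability that depends on X, and the candidate recovery formula is the standard factorization Pr(Y) = Σ_x Pr(Y | X = x, Y observed) Pr(X = x). All parameter values and function names are hypothetical.

```python
import random


def simulate(n=200_000, seed=0):
    """Draw (x, y) pairs where y is replaced by None when it is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        y = 1 if rng.random() < (0.8 if x else 0.2) else 0
        # Missingness of Y depends only on the fully observed X.
        missing = rng.random() < (0.7 if x else 0.1)
        rows.append((x, None if missing else y))
    return rows


def complete_case_mean(rows):
    """Naive estimate of Pr(Y = 1) using only rows where Y is observed."""
    observed = [y for _, y in rows if y is not None]
    return sum(observed) / len(observed)


def recovered_mean(rows):
    """Estimate Pr(Y = 1) as sum over x of Pr(Y = 1 | x, observed) * Pr(x)."""
    n = len(rows)
    total = 0.0
    for x_val in (0, 1):
        p_x = sum(1 for x, _ in rows if x == x_val) / n
        ys = [y for x, y in rows if x == x_val and y is not None]
        total += p_x * sum(ys) / len(ys)
    return total


rows = simulate()
cc = complete_case_mean(rows)
rec = recovered_mean(rows)
print(f"complete-case: {cc:.3f}, recovered: {rec:.3f}, truth: 0.500")
```

In this toy model the true value of Pr(Y = 1) is 0.5, while the complete-case estimate converges to 0.35. A counterexample to the paper's criteria would need the same comparison run on a variable-level model that is compatible with a cluster graph satisfying the criteria.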

Figures

Figures reproduced from arXiv: 2605.20943 by Charles Assaad, Eugenio Valdano, Willow Scott.

Figure 1: Two ADMGs in (a) and (b) compatible with the same C-DMG in (c).
Figure 2: A m-C-DMG (a) and a cm-C-DMG (b). Figure 2b is the conceptual graph of the clinical …
Figure 3: cm-C-DMG where the macro causal effect of CX on CY is recoverable but the joint distribution is not.

Definition 4 (Rules of the do-calculus in m-C-DMGs and cm-C-DMGs). The three rules of the do-calculus are:
Rule 1: $\Pr(c_y \mid do(c_z), c_x, c_w) = \Pr(c_y \mid do(c_z), c_w)$ if $C_Y \perp\!\!\!\perp C_X \mid C_Z, C_W$ in $\mathcal{G}^{C,*m}_{\overline{C_Z}}$
Rule 2: $\Pr(c_y \mid do(c_z), do(c_x), c_w) = \Pr(c_y \mid do(c_z), c_x, c_w)$ if $C_Y \perp\!\!\!\perp C_X \mid C_Z, C_W$ in $\mathcal{G}^{C,*m}_{\overline{C_Z}\,\underline{C_X}}$
Rule 3: …
Abstract

Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models despite that, in many applications, only coarse structural information is available, for instance when variables are grouped into clusters due to limited knowledge or interpretability reasons. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions of recovering the joint distribution as well as graphical conditions of recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference, and when finer-grained modeling is necessary.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces m-C-DMG and cm-C-DMG as abstract cluster-level missingness graphs, formalizes a compatibility relation to underlying variable-level missingness models, and supplies graphical criteria for recovering the joint distribution as well as a macro causal effect when only cluster-level information is available.

Significance. If the compatibility conditions and graphical recoverability criteria are shown to be sound, the work supplies a practical bridge between fully specified variable-level missing-data graphs and the coarser cluster representations that arise in public-health and social-science applications. The explicit treatment of both probabilistic and causal queries under the two graph classes is a clear strength.

major comments (2)
  1. [§4.2] §4.2 (Compatibility Definition): compatibility is defined via structural matching of clusters and edges between the abstract graph and a variable-level model. This matching does not explicitly constrain higher-order conditional dependencies among missingness indicators inside a cluster; consequently the d-separation statements used in the recoverability theorems of §5 may fail to hold uniformly for every compatible variable-level completion.
  2. [Theorem 3] Theorem 3 (Macro causal effect recovery under cm-C-DMG): the stated graphical criterion assumes that any two variable-level models compatible with the same cm-C-DMG induce identical recoverability verdicts. No explicit argument or counter-example check is supplied to confirm that differing intra-cluster missingness dependencies cannot alter the relevant conditional independencies.
minor comments (2)
  1. [Figure 1] Notation for the aggregated missingness indicator in cm-C-DMG is introduced without a dedicated legend in Figure 1; adding an explicit key would improve readability.
  2. [§3] The running example in §3 mixes binary and continuous variables without stating whether the graphical criteria are intended to be distribution-free or require additional parametric assumptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments highlighting potential gaps in the compatibility definition and the robustness of the recoverability results. We address each point below and will incorporate clarifications and supporting arguments in the revision.

Point-by-point responses
  1. Referee: §4.2 (Compatibility Definition): compatibility is defined via structural matching of clusters and edges between the abstract graph and a variable-level model. This matching does not explicitly constrain higher-order conditional dependencies among missingness indicators inside a cluster; consequently the d-separation statements used in the recoverability theorems of §5 may fail to hold uniformly for every compatible variable-level completion.

    Authors: We agree the definition emphasizes structural matching. The d-separations invoked in §5 are inter-cluster and are preserved by the compatibility relation even if intra-cluster higher-order dependencies exist, because such dependencies are confined within clusters and do not create new paths between clusters. We will revise §4.2 to add an explicit remark and a short supporting argument establishing this preservation.
    revision: yes

  2. Referee: Theorem 3 (Macro causal effect recovery under cm-C-DMG): the stated graphical criterion assumes that any two variable-level models compatible with the same cm-C-DMG induce identical recoverability verdicts. No explicit argument or counter-example check is supplied to confirm that differing intra-cluster missingness dependencies cannot alter the relevant conditional independencies.

    Authors: The observation is fair; the manuscript does not supply an explicit verification. We will augment the proof of Theorem 3 with a brief argument showing that intra-cluster dependencies cannot change the macro-level d-separations relevant to recoverability, as the cm-C-DMG aggregates mechanisms at the cluster level and compatibility ensures the relevant independencies are determined solely by the abstract graph.
    revision: yes
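The preservation argument promised in both responses is not spelled out on this page. One way a reader might probe it is to enumerate a few variable-level completions of the same cluster structure and test the inter-cluster d-separation in each. The sketch below is an assumption-laden illustration, not the authors' argument: it handles DAGs only (no bidirected edges, no missingness indicators), uses the standard moralized-ancestral-graph test for d-separation, assumes the three node sets are disjoint, and uses hypothetical clusters CX = {X1}, CW = {W1}, CY = {Y1, Y2}.

```python
def ancestors(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(edges, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `edges`."""
    xs, ys, zs = set(xs), set(ys), set(zs)
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    # Restrict to the ancestral set of all nodes involved in the query.
    keep = ancestors(xs | ys | zs, parents)
    # Moralize: link each child to its parents and link co-parents.
    adj = {v: set() for v in keep}
    for child in keep:
        ps = list(parents.get(child, ()))
        for i, p in enumerate(ps):
            adj[p].add(child)
            adj[child].add(p)
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # Delete zs and check whether any node in ys is reachable from xs.
    seen = set(xs)
    stack = list(xs)
    while stack:
        node = stack.pop()
        if node in ys:
            return False
        for nb in adj[node]:
            if nb not in zs and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return True


# Three completions sharing the cluster structure CX -> CW -> CY.
base = {("X1", "W1"), ("W1", "Y1")}
completions = [
    base,
    base | {("Y1", "Y2")},                 # extra intra-cluster edge in CY
    base | {("W1", "Y2"), ("Y2", "Y1")},   # different wiring inside CY
]
results = [d_separated(g, {"X1"}, {"Y1", "Y2"}, {"W1"}) for g in completions]
print(results)
```

Agreement across a handful of completions does not prove the general claim; a single completion returning a different verdict would, however, be the kind of counterexample the referee asks about.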

Circularity Check

0 steps flagged

No significant circularity; graphical recoverability conditions derived independently

full rationale

The paper develops new abstract graph classes (m-C-DMG, cm-C-DMG) and compatibility conditions from first principles in graphical causal models for missing data. Recoverability criteria for the joint distribution and macro causal effects are stated as consequences of d-separation and missingness indicator properties under these abstractions. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on explicit graphical criteria that do not reduce to the inputs by construction, making the theory self-contained against external benchmarks in missing-data graphical modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on standard graphical model assumptions plus the newly introduced compatibility notion between cluster and variable-level graphs. No free parameters are mentioned. Two new graph classes are defined.

axioms (1)
  • [standard math] Standard assumptions of directed mixed graphs for missingness mechanisms (e.g., no unmeasured confounding within the modeled structure)
    Invoked when defining recoverability of probabilistic and causal queries from the graphs.
invented entities (2)
  • m-C-DMG (cluster-based missingness graph retaining variable-specific indicators) [no independent evidence]
    purpose: Abstract representation of missingness at the cluster level that keeps variable-specific missingness indicators
    Newly defined class to bridge variable-level and cluster-level modeling
  • cm-C-DMG (cluster-level aggregated missingness graph) [no independent evidence]
    purpose: Fully aggregated missingness mechanism at the cluster level
    Newly defined class for the coarsest abstraction

pith-pipeline@v0.9.0 · 5711 in / 1381 out tokens · 34449 ms · 2026-05-21T02:15:31.639191+00:00 · methodology


