Missing data and cluster graphs: cluster-level missingness vs variable-level missingness
Pith reviewed 2026-05-21 02:15 UTC · model grok-4.3
The pith
Graphical conditions on cluster missingness graphs recover joint distributions and macro causal effects
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Assuming the cluster-level missingness graphs are compatible with the underlying variable-level models, the paper shows that graphical conditions on the m-C-DMG and cm-C-DMG determine whether the joint distribution and macro causal effects are recoverable. This gives a way to assess whether coarse structural information about missingness is sufficient for probabilistic and causal queries.
What carries the argument
The notion of compatibility between abstract cluster missingness graphs (the m-C-DMG, which retains variable-specific indicators, and the cm-C-DMG, which aggregates missingness at the cluster level) and variable-level models. Compatibility is what allows graphical recoverability criteria to be applied to the cluster graphs.
Load-bearing premise
Compatibility between the abstract cluster graphs and the actual variable-level missingness mechanisms must hold; otherwise the graphical conditions may not guarantee recoverability.
What would settle it
A simulation study or theoretical counterexample showing a case where the cluster graph meets the graphical criteria under compatibility but the joint distribution is not actually recoverable from the data.
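As a starting point for such a simulation study, the sketch below checks recoverability in the simplest variable-level setting, not in the paper's cluster setting. Everything in it is an illustrative assumption: the function names (`simulate`, `complete_case_mean`, `adjusted_mean`), the two binary variables, and the probabilities. It assumes a fully observed X and a Y whose missingness depends only on X, in which case P(Y) is known to be recoverable as the sum over x of P(Y | X = x, R_Y = 1) P(X = x), while the complete-case estimate is biased.

```python
import random


def simulate(n=100_000, seed=0):
    """Draw (x, y) pairs; y is replaced by None when it is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0                    # X is always observed
        y = 1 if rng.random() < (0.8 if x else 0.2) else 0    # Y depends on X
        # Assumed mechanism: missingness of Y depends on X only.
        observed = rng.random() < (0.3 if x else 0.9)
        rows.append((x, y if observed else None))
    return rows


def complete_case_mean(rows):
    """Estimate P(Y=1) from the rows where Y is observed (biased here)."""
    seen = [y for _, y in rows if y is not None]
    return sum(seen) / len(seen)


def adjusted_mean(rows):
    """Estimate P(Y=1) as sum_x P(Y=1 | X=x, R_Y=1) * P(X=x)."""
    total = 0.0
    for x_val in (0, 1):
        stratum = [y for x, y in rows if x == x_val]
        seen = [y for y in stratum if y is not None]
        total += (sum(seen) / len(seen)) * (len(stratum) / len(rows))
    return total


if __name__ == "__main__":
    data = simulate()
    # True P(Y=1) is 0.5; the complete-case target is 0.21 / 0.6 = 0.35.
    print("complete case:", round(complete_case_mean(data), 3))
    print("adjusted:     ", round(adjusted_mean(data), 3))
```

A study aimed at the paper's claim would replace this two-variable mechanism with several variable-level models that are all compatible with one cluster graph, and compare the estimates across them.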
Original abstract
Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution as well as graphical conditions for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference, and when finer-grained modeling is necessary.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces m-C-DMG and cm-C-DMG as abstract cluster-level missingness graphs, formalizes a compatibility relation to underlying variable-level missingness models, and supplies graphical criteria for recovering the joint distribution as well as a macro causal effect when only cluster-level information is available.
Significance. If the compatibility conditions and graphical recoverability criteria are shown to be sound, the work supplies a practical bridge between fully specified variable-level missing-data graphs and the coarser cluster representations that arise in public-health and social-science applications. The explicit treatment of both probabilistic and causal queries under the two graph classes is a clear strength.
major comments (2)
- [§4.2] Compatibility Definition: compatibility is defined via structural matching of clusters and edges between the abstract graph and a variable-level model. This matching does not explicitly constrain higher-order conditional dependencies among missingness indicators inside a cluster; consequently the d-separation statements used in the recoverability theorems of §5 may fail to hold uniformly for every compatible variable-level completion.
- [Theorem 3] Macro causal effect recovery under cm-C-DMG: the stated graphical criterion assumes that any two variable-level models compatible with the same cm-C-DMG induce identical recoverability verdicts. No explicit argument or counter-example check is supplied to confirm that differing intra-cluster missingness dependencies cannot alter the relevant conditional independencies.
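To make the first concern concrete, here is a minimal sketch of what "structural matching of clusters and edges" could look like if compatibility is read as: projecting the variable-level graph onto the clusters yields exactly the cluster graph. That reading is an assumption, not the paper's definition. The helper names, the variables `X1`, `X2`, `Y1`, and the choice to keep the indicator `R_Y1` as its own node are hypothetical, and only directed edges are handled (no bidirected edges).

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source, target)


def project_to_clusters(edges: Iterable[Edge], cluster_of: Dict[str, str]) -> Set[Edge]:
    """Map each variable-level edge to a cluster-level edge.

    Edges whose endpoints share a cluster are dropped, so intra-cluster
    structure is invisible after projection.
    """
    return {
        (cluster_of[u], cluster_of[v])
        for u, v in edges
        if cluster_of[u] != cluster_of[v]
    }


def is_structurally_compatible(
    edges: Iterable[Edge], cluster_of: Dict[str, str], cluster_edges: Iterable[Edge]
) -> bool:
    """Assumed check: the projection must equal the cluster graph's edge set."""
    return project_to_clusters(edges, cluster_of) == set(cluster_edges)


# Hypothetical example: cluster CX = {X1, X2}, cluster CY = {Y1},
# and a variable-specific missingness indicator R_Y1 kept as its own node.
cluster_of = {"X1": "CX", "X2": "CX", "Y1": "CY", "R_Y1": "R_Y1"}
cluster_graph = {("CX", "CY"), ("CX", "R_Y1")}

model_a = {("X1", "Y1"), ("X1", "R_Y1")}
model_b = model_a | {("X2", "X1")}  # adds an intra-cluster edge only

if __name__ == "__main__":
    print(is_structurally_compatible(model_a, cluster_of, cluster_graph))
    print(is_structurally_compatible(model_b, cluster_of, cluster_graph))
```

Under this reading, `model_a` and `model_b` both pass the check even though they differ inside `CX`, which is the gap the referee points at: a purely structural match says nothing about dependencies within a cluster.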
minor comments (2)
- [Figure 1] Notation for the aggregated missingness indicator in cm-C-DMG is introduced without a dedicated legend in Figure 1; adding an explicit key would improve readability.
- [§3] The running example in §3 mixes binary and continuous variables without stating whether the graphical criteria are intended to be distribution-free or require additional parametric assumptions.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting potential gaps in the compatibility definition and the robustness of the recoverability results. We address each point below and will incorporate clarifications and supporting arguments in the revision.
- Referee: §4.2 (Compatibility Definition): compatibility is defined via structural matching of clusters and edges between the abstract graph and a variable-level model. This matching does not explicitly constrain higher-order conditional dependencies among missingness indicators inside a cluster; consequently the d-separation statements used in the recoverability theorems of §5 may fail to hold uniformly for every compatible variable-level completion.
  Authors: We agree the definition emphasizes structural matching. The d-separations invoked in §5 are inter-cluster and are preserved by the compatibility relation even if intra-cluster higher-order dependencies exist, because such dependencies are confined within clusters and do not create new paths between clusters. We will revise §4.2 to add an explicit remark and a short supporting argument establishing this preservation.
  Revision: yes
- Referee: Theorem 3 (Macro causal effect recovery under cm-C-DMG): the stated graphical criterion assumes that any two variable-level models compatible with the same cm-C-DMG induce identical recoverability verdicts. No explicit argument or counter-example check is supplied to confirm that differing intra-cluster missingness dependencies cannot alter the relevant conditional independencies.
  Authors: The observation is fair; the manuscript does not supply an explicit verification. We will augment the proof of Theorem 3 with a brief argument showing that intra-cluster dependencies cannot change the macro-level d-separations relevant to recoverability, as the cm-C-DMG aggregates mechanisms at the cluster level and compatibility ensures the relevant independencies are determined solely by the abstract graph.
  Revision: yes
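The distinction between variable-level and macro-level d-separations in this exchange can be illustrated on a toy example. The sketch below is a plain d-separation test for DAGs, using the standard ancestral moral graph construction. It is not the paper's machinery: it ignores bidirected edges and missingness indicators, and the node names `A1`, `A2`, `B` and both models are hypothetical. The assumed setup is a cluster {A1, A2} and a cluster {B}.

```python
from itertools import combinations


def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen


def d_separated(parents, x, y, given=frozenset()):
    """Test whether x and y are d-separated by `given` in a DAG.

    `parents` maps each node to an iterable of its parents.
    x and y must be distinct and must not belong to `given`.
    """
    keep = ancestors(parents, {x, y} | set(given))
    # Moralize the ancestral subgraph: link each child to its parents
    # and link every pair of parents that share a child.
    adj = {v: set() for v in keep}
    for child in keep:
        pa = list(parents.get(child, ()))
        for p in pa:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(pa, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Search for y from x without passing through the conditioning set.
    blocked = set(given)
    seen, stack = {x}, [x]
    while stack:
        for nb in adj[stack.pop()]:
            if nb == y:
                return False
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                stack.append(nb)
    return True


# Both models have the single inter-cluster edge A1 -> B.
model_a = {"B": ["A1"]}                  # no edge inside the cluster
model_b = {"A1": ["A2"], "B": ["A1"]}    # adds the intra-cluster edge A2 -> A1

if __name__ == "__main__":
    print(d_separated(model_a, "A2", "B"))
    print(d_separated(model_b, "A2", "B"))
```

In `model_a` the variable `A2` is d-separated from `B`, while in `model_b` it is not, although both models share the same inter-cluster edge. This neither confirms nor refutes the authors' response; it only shows that variable-level separations can differ across such models, so the promised argument has to establish that the macro-level statements used in §5 and Theorem 3 are unaffected.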
Circularity Check
No significant circularity; graphical recoverability conditions derived independently
The paper develops new abstract graph classes (m-C-DMG, cm-C-DMG) and compatibility conditions from first principles in graphical causal models for missing data. Recoverability criteria for the joint distribution and macro causal effects are stated as consequences of d-separation and missingness indicator properties under these abstractions. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on explicit graphical criteria that do not reduce to the inputs by construction, making the theory self-contained against external benchmarks in missing-data graphical modeling.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Standard assumptions of directed mixed graphs for missingness mechanisms (e.g., no unmeasured confounding within the modeled structure)
invented entities (2)
- m-C-DMG (cluster-based missingness graph retaining variable-specific indicators): no independent evidence
- cm-C-DMG (cluster-level aggregated missingness graph): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce two classes of cluster-based missingness graphs: the m-C-DMG... and the cm-C-DMG... We formalize the notion of compatibility... and give graphical conditions for recovering the joint distribution as well as... a macro causal effect.
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 1... necessary and sufficient condition for recovering the joint distribution Pr(C) is the absence of any vertex CX... neighbors or collider path
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.