Missing data and cluster graphs: cluster-level missingness vs variable-level missingness

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work asks whether probabilistic and causal queries are recoverable when only coarse-grained information about the missingness mechanism is known, specifically how variables are grouped into clusters. The authors introduce two cluster-level missing data graphical models: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. They formalize compatibility between these abstract graphs and variable-level missingness models, and give graphical criteria for recovering the joint distribution and macro-level causal effects. The results delineate when cluster-level information suffices for valid inference and when a finer-grained characterization of the missingness mechanism is necessary.

📝 Abstract

Missing data is pervasive in many scientific domains such as public health, environmental science, and the social sciences. Recoverability from missing data is typically studied using fully specified variable-level missingness models, even though in many applications only coarse structural information is available, for instance when variables are grouped into clusters because of limited knowledge or for interpretability. In this paper, we investigate recoverability from such abstract representations. We introduce two classes of cluster-based missingness graphs: the m-C-DMG, which retains variable-specific missingness indicators, and the cm-C-DMG, which aggregates missingness mechanisms at the cluster level. We formalize the notion of compatibility between these abstract graphs and underlying variable-level missingness models, and study how this abstraction affects the recoverability of probabilistic and causal queries. In particular, we give graphical conditions for recovering the joint distribution as well as graphical conditions for recovering a macro causal effect. Overall, our results clarify when cluster-level missingness information is sufficient for valid inference, and when finer-grained modeling is necessary.
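To make the two abstractions concrete, here is a minimal Python sketch of how a variable-level missingness graph might be collapsed in each style. Everything in it is an assumption for illustration and is not taken from the paper: the toy variables, the `R_` naming for missingness indicators, the `collapse` helper, and the collapsing rule itself (a cluster edge is drawn whenever any member edge exists, and within-cluster edges become self-loops).

```python
# Illustrative sketch only: the collapsing rule and all names below are
# assumptions, not the paper's formal definitions.

def collapse(edges, cluster_of):
    """Map variable-level directed edges to cluster-level directed edges.

    An edge (A, B) is produced whenever some member of A points to some
    member of B. Edges inside a single cluster become a self-loop (A, A).
    """
    return {(cluster_of[u], cluster_of[v]) for u, v in edges}

# Hypothetical variable-level graph. R_X1 and R_X2 are the missingness
# indicators of X1 and X2; Y is assumed fully observed.
edges = {
    ("X1", "X2"),
    ("X2", "Y"),
    ("X1", "R_X2"),
    ("Y", "R_X1"),
    ("R_X1", "R_X2"),
}

# m-C-DMG-like view: cluster the substantive variables, but keep each
# missingness indicator as its own node.
keep_indicators = {
    "X1": "CX", "X2": "CX", "Y": "CY",
    "R_X1": "R_X1", "R_X2": "R_X2",
}

# cm-C-DMG-like view: additionally merge the indicators of one cluster
# into a single cluster-level indicator.
aggregate_indicators = {
    "X1": "CX", "X2": "CX", "Y": "CY",
    "R_X1": "R_CX", "R_X2": "R_CX",
}

graph_a = collapse(edges, keep_indicators)
graph_b = collapse(edges, aggregate_indicators)

print(sorted(graph_a))
print(sorted(graph_b))
```

In this toy case the first view still shows which indicator each cluster influences, while the second only records that the cluster CX and the cluster CY both influence the missingness of CX. That loss of detail is the kind of abstraction whose effect on recoverability the abstract describes.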

Problem

Research questions and friction points this paper is trying to address.

missing data

cluster graphs

recoverability

missingness mechanisms

causal inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

cluster-based missingness

missing data recoverability

causal inference