🤖 AI Summary
Accurate mutual information (MI) estimation in high-dimensional data has long been hindered by severe undersampling and the absence of principled criteria for assessing estimator reliability. This paper introduces an MI estimation protocol with explicit reliability tests and statistical consistency checks, integrating self-supervised representation learning, embedding dimensionality analysis, training dynamics monitoring, and statistical consistency verification. Crucially, it shows that reliable MI estimation remains feasible, even under extreme undersampling, whenever the high-dimensional data admit an accurate low-dimensional representation. Building on this insight, the authors develop a verifiable framework for quantifying MI estimator quality. Extensive evaluation on multiple high-dimensional benchmark datasets demonstrates controlled estimation error, strong robustness, and a substantially expanded range of practical applications for MI in real-world high-dimensional data analysis.
📝 Abstract
Mutual information (MI) is a measure of statistical dependence between two variables, widely used in data analysis, so accurate methods for estimating MI from empirical data are crucial. Such estimation is a hard problem, and there are provably no estimators that are universally good for finite datasets. Common estimators struggle with high-dimensional data, which is a staple of modern experiments. Recently, promising machine learning-based MI estimation methods have emerged. Yet it remains unclear whether and when they produce accurate results, as performance depends on dataset size, the statistical structure of the data, and estimator hyperparameters such as the embedding dimensionality or the duration of training. There are also no accepted tests to signal when the estimators are inaccurate. Here, we systematically address these gaps. We propose and validate a protocol for MI estimation that includes explicit checks ensuring reliability and statistical consistency. Contrary to accepted wisdom, we demonstrate that reliable MI estimation is achievable even with severely undersampled, high-dimensional datasets, provided these data admit accurate low-dimensional representations. These findings broaden the potential use of machine learning-based MI estimation methods in real-world data analysis and provide new insights into when and why modern high-dimensional, self-supervised algorithms perform effectively.
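To make the idea of a statistical consistency check concrete, here is a minimal, hypothetical sketch: an estimator is flagged as unreliable if its output drifts when the data are subsampled. The `gaussian_mi` estimator (closed-form MI for jointly Gaussian 1-D variables) and all thresholds are illustrative stand-ins, not the paper's actual protocol or estimator.

```python
import numpy as np

def gaussian_mi(x, y):
    """Closed-form MI (in nats) for jointly Gaussian 1-D variables.

    Illustrative stand-in: the consistency check below works with any
    MI estimator that maps (x, y) samples to a scalar estimate.
    """
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

def consistency_check(x, y, estimator, n_splits=4, tol=0.1, seed=0):
    """Simple data-size consistency test (hypothetical, for illustration).

    Re-estimates MI on random half-samples; if the mean half-sample
    estimate drifts from the full-sample estimate by more than `tol`,
    the estimate is flagged as unreliable.
    """
    rng = np.random.default_rng(seed)
    full = estimator(x, y)
    half_estimates = []
    for _ in range(n_splits):
        idx = rng.choice(len(x), size=len(x) // 2, replace=False)
        half_estimates.append(estimator(x[idx], y[idx]))
    drift = abs(np.mean(half_estimates) - full)
    return full, drift, drift < tol

# Synthetic example: y = 0.8*x + 0.6*noise gives rho = 0.8,
# so the true MI is -0.5*ln(1 - 0.64) ≈ 0.51 nats.
rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)
mi, drift, reliable = consistency_check(x, y, gaussian_mi)
```

With ample data the half-sample estimates agree with the full-sample one and the check passes; in the undersampled regime the drift grows and the estimate is flagged, which is the qualitative behavior the paper's reliability tests are designed to detect.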