Testing composite null hypotheses with high-dimensional dependent data: a computationally scalable FDR-controlling procedure

šŸ“… 2024-04-08
šŸ“ˆ Citations: 0
✨ Influential: 0
šŸ¤– AI Summary
In high-dimensional dependent data, conventional multiple testing procedures for composite null hypotheses often fail to control the false discovery rate (FDR) because they ignore the underlying dependence structure. Method: This paper proposes a novel framework for replicability analysis across multiple studies. It introduces a four-state hidden Markov model to jointly capture dependence and heterogeneity in the bivariate p-value sequence from two studies, and develops a scalable e-value-based synthesis framework that reduces the computational cost of joint inference across *n* studies from exponential in *n* to *O(n²)*, while guaranteeing asymptotic FDR control. Contribution/Results: Extensive simulations and a real-world genome-wide association study (GWAS) analysis demonstrate that, at the same nominal FDR level, the proposed method achieves substantially higher statistical power and identifies biologically meaningful mechanisms missed by standard approaches.
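The four-state construction can be sketched concretely: each feature carries a latent pair (θ₁, θ₂) ∈ {0,1}² recording whether it is a signal in each of the two studies, and the composite null is "any state other than (1,1)". The sketch below is a hypothetical illustration, not the paper's estimator: the transition matrix, initial distribution, and Beta(a, 1) alternative density are all assumed for the toy. It runs a scaled forward-backward pass to get the posterior probability of the composite null at each feature, then applies a generic step-up rule on those local-FDR-type scores:

```python
import numpy as np

# Hypothetical sketch of a 4-state HMM for two-study replicability analysis.
# States encode (theta1, theta2); the composite null is "state != (1,1)".
STATES = [(0, 0), (0, 1), (1, 0), (1, 1)]

def emission_density(p1, p2, a=0.3):
    """Bivariate p-value density under each state: Uniform(0,1) under the
    null, an assumed Beta(a, 1) density under the alternative."""
    f0 = lambda p: np.ones_like(p)            # null density
    f1 = lambda p: a * p ** (a - 1.0)         # assumed alternative density
    dens1 = [f0(p1), f0(p1), f1(p1), f1(p1)]  # study 1: theta1 = 0,0,1,1
    dens2 = [f0(p2), f1(p2), f0(p2), f1(p2)]  # study 2: theta2 = 0,1,0,1
    return np.stack([d1 * d2 for d1, d2 in zip(dens1, dens2)], axis=1)  # (T, 4)

def posterior_composite_null(p1, p2, A, pi):
    """Scaled forward-backward: P(theta_t != (1,1) | all p-values)."""
    B = emission_density(p1, p2)
    T = len(p1)
    alpha = np.zeros((T, 4)); beta = np.zeros((T, 4))
    alpha[0] = pi * B[0]; alpha[0] /= alpha[0].sum()
    for t in range(1, T):                     # forward pass (rescaled)
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        alpha[t] /= alpha[t].sum()
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):            # backward pass (rescaled)
        beta[t] = A @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # posterior over the 4 states
    return 1.0 - gamma[:, 3]                   # P(composite null); state 3 = (1,1)

def reject_at_fdr(clfdr, alpha_level=0.05):
    """Step-up rule: reject the k smallest scores whose running mean <= alpha."""
    order = np.argsort(clfdr)
    running = np.cumsum(clfdr[order]) / np.arange(1, len(clfdr) + 1)
    k = np.max(np.nonzero(running <= alpha_level)[0], initial=-1) + 1
    rejected = np.zeros(len(clfdr), dtype=bool)
    rejected[order[:k]] = True
    return rejected
```

Features with small p-values in both studies get a low posterior composite-null probability and are rejected first; a small p-value in only one study is explained by the (0,1) or (1,0) states and survives.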

šŸ“ Abstract
Testing composite null hypotheses arises in various applications, such as mediation and replicability analyses. The problem becomes more challenging in high-throughput experiments where tens of thousands of features are examined simultaneously. Existing large-scale inference methods for composite null hypothesis testing often fail to explicitly incorporate the dependence structure, producing overly conservative or overly liberal results. In this work, we first develop a four-state hidden Markov model (HMM) to model a bivariate $p$-value sequence from replicability analysis with two studies, accounting for local feature dependence and study heterogeneity. Building on the HMM, we propose a multiple testing procedure that controls the false discovery rate (FDR). Extending the HMM to model the $p$-values from $n$ studies requires a computational cost of exponential order of $n$. To address this challenge, we introduce a novel e-value framework that reduces the computational cost to quadratic growth in the number of studies while maintaining FDR control. We show that the proposed method asymptotically controls the FDR and exhibits higher power numerically than competing methods at the same FDR level. In a real data application to genome-wide association studies (GWAS), our method reveals new biological insights that are overlooked by existing methods.
Problem

Research questions and friction points this paper is trying to address.

Testing composite null hypotheses with high-dimensional dependent data
Incorporating dependence structure in large-scale inference methods
Reducing computational cost while maintaining FDR control
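On the computational-cost point: a joint HMM over n studies has a state space exponential in n, whereas pairwise constructions cost O(n²). The paper's synthesis is its own, but the closure property such constructions can rest on is generic: an arithmetic mean of e-values is again an e-value. The toy below (every ingredient is an illustrative assumption: the calibrator κp^(κ−1), the product rule for independent studies, and the averaging) only demonstrates the quadratic pattern:

```python
import numpy as np
from itertools import combinations

def p_to_e(p, kappa=0.5):
    """A standard p-to-e calibrator: kappa * p^(kappa-1) integrates to 1
    under a uniform null p-value, so it is a valid e-value."""
    return kappa * p ** (kappa - 1.0)

def pairwise_synthesis(p_matrix, kappa=0.5):
    """p_matrix: (n_studies, n_features) array of p-values.  For each of the
    n(n-1)/2 study pairs, form a per-feature pairwise e-value (here a simple
    product of calibrated e-values, valid under independent studies), then
    average over pairs: the mean of e-values is again an e-value."""
    n = p_matrix.shape[0]
    pair_e = [p_to_e(p_matrix[i], kappa) * p_to_e(p_matrix[j], kappa)
              for i, j in combinations(range(n), 2)]   # O(n^2) pairs
    return np.mean(pair_e, axis=0)
```

Features that are small across many studies accumulate large pairwise e-values in every pair, so the averaged e-value separates replicated signals from one-off hits at quadratic rather than exponential cost.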
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses four-state HMM for bivariate p-value modeling
Introduces e-value framework for computational efficiency
Ensures FDR control in high-dimensional data
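Once per-feature e-values are in hand, one generic way to turn them into FDR-controlled rejections is the e-BH procedure of Wang and Ramdas: with m e-values, reject the k largest, where k is the largest index such that the k-th largest e-value is at least m/(αk). The paper's procedure is its own refinement with asymptotic guarantees under the HMM; the sketch below shows only this generic endpoint:

```python
import numpy as np

def e_bh(e_values, alpha=0.05):
    """Generic e-BH procedure: reject the k largest e-values, where k is the
    biggest index with e_(k) >= m / (alpha * k).  Controls FDR at level
    alpha even under arbitrary dependence among the e-values."""
    m = len(e_values)
    order = np.argsort(e_values)[::-1]               # indices, descending
    thresholds = m / (alpha * np.arange(1, m + 1))   # m/(alpha*1), m/(alpha*2), ...
    ok = e_values[order] >= thresholds
    k = np.max(np.nonzero(ok)[0], initial=-1) + 1
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected
```

Note the thresholds are demanding (m/α for a single rejection), which is why the construction of powerful e-values, rather than the rejection rule itself, carries the statistical work.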
Pengfei Lyu
Ph.D. student at Northeastern University
Machine Learning · Computer Vision · Multi-modal Image Processing
Xianyang Zhang
Department of Statistics, Texas A&M University, College Station, TX, 77843, USA
Hongyuan Cao
Department of Statistics, Florida State University, Tallahassee, FL 32306, USA