SHAP zero Explains Genomic Models with Near-zero Marginal Cost for Future Queried Sequences

📅 2024-10-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The exponential computational cost of Shapley value estimation hinders the interpretability of large-scale biological sequence models. Method: This paper establishes a connection, which the authors describe as surprisingly underexplored, between Shapley values (and Shapley interactions) and the model's Fourier transform. Building on it, the method sketches the model once and estimates its sparse spectrum; after that one-time cost, explanations for newly queried sequences come at near-zero marginal cost. Contribution/Results: Amortized computational cost drops by 2–3 orders of magnitude relative to state-of-the-art algorithms on guide RNA (gRNA) binding and DNA repair outcome prediction tasks. The method reveals almost all predictive motifs, a result that was previously out of reach because of the combinatorial space of possible interactions. The work offers an efficient, scalable basis for globally explaining large genomic sequence models.

📝 Abstract

With the rapid growth of large-scale machine learning models in genomics, Shapley values have emerged as a popular method for model explanations due to their theoretical guarantees. While Shapley values explain model predictions locally for an individual input query sequence, extracting biological knowledge requires global explanation across thousands of input sequences. This demands an exponential number of model evaluations per sequence, resulting in significant computational cost and carbon footprint. Herein, we develop SHAP zero, a method that estimates Shapley values and interactions with a near-zero marginal cost for future queried sequences after paying a one-time fee for model sketching. SHAP zero achieves this by establishing a surprisingly underexplored connection between the Shapley values and interactions and the Fourier transform of the model. Explaining two genomic models, one trained to predict guide RNA binding and the other to predict DNA repair outcome, we demonstrate that SHAP zero achieves orders of magnitude reduction in amortized computational cost compared to state-of-the-art algorithms, revealing almost all predictive motifs, a finding previously inaccessible due to the combinatorial space of possible interactions.
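To make the per-sequence cost mentioned in the abstract concrete, the following minimal sketch computes exact Shapley values by enumerating every coalition of positions. It is an illustration only, not the SHAP zero algorithm: `toy_model`, the 4-nucleotide query, and the convention of replacing absent positions with an all-`A` baseline are assumptions chosen for the example.

```python
from itertools import combinations
from math import factorial


def toy_model(seq):
    # Hypothetical stand-in for a trained genomic model (an assumption):
    # one point per 'G', plus an interaction bonus when positions 0 and 1 are both 'G'.
    score = float(seq.count("G"))
    if seq[0] == "G" and seq[1] == "G":
        score += 2.0
    return score


def exact_shapley(model, query, baseline):
    """Exact Shapley values for one query sequence by full enumeration."""
    n = len(query)
    cache = {}  # coalition -> model output, so each coalition is evaluated once

    def value(coalition):
        if coalition not in cache:
            # Positions outside the coalition fall back to the baseline (assumed masking rule).
            masked = "".join(
                query[j] if j in coalition else baseline[j] for j in range(n)
            )
            cache[coalition] = model(masked)
        return cache[coalition]

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k that exclude position i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi, len(cache)


phi, n_evals = exact_shapley(toy_model, "GGAC", "AAAA")
print(phi, n_evals)
```

Even with caching, the number of distinct model evaluations is 2^n for a sequence of length n, and the whole enumeration has to be repeated for every new query sequence. That repeated cost is what a global explanation over thousands of sequences multiplies.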

Problem

Research questions and friction points this paper is trying to address.

Reducing computational cost for Shapley value explanations in biological sequences

Enabling scalable global insights from interpretable machine learning models

Uncovering feature interactions efficiently in black-box sequence models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Amortizes Shapley computation cost across datasets

Links Shapley values to sparse Fourier transforms

Enables near-zero marginal cost for future queries
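The link between Shapley values and a Fourier spectrum can be illustrated on a toy problem with binary features. The sketch below is an assumption-laden analog, not the paper's construction: it takes a dense Walsh-Hadamard transform of a small set function (the paper instead estimates a sparse spectrum by sketching), then reads Shapley values off the coefficients using the identity φ_i = -2 Σ c(T)/|T|, summed over sets T that contain i and have odd size, which holds for the basis (-1)^{|S∩T|}. The set function `v` and all function names are hypothetical.

```python
from itertools import permutations


def walsh_hadamard(values, n):
    """Coefficients c[T] such that v(S) = sum_T c[T] * (-1)^{|S & T|} (subsets as bitmasks)."""
    size = 1 << n
    return [
        sum(values[s] * (-1) ** bin(s & t).count("1") for s in range(size)) / size
        for t in range(size)
    ]


def shapley_from_spectrum(coeffs, n):
    """Shapley values read directly from the spectrum; only odd-order terms contribute."""
    phi = [0.0] * n
    for t, c in enumerate(coeffs):
        order = bin(t).count("1")
        if order % 2 == 1 and c != 0.0:  # zero coefficients cost nothing
            for i in range(n):
                if (t >> i) & 1:
                    phi[i] += -2.0 * c / order
    return phi


def shapley_by_permutations(values, n):
    """Reference implementation: average marginal contributions over all orderings."""
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for perm in perms:
        s = 0
        for i in perm:
            phi[i] += values[s | (1 << i)] - values[s]
            s |= 1 << i
    return [p / len(perms) for p in phi]


def v(s):
    # Hypothetical set function: features 0 and 1 add 1 each, plus 2 when both
    # are present; feature 2 adds 0.5; feature 3 has no effect.
    x0, x1, x2 = s & 1, (s >> 1) & 1, (s >> 2) & 1
    return x0 + x1 + 2.0 * x0 * x1 + 0.5 * x2


n = 4
values = [v(s) for s in range(1 << n)]
coeffs = walsh_hadamard(values, n)      # one-time cost
phi_fast = shapley_from_spectrum(coeffs, n)
phi_slow = shapley_by_permutations(values, n)
print(phi_fast, phi_slow)
```

In this toy case only a few of the 16 coefficients are non-zero, so once the spectrum is known the attribution step touches very little data. That is the intuition behind paying once for the spectrum and then answering later queries cheaply.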

🔎 Similar Papers

Improving the Weighting Strategy in KernelSHAP