🤖 AI Summary
Subgroup analysis is difficult in datasets with many features, and existing tools rely on user-defined or static subgroups, which limits unexpected discoveries. To address this, we propose exploratory subgroup analysis, a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups, and introduce Divisi, an interactive tool that runs in Jupyter notebooks and is built on a fast, approximate subgroup discovery algorithm. Divisi supports automatic subgroup detection, evaluation on several criteria (e.g., statistical significance, coverage, divergence), and interactive re-ranking and refinement, and its Subgroup Map visualizes how subgroups overlap and how much of the data they cover. In a think-aloud study with 13 practitioners, Divisi helped participants uncover surprising patterns in features and their interactions and encouraged more thorough exploration of subtypes in complex data.
📝 Abstract
Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding about datasets and models. To support these tasks we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.
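To make the terms above concrete, here is a minimal sketch of what a subgroup, its coverage, and the overlap between two subgroups can look like. It is not Divisi's API or its discovery algorithm. It assumes a subgroup is a conjunction of feature-value conditions and measures overlap as Jaccard similarity. The toy rows and the helper names (`members`, `coverage`, `outcome_rate`, `overlap`) are hypothetical.

```python
# Hypothetical illustration only; not Divisi's API or algorithm.
# Toy dataset: each row has two features and a binary outcome (model error).
rows = [
    {"age": "young", "sex": "F", "error": 1},
    {"age": "young", "sex": "M", "error": 0},
    {"age": "old", "sex": "F", "error": 1},
    {"age": "old", "sex": "M", "error": 0},
    {"age": "young", "sex": "F", "error": 1},
    {"age": "old", "sex": "F", "error": 0},
]


def members(rule):
    """Indices of rows that satisfy every (feature, value) condition in rule."""
    return {i for i, r in enumerate(rows) if all(r[f] == v for f, v in rule.items())}


def coverage(rule):
    """Fraction of the dataset that falls inside the subgroup."""
    return len(members(rule)) / len(rows)


def outcome_rate(rule, outcome="error"):
    """Mean of the binary outcome within the subgroup (0.0 if it is empty)."""
    idx = members(rule)
    return sum(rows[i][outcome] for i in idx) / len(idx) if idx else 0.0


def overlap(rule_a, rule_b):
    """Jaccard similarity between the member sets of two subgroups."""
    a, b = members(rule_a), members(rule_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Two example subgroups: a two-condition rule and a broader one-condition rule.
young_f = {"age": "young", "sex": "F"}
female = {"sex": "F"}

print("coverage:", coverage(young_f), coverage(female))
print("error rate:", outcome_rate(young_f), outcome_rate(female))
print("overlap:", overlap(young_f, female))
```

In this toy data the narrower subgroup has a higher error rate than the broader one that contains it. Re-ranking and overlap views are meant to bring this kind of relationship to the surface.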