🤖 AI Summary
This paper addresses violations of expected monotonic trends in the aggregate values of datasets. We introduce Aggregate Order Dependencies (AODs), which formally require that the aggregate value (e.g., mean) of a target attribute monotonically increases or decreases with respect to a total order on the grouping attribute. We propose the first "aggregate-centered" AOD variant, rigorously characterize its computational complexity as NP-hard, and develop a generic algorithmic framework complemented by efficient heuristic strategies. Our approach integrates database repair, combinatorial optimization, and statistical group analysis, supporting diverse aggregate functions (e.g., AVG, SUM, COUNT). Experiments on real and synthetic datasets demonstrate that our algorithms are both efficient and scalable, with heuristics yielding substantial speedups. Case studies detect and interpret non-monotonic anomalies in domain-specific relationships (such as education investment vs. enrollment rates, housing prices vs. neighborhood rankings, and disease incidence vs. age), thereby exposing underlying data quality issues.
📝 Abstract
Datasets often violate expected monotonic trends, such as a higher education level correlating with a higher average salary, newer homes being more expensive, or diabetes prevalence increasing with age. We address the problem of quantifying how far a dataset deviates from such trends. To this end, we introduce Aggregate Order Dependencies (AODs), an aggregation-centric extension of the previously studied order dependencies. An AOD specifies that the aggregated value of a target attribute (e.g., mean salary) should monotonically increase or decrease with the grouping attribute (e.g., education level).
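To make the AOD definition above concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a non-strict (non-decreasing) reading of "monotonically increase", a table stored as a list of dictionaries, and a hypothetical helper name `aod_violations`; the toy salary figures are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean


def aod_violations(rows, group_key, target_key, group_order, agg=mean):
    """Return adjacent group pairs whose aggregates break a non-decreasing trend.

    Hypothetical helper for illustration; names and signature are assumptions.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[target_key])
    # Aggregate each non-empty group, following the order on the grouping attribute.
    aggregates = [(g, agg(groups[g])) for g in group_order if g in groups]
    # Checking adjacent pairs is enough: a sequence is non-decreasing
    # exactly when every neighbouring pair is.
    return [
        (left, right)
        for (left, left_val), (right, right_val) in zip(aggregates, aggregates[1:])
        if left_val > right_val
    ]


# Toy data (invented): mean salary is 45 for HS, 65 for BSc, 60 for MSc.
rows = [
    {"education": "HS", "salary": 40},
    {"education": "HS", "salary": 50},
    {"education": "BSc", "salary": 60},
    {"education": "BSc", "salary": 70},
    {"education": "MSc", "salary": 55},
    {"education": "MSc", "salary": 65},
]
violations = aod_violations(rows, "education", "salary", ["HS", "BSc", "MSc"])
print(violations)  # [('BSc', 'MSc')]
```

Here the AOD "mean salary does not decrease with education level" is violated between the BSc and MSc groups. Passing `sum` or `len` as `agg` would check SUM or COUNT in the same way.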
We formulate the AOD repair problem as finding the smallest set of tuples to delete from a table so that the given AOD is satisfied. We analyze the computational complexity of this problem and propose a general algorithmic template for solving it. We instantiate the template for common aggregation functions, introduce optimization techniques that substantially improve the runtime of the template instances, and develop efficient heuristic alternatives. Our experimental study, carried out on both real-world and synthetic datasets, demonstrates the practical efficiency of the algorithms and provides insight into the performance of the heuristics. We also present case studies that uncover and explain unexpected AOD violations using our framework.
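The repair problem stated above can be illustrated with an exhaustive search on a toy table. This is a sketch under explicit assumptions and not the algorithmic template from the paper: it tries deletion sets in order of increasing size, so it takes exponential time and is only usable on tiny inputs. It also assumes a non-decreasing trend and that a group left empty after deletion imposes no constraint; the function names `satisfies_aod` and `minimal_repair` and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def satisfies_aod(rows, order, agg=mean):
    """Check that per-group aggregates are non-decreasing along `order`.

    `rows` is a list of (group, value) pairs. Empty groups are skipped
    (an assumption of this sketch).
    """
    values = []
    for group in order:
        members = [value for g, value in rows if g == group]
        if members:
            values.append(agg(members))
    return all(a <= b for a, b in zip(values, values[1:]))


def minimal_repair(rows, order, agg=mean):
    """Brute force: return a smallest list of tuples whose deletion satisfies the AOD."""
    n = len(rows)
    for k in range(n + 1):  # try smaller deletion sets first
        for removed in combinations(range(n), k):
            removed_set = set(removed)
            kept = [r for i, r in enumerate(rows) if i not in removed_set]
            if satisfies_aod(kept, order, agg):
                return [rows[i] for i in removed]
    # Not reached: deleting every tuple (k == n) always satisfies the AOD.


# Toy data (invented): group means are 45 (HS), 65 (BSc), 60 (MSc).
table = [
    ("HS", 40), ("HS", 50),
    ("BSc", 60), ("BSc", 70),
    ("MSc", 55), ("MSc", 65),
]
deleted = minimal_repair(table, ["HS", "BSc", "MSc"])
print(deleted)  # [('BSc', 70)]
```

Deleting the single tuple `("BSc", 70)` lowers the BSc mean to 60, which no longer exceeds the MSc mean, so the size of the smallest repair here is 1. That size can serve as a simple measure of how far the table deviates from the expected trend.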