On Sketching Trimmed Statistics

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies space-efficient linear sketches for trimmed statistics of a frequency vector, such as the Fₚ moment of the top-k frequencies and the Fₚ moment of the k-trimmed vector (which excludes the top and bottom k frequencies), in streaming and distributed settings. It initiates the study of sketching these statistics and gives a new condition, relating the k-th largest frequency to the tail mass, that captures their space complexity; the condition is shown to be necessary. For k ≥ n / polylog n, the sketch achieves a (1±ε)-approximation to the top-k Fₚ moment for p ∈ [0,2] using poly(1/ε, log n) space; for general k, the same guarantee holds under the tail-mass condition. For the k-trimmed version, the sketch achieves optimal error guarantees under the same condition. The methods extend to p > 2 and to several related problems, including an improvement on the space bounds of a prior algorithm for computing the h-index. Empirically, the top-k algorithm uses much less space than Count-Sketch while achieving the same error.
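To make the two statistics named above concrete, here is a minimal exact (non-sketched) reference computation in Python. It is only an illustration of the definitions, not the paper's algorithm. The function names `top_k_fp` and `k_trimmed_fp` are hypothetical, and ranking entries by absolute value in the trimmed case is an assumption.

```python
import numpy as np

def top_k_fp(x, k, p):
    """Exact F_p moment of the k largest entries of x in absolute value."""
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]  # descending
    top = mags[:k]
    if p == 0:
        # F_0 counts the nonzero entries among the top k.
        return float(np.count_nonzero(top))
    return float(np.sum(top ** p))

def k_trimmed_fp(x, k, p):
    """Exact F_p moment after dropping the k largest and k smallest entries.

    Assumption: entries are ranked by absolute value.
    """
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))  # ascending
    kept = mags[k:len(mags) - k]
    if p == 0:
        return float(np.count_nonzero(kept))
    return float(np.sum(kept ** p))

x = [5, -3, 0, 8, 1, -2]
print(top_k_fp(x, 2, 2))      # 8^2 + 5^2 = 89.0
print(k_trimmed_fp(x, 1, 1))  # drops 8 and 0, keeps 5 + 3 + 2 + 1 = 11.0
```

These exact versions need the whole vector in memory; a sketch aims to approximate the same quantities in far less space.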

📝 Abstract
We present space-efficient linear sketches for estimating trimmed statistics of an $n$-dimensional frequency vector $x$, e.g., the sum of $p$-th powers of the largest $k$ frequencies (i.e., entries) in absolute value, or the $k$-trimmed vector, which excludes the top and bottom $k$ frequencies. This is called the $F_p$ moment of the trimmed vector. Trimmed measures are used in robust estimation, as seen in the R programming language's `trim.var' function and the `trim' parameter in the mean function. Linear sketches improve time and memory efficiency and are applicable to streaming and distributed settings. We initiate the study of sketching these statistics and give a new condition for capturing their space complexity. When $k \ge n/\mathrm{polylog}\, n$, we give a linear sketch using $\mathrm{poly}(1/\varepsilon, \log n)$ space which provides a $(1 \pm \varepsilon)$ approximation to the top-$k$ $F_p$ moment for $p \in [0,2]$. For general $k$, we give a sketch with the same guarantees under a condition relating the $k$-th largest frequency to the tail mass, and show this condition is necessary. For the $k$-trimmed version, our sketch achieves optimal error guarantees under the same condition. We extend our methods to $p>2$ and also address related problems such as computing the $F_p$ moment of frequencies above a threshold, finding the largest $k$ such that the $F_p$ moment of the top $k$ exceeds $k^{p+1}$, and the $F_p$ moment of the top $k$ frequencies such that each entry is at least $k$. Notably, our algorithm for this third application improves upon the space bounds of the algorithm of Govindan, Monemizadeh, and Muthukrishnan (PODS '17) for computing the $h$-index. We show empirically that our top $k$ algorithm uses much less space compared to Count-Sketch while achieving the same error.
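The three related problems listed in the abstract can be pinned down with exact reference computations. The sketch below is one reading of those definitions, not the paper's sketching algorithms. The function names are hypothetical, $p > 0$ is assumed, "above a threshold" is taken as "at least the threshold", "exceeds" is taken as a strict inequality, and the third function takes the largest $k$ whose top $k$ entries are all at least $k$ (the $h$-index) before summing $p$-th powers.

```python
import numpy as np

def fp_above_threshold(x, tau, p):
    """F_p moment of entries whose magnitude is at least tau (assumed convention)."""
    mags = np.abs(np.asarray(x, dtype=float))
    return float(np.sum(mags[mags >= tau] ** p))

def largest_k_exceeding(x, p):
    """Largest k such that the F_p moment of the top k entries exceeds k^(p+1)."""
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]
    prefix = np.cumsum(mags ** p)  # prefix[k-1] = F_p moment of the top k
    best = 0
    for k in range(1, len(mags) + 1):
        if prefix[k - 1] > k ** (p + 1):
            best = k
    return best

def h_index_fp(x, p):
    """Return (h, F_p moment of the top h entries), where h is the largest k
    such that each of the top k entries is at least k."""
    mags = np.sort(np.abs(np.asarray(x, dtype=float)))[::-1]
    h = 0
    for k in range(1, len(mags) + 1):
        if mags[k - 1] >= k:
            h = k
    return h, float(np.sum(mags[:h] ** p))

citations = [10, 8, 5, 4, 3, 0]
print(fp_above_threshold(citations, 5, 1))  # 10 + 8 + 5 = 23.0
print(largest_k_exceeding(citations, 1))    # 5, since 30 > 25 but 30 <= 36
print(h_index_fp(citations, 1))             # (4, 27.0)
```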
Problem

Research questions and friction points this paper is trying to address.

Estimating trimmed statistics of frequency vectors efficiently
Developing space-efficient linear sketches for robust estimation
Improving space complexity for top-k frequency moments
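For context on what a linear sketch is, here is a toy Count-Sketch, the baseline the paper compares against, used to estimate a top-k F_p moment by recovering per-coordinate estimates and keeping the k largest. This is a simplified illustration under stated assumptions and not the paper's algorithm: the class and function names are hypothetical, and the hash and sign tables are stored as fully random arrays for readability, which a space-efficient implementation would replace with compact hash functions.

```python
import numpy as np

class CountSketch:
    """Toy Count-Sketch over an n-dimensional vector (illustration only)."""

    def __init__(self, n, rows, width, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: explicit random tables instead of compact hash functions.
        self.bucket = rng.integers(0, width, size=(rows, n))
        self.sign = rng.choice([-1, 1], size=(rows, n))
        self.table = np.zeros((rows, width))
        self.rows = rows

    def update(self, i, delta):
        """Apply the stream update x[i] += delta."""
        for r in range(self.rows):
            self.table[r, self.bucket[r, i]] += self.sign[r, i] * delta

    def estimate_all(self):
        """Median-of-rows estimate of every coordinate."""
        r_idx = np.arange(self.rows)[:, None]
        per_row = self.sign * self.table[r_idx, self.bucket]
        return np.median(per_row, axis=0)

def top_k_fp_estimate(sketch, k, p):
    """Baseline: F_p moment of the k largest estimated magnitudes."""
    est = np.sort(np.abs(sketch.estimate_all()))[::-1]
    return float(np.sum(est[:k] ** p))

# Linearity: sketches built with the same seed can be merged by adding tables.
n = 1000
a = CountSketch(n, 5, 64, seed=1)
b = CountSketch(n, 5, 64, seed=1)
both = CountSketch(n, 5, 64, seed=1)
for i, d in [(3, 50), (7, -40), (3, 10)]:
    a.update(i, d)
    both.update(i, d)
for i, d in [(7, 5), (42, 30)]:
    b.update(i, d)
    both.update(i, d)
print(np.array_equal(a.table + b.table, both.table))  # True
print(top_k_fp_estimate(both, 2, 2))  # randomized estimate of 60^2 + 35^2
```

The merge step is what makes the sketch usable in distributed settings: each site sketches its own updates and only the small tables are combined.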
Innovation

Methods, ideas, or system contributions that make the work stand out.

Space-efficient linear sketches for trimmed statistics
Approximation guarantees for top-k F_p moments
Optimal error for k-trimmed vector under conditions