🤖 AI Summary
This paper investigates the *k*-way projection reconstruction problem for string sets: given unlabeled projections onto all *k*-element position subsets, under what conditions is the original set uniquely reconstructible? It further examines single-string reconstruction and characterizes the *k*-wise independence threshold, i.e., the largest *k* for which projections completely erase distinguishing information. Methodologically, the authors introduce a combinatorial modeling framework based on *non-contiguous k-mers*, extending overlap graph algorithms to handle arbitrary (non-adjacent) position projections and thereby relaxing the conventional contiguous-*k*-mer assumption. They define the *k*-wise independence critical point and establish a parameterized complexity model. Theoretically, they prove the problem is NP-hard and inapproximable in general, yet fixed-parameter tractable when either string length or *k* is bounded. Empirical evaluation demonstrates that their algorithm achieves high efficiency and scalability across multi-scale datasets.
📝 Abstract
Graphs are a powerful tool for analyzing large data sets, but many real-world phenomena involve interactions that go beyond the simple pairwise relationships captured by a graph. In this paper we introduce and study a simple combinatorial model to capture higher order dependencies from an algorithms and computational complexity perspective. Specifically, we introduce the String Set Reconstruction problem, which asks when a set of strings can be reconstructed from seeing only the k-way projections of strings in the set. This problem is distinguished from genetic reconstruction problems in that we allow projections from any k indices and we maintain knowledge of those indices, but not which k-mer came from which string. We give several results on the complexity of this problem, including hardness results, inapproximability, and parametrized complexity.
Our main result is a new algorithm for this problem that uses a modified version of the overlap graphs from genetic reconstruction algorithms. A key difference we must overcome is that in our setting the k-mers need not be contiguous, unlike in genetic reconstruction. We demonstrate our algorithm's efficiency in a variety of experiments and give high-level explanations for how its complexity is observed to scale with various parameters. We back up these explanations with analytic approximations. We also consider two related problems: whether a single string can be reconstructed from the k-way projections of a given set of strings, and finding the largest k at which the k-way projections give no information about the original data set (i.e., the largest k for which it is "k-wise independent").
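To make the notion of a k-way projection concrete, here is a minimal Python sketch. It is not the paper's algorithm. It only computes, for each set of k positions, the k-mers observed at those positions, without recording which string each k-mer came from. The function name `k_way_projections`, the assumption that all strings have equal length, and the choice to store k-mers as sets rather than multisets are illustrative assumptions, not details taken from the paper.

```python
from itertools import combinations


def k_way_projections(strings, k):
    """Map each k-subset of positions to the set of k-mers seen there.

    Assumptions (not from the paper): all strings share one length, and
    k-mers are kept as a set, so multiplicities are discarded.
    """
    strings = set(strings)
    lengths = {len(s) for s in strings}
    if len(lengths) != 1:
        raise ValueError("expected a non-empty set of equal-length strings")
    n = lengths.pop()
    # The index tuple is kept as the key; the source string is forgotten.
    return {
        idx: {"".join(s[i] for i in idx) for s in strings}
        for idx in combinations(range(n), k)
    }


# Two different string sets with identical 1-way projections:
a = {"00", "11"}
b = {"01", "10"}
print(k_way_projections(a, 1) == k_way_projections(b, 1))  # same at k = 1
print(k_way_projections(a, 2) == k_way_projections(b, 2))  # differ at k = 2
```

In this toy case the sets {00, 11} and {01, 10} cannot be told apart from their 1-way projections, while their 2-way projections differ, which illustrates why reconstructibility depends on k.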