An Efficient Candidate-Free R-S Set Similarity Join Algorithm with the Filter-and-Verification Tree and MapReduce

📅 2025-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the I/O and verification overhead caused by candidate-pair explosion in large-scale set similarity R-S Joins, this paper proposes a single-stage filter-and-verification framework that eliminates candidate set generation entirely. The core innovation is the filter-and-verification tree (FVT) and its linear variant (LFVT), which merge the traditional two phases of filtering and verification into one decision procedure while keeping elements and their associated sets compressed in memory. The work also extends the approach to MapReduce, yielding a candidate-free R-S Join that runs in parallel. The proposed algorithms (CF-RS-Join/FVT, CF-RS-Join/LFVT, and their MapReduce counterparts MR-CF-RS-Join/FVT and MR-CF-RS-Join/LFVT) are evaluated on seven real-world datasets for execution time, scalability, memory usage, and disk usage. MR-CF-RS-Join/LFVT achieves the best performance among the compared algorithms.

📝 Abstract

Given two different collections of sets, the exact set similarity R-S Join finds all set pairs whose similarity is no less than a given threshold, a task with widespread applications. Existing algorithms accelerate large-scale R-S Joins with a two-stage filter-and-verification framework combined with the parallel and distributed MapReduce framework, but they generate excessive candidate set pairs, which leads to significant I/O, data transfer, and verification overhead and ultimately degrades performance. This paper proposes novel candidate-free R-S Join (CF-RS-Join) algorithms that integrate filtering and verification into a single stage through filter-and-verification trees (FVTs) and their linear variants (LFVTs). First, CF-RS-Join with FVT (CF-RS-Join/FVT) leverages an innovative FVT structure that compresses elements and their associated sets in memory, enabling single-stage processing that eliminates candidate set generation, supports fast lookups, and reduces database scans. Correctness proofs are provided. Second, CF-RS-Join with LFVT (CF-RS-Join/LFVT) exploits a more compact Linear FVT, which compresses non-branching paths into single nodes and stores them in linear arrays for optimized traversal. Third, MR-CF-RS-Join/FVT and MR-CF-RS-Join/LFVT extend these approaches with MapReduce for parallel processing. Empirical studies on seven real-world datasets evaluate the proposed algorithms against selected existing algorithms in terms of execution time, scalability, memory usage, and disk usage. Experimental results demonstrate that the MapReduce-based algorithm MR-CF-RS-Join/LFVT achieves the best performance.
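To make the problem statement concrete, here is a minimal brute-force sketch of an exact R-S Join. It is not any algorithm from the paper: the abstract does not name the similarity function, so Jaccard similarity is assumed here, and the names `jaccard` and `naive_rs_join` are hypothetical.

```python
from itertools import product


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_rs_join(R, S, threshold):
    """Compare every set in R with every set in S (quadratic baseline).

    Returns index pairs (i, j) with jaccard(R[i], S[j]) >= threshold.
    """
    return [
        (i, j)
        for (i, r), (j, s) in product(enumerate(R), enumerate(S))
        if jaccard(r, s) >= threshold
    ]


R = [frozenset({1, 2, 3}), frozenset({4, 5})]
S = [frozenset({1, 2, 3, 4}), frozenset({4, 5}), frozenset({6})]
pairs = naive_rs_join(R, S, 0.75)
```

This baseline examines all |R| × |S| pairs, which is the cost that filter-and-verification approaches aim to avoid.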

Problem

Research questions and friction points this paper is trying to address.

Efficiently find similar set pairs without candidate generation

Reduce I/O and verification overhead in set similarity joins

Enable parallel processing using MapReduce for large-scale datasets
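The first two points can be illustrated with a small sketch that decides each pair in one pass, without writing out a candidate list for a later verification step. This uses a plain inverted index with overlap counting, not the paper's FVT, and it assumes Jaccard similarity with a threshold strictly greater than zero. The function name `single_pass_join` is hypothetical.

```python
from collections import Counter, defaultdict


def single_pass_join(R, S, threshold):
    """Join R and S by counting overlaps through an inverted index on S.

    Assumes Jaccard similarity and threshold > 0, so pairs with no shared
    element can be skipped. The similarity is derived from the overlap count:
    |r ∩ s| / (|r| + |s| - |r ∩ s|).
    """
    index = defaultdict(list)  # element -> ids of sets in S containing it
    for j, s in enumerate(S):
        for e in s:
            index[e].append(j)

    results = []
    for i, r in enumerate(R):
        overlap = Counter()  # id of set in S -> number of shared elements
        for e in r:
            for j in index.get(e, ()):
                overlap[j] += 1
        for j, o in overlap.items():
            if o / (len(r) + len(S[j]) - o) >= threshold:
                results.append((i, j))
    return sorted(results)


R = [frozenset({1, 2, 3}), frozenset({4, 5})]
S = [frozenset({1, 2, 3, 4}), frozenset({4, 5}), frozenset({6})]
pairs = single_pass_join(R, S, 0.75)
```

Because the overlap count and the two set sizes determine the Jaccard value exactly, no second scan of the original sets is needed to confirm a pair.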

Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-stage processing with filter-and-verification trees

Linear FVT for optimized traversal and storage

MapReduce integration for parallel processing
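The idea of compressing non-branching paths into single nodes can be sketched on an ordinary prefix tree of sorted sets. This is only an illustration of path compression in general. The paper's LFVT layout and its linear-array storage are not reproduced here, and `build_trie` and `compress` are hypothetical names.

```python
def build_trie(sets):
    """Insert each set, as a sorted element sequence, into a prefix tree."""
    root = {"children": {}, "ids": []}
    for sid, s in enumerate(sets):
        node = root
        for e in sorted(s):
            node = node["children"].setdefault(e, {"children": {}, "ids": []})
        node["ids"].append(sid)  # the set with id `sid` ends at this node
    return root


def compress(node, label=()):
    """Merge chains of single-child nodes that end no set into one node."""
    label = list(label)
    while len(node["children"]) == 1 and not node["ids"]:
        ((e, child),) = node["children"].items()
        label.append(e)
        node = child
    return {
        "label": tuple(label),
        "ids": node["ids"],
        "children": [
            compress(c, (e,)) for e, c in sorted(node["children"].items())
        ],
    }


tree = compress(build_trie([{1, 2, 3}, {1, 2, 4}, {5, 6}]))
```

In this example the uncompressed tree has seven nodes including the root, while the compressed one has five: the chains 1, 2 and 5, 6 each collapse into a single node.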

🔎 Similar Papers

TreeTracker Join: Simple, Optimal, Fast

2024-03-03, Citations: 1