How improve Set Similarity Join based on prefix approach in distributed environment

Conference Paper

Publication Date:

2018

Abstract:

Set similarity join is an essential operation in data integration and big data analytics that finds similar pairs of records, where the records contain string or set-based data. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as MapReduce. In particular, Vernica et al. [3] proposed a MapReduce implementation of the PPJoin algorithm [2], which a recent study experimentally demonstrated to be one of the best set similarity join algorithms [4]. These techniques, however, usually produce huge numbers of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly degrades performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skew problem and negatively affects the scalability of these techniques.
To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, that performs set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it with one of the most popular distributed frameworks, Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both the traditional PPJoin and the MapReduce implementation proposed in [3].
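
For readers unfamiliar with the prefix approach the abstract refers to, the following is a minimal single-machine sketch of the classic prefix filter for a Jaccard self-join. It is not TTJoin and not the authors' implementation; the function names, the Jaccard measure, and the rare-tokens-first ordering are assumptions chosen for illustration. The idea is that two sets with Jaccard similarity at least t must share a token within their first n - ceil(t*n) + 1 tokens under a common global order, so only records colliding on a prefix token need to be verified.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (i, j, sim) for i < j with Jaccard(records[i], records[j]) >= threshold.

    Illustrative sketch only (hypothetical helper, not the paper's algorithm).
    """
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # prefix token -> ids of records seen so far
    results = []
    for rid, tokens in enumerate(ordered):
        n = len(tokens)
        # Small epsilon guards against floating-point overshoot in threshold * n.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1 if n else 0
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])  # earlier records sharing this prefix token
            index[tok].append(rid)
        # Verification step: compute the exact similarity for each candidate once.
        for cid in sorted(candidates):
            sim = jaccard(tokens, ordered[cid])
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results


records = [
    ["a", "b", "c", "d", "e"],
    ["a", "b", "c", "d", "f"],
    ["x", "y", "z"],
]
pairs = [(i, j) for i, j, _ in prefix_filter_join(records, 0.6)]
```

In this sketch the candidate set removes repeated pairs locally. In a distributed setting where records are replicated once per prefix token, the same pair can be generated on several workers, which is the kind of duplication the abstract describes.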

Iris type:

4.1 Contribution in conference proceedings

Keywords:

Similarity Join; Big Data; Record Linkage

List of contributors:

Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico

Authors of the University:

GAGLIARDELLI LUCA

Handle:

https://iris.uniecampus.it/handle/11389/69839

Book title:

2018 International Conference on High Performance Computing & Simulation, HPCS 2018, Orleans, France, July 16-20, 2018.