UNI-FIND

How improve Set Similarity Join based on prefix approach in distributed environment

Conference Paper
Publication Date:
2018
abstract:
Set similarity join is an essential operation in data integration and big data analytics: it finds similar pairs of records where the records contain string or set-based data. To cope with the increasing scale of the data, several techniques have been proposed to perform set similarity joins using distributed frameworks, such as the MapReduce framework. In particular, Vernica et al. [3] proposed a MapReduce implementation of the so-called PPJoin algorithm [2], which a recent study experimentally demonstrated to be one of the best set similarity join algorithms [4]. These techniques, however, usually produce huge amounts of duplicates in order to perform parallel processing successfully. The large number of duplicates incurs both a large shuffle cost and an unnecessary computation cost, which significantly decrease performance. Moreover, these approaches do not provide a load balancing guarantee, which results in a skewness problem and negatively affects the scalability of these techniques.
To address these problems, in this paper we propose a duplicate-free framework, called TTJoin, that performs set similarity joins efficiently by utilizing an innovative filter based on prefix tokens, and we implement it on one of the most popular distributed frameworks, Apache Spark. Experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to both the traditional PPJoin and the MapReduce implementation proposed in [3].
Iris type:
4.1 Contribution in conference proceedings
Keywords:
Similarity Join; Big Data; Record Linkage
List of contributors:
Zhu, Song; Gagliardelli, Luca; Simonini, Giovanni; Beneventano, Domenico
Authors of the University:
GAGLIARDELLI LUCA
Handle:
https://iris.uniecampus.it/handle/11389/69839
Book title:
2018 International Conference on High Performance Computing & Simulation, HPCS 2018, Orleans, France, July 16-20, 2018.