Article Information
Abstract (English)
Top-k similarity join is used in a wide range of applications that must compute the k most similar pairs of records in a given database. However, time performance becomes a challenge as a growing number of applications need to process massive data, and finding the top-k pairs in such vast amounts of data with traditional methods is impractical. In this paper, we propose an RDD-based algorithm that performs the top-k similarity join for massive multidimensional data on a large cluster of commodity machines using Spark. The algorithm consists of four steps: it loads a set of multidimensional records stored in HDFS and finally outputs an ordered list of the top-k closest pairs back into HDFS. First, we develop an efficient distance function based on LSH (Locality-Sensitive Hashing) to improve the efficiency of pairwise similarity comparison. Second, to minimize the amount of data moved at RDD run time, we conceptually split all pairs of LSH signatures into partitions. Moreover, we exploit a serial computation strategy within each partition so that all local top-k closest pairs can be calculated in parallel. Finally, the local top-k pairs, sorted by their Hamming distances, are merged to produce the global top-k pairs. Our performance evaluation comparing Spark and Hadoop confirms the effectiveness and scalability of the RDD-based algorithm.
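The pipeline the abstract describes (LSH signatures, Hamming-distance comparison, local top-k merged into a global top-k) can be sketched on a single machine. The following is a minimal, illustrative Python sketch, not the paper's Spark implementation: the function and parameter names (`topk_similar_pairs`, `num_bits`, the random-hyperplane SimHash scheme) are assumptions for illustration, and the brute-force pair enumeration stands in for the paper's partitioned RDD computation.

```python
import itertools
import random

def lsh_signature(point, hyperplanes):
    # SimHash-style LSH: one bit per random hyperplane, set by the
    # sign of the dot product between the point and the hyperplane.
    return tuple(1 if sum(a * b for a, b in zip(point, h)) >= 0 else 0
                 for h in hyperplanes)

def hamming(s1, s2):
    # Hamming distance: number of differing signature bits.
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

def topk_similar_pairs(points, k, num_bits=16, seed=42):
    # Draw random Gaussian hyperplanes (fixed seed for reproducibility).
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(num_bits)]
    sigs = [lsh_signature(p, hyperplanes) for p in points]
    # Compare all signature pairs; in the distributed setting these pairs
    # would be split across partitions, each computing a local top-k.
    pairs = ((hamming(sigs[i], sigs[j]), i, j)
             for i, j in itertools.combinations(range(len(points)), 2))
    # Global merge: order all candidate pairs by Hamming distance.
    return sorted(pairs)[:k]
```

In the Spark version described by the abstract, the all-pairs enumeration and the final sort would correspond to per-partition work followed by a global merge of the locally sorted top-k lists.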
Table of Contents
1. Introduction
2. Background
2.1. Locality-sensitive Hashing
2.2. Hamming Distance
2.3. Spark
2.4. Resilient Distributed Datasets (RDDs)
3. Problem Definition
4. RDD-based Algorithm
4.1. Algorithm Overview
4.2. Computing Each Point's Signature S(p)
4.3. Data Division
4.4. Bucket Group
4.5. Calculating Top-k Similar Pairs
5. Experiments
5.1. Methodology and Cluster Setup
5.2. Performance Evaluation
6. Related Work
7. Conclusion
References
