Article Information
Abstract (English)
Top-k similarity join is used in a wide range of applications that must compute the k most similar pairs of records in a given database. However, time performance becomes a challenge as a growing number of applications need to process massive data, and finding the top-k pairs in such vast amounts of data with traditional methods is impractical. In this paper, we propose an RDD-based algorithm that performs the top-k similarity join for massive multidimensional data on a large cluster of commodity machines using Spark. The algorithm consists of four steps: it loads a set of multidimensional records stored in HDFS and finally outputs an ordered list of the top-k closest pairs back into HDFS. First, we develop an efficient distance function based on LSH (Locality-Sensitive Hashing) to improve the efficiency of pairwise similarity comparison. Second, to minimize the amount of data moved at RDD run time, we conceptually split all pairs of LSH signatures into partitions. Moreover, we exploit a serial computation strategy within each partition so that all local top-k closest pairs can be calculated in parallel. Finally, the local top-k pairs, sorted by their Hamming distances, are merged to produce the global top-k pairs. Our performance evaluation comparing Spark and Hadoop confirms the effectiveness and scalability of the RDD-based algorithm.
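The pipeline the abstract describes (LSH signatures, Hamming-distance comparison, local top-k merged into a global top-k) can be sketched on a single machine. The following is a minimal, illustrative Python sketch, not the paper's Spark implementation: the function and parameter names (`topk_similar_pairs`, `num_bits`, the random-hyperplane SimHash scheme) are assumptions for illustration, and the brute-force pair enumeration stands in for the paper's partitioned RDD computation.

```python
import itertools
import random

def lsh_signature(point, hyperplanes):
    # SimHash-style LSH: one bit per random hyperplane, set by the
    # sign of the dot product between the point and the hyperplane.
    return tuple(1 if sum(a * b for a, b in zip(point, h)) >= 0 else 0
                 for h in hyperplanes)

def hamming(s1, s2):
    # Hamming distance: number of differing signature bits.
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

def topk_similar_pairs(points, k, num_bits=16, seed=42):
    # Draw random Gaussian hyperplanes (fixed seed for reproducibility).
    rng = random.Random(seed)
    dim = len(points[0])
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(num_bits)]
    sigs = [lsh_signature(p, hyperplanes) for p in points]
    # Compare all signature pairs; in the distributed setting these pairs
    # would be split across partitions, each computing a local top-k.
    pairs = ((hamming(sigs[i], sigs[j]), i, j)
             for i, j in itertools.combinations(range(len(points)), 2))
    # Global merge: order all candidate pairs by Hamming distance.
    return sorted(pairs)[:k]
```

In the Spark version described by the abstract, the all-pairs enumeration and the final sort would correspond to per-partition work followed by a global merge of the locally sorted top-k lists.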
Table of Contents
1. Introduction
2. Background
2.1. Locality-sensitive Hashing
2.2. Hamming Distance
2.3. Spark
2.4. Resilient Distributed Datasets (RDDs)
3. Problem Definition
4. RDD-based Algorithm
4.1. Algorithm Overview
4.2. Computing Each Point's Signature S(p)
4.3. Data Division
4.4. Bucket Group
4.5. Calculating Top-k Similar Pairs
5. Experiments
5.1. Methodology and Cluster Setup
5.2. Performance Evaluation
6. Related Work
7. Conclusion
References
