An Improved K-means Algorithm based on Mapreduce and Grid

Abstract

The traditional K-means clustering algorithm makes it difficult to choose the number of clusters K, and it selects the initial cluster centers at random, which makes the clustering results very unstable. The algorithm is also susceptible to noise points. To address these problems, we improve the traditional K-means algorithm. The improved method divides the data space into equal-sized grid cells, assigns each data point to the corresponding cell according to its attribute values, and counts the number of data points in each cell. It then selects the M (M > K) cells that contain the most data points and calculates the central point of each. These M central points serve as input data, and the value of K is determined from the clustering results. Among the M points, the K points farthest from one another are chosen as the initial cluster centers of the K-means clustering algorithm; the point with the maximum value among the M must be included in the K. If the number of data points in a cell is less than a threshold, those points are considered noise points and are removed. To let the improved algorithm handle large data sets, we parallelize the improved K-means algorithm and combine it with the MapReduce framework. Theoretical analysis and experimental results show that, compared with the traditional K-means clustering algorithm, the improved algorithm produces higher-quality results, needs fewer iterations, and is more stable. The parallelized algorithm processes data very efficiently and has good scalability and speedup.
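The grid-based initialization described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `grid_init_centers` and its parameters (`cells_per_dim`, `noise_threshold`) are hypothetical, K is taken as given rather than derived from a clustering of the M points, "farthest from each other" is approximated with a greedy max-min selection, and the required "maximum" point is read as the center of the most populated cell.

```python
from collections import defaultdict
from math import dist


def grid_init_centers(points, k, m, cells_per_dim, noise_threshold):
    """Return (initial_centers, kept_points) for a later K-means run.

    Hypothetical helper: names and parameters are assumptions for illustration.
    """
    if not k < m:
        raise ValueError("the abstract requires M > K")

    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]

    def cell_of(p):
        # Map each attribute value to an equal-width cell index per dimension.
        idx = []
        for d in range(dims):
            span = hi[d] - lo[d]
            i = int((p[d] - lo[d]) / span * cells_per_dim) if span > 0 else 0
            idx.append(min(i, cells_per_dim - 1))  # clamp the upper boundary
        return tuple(idx)

    # Assign every point to its cell; the list length is the cell's count.
    cells = defaultdict(list)
    for p in points:
        cells[cell_of(p)].append(p)

    # Cells below the threshold are treated as noise and dropped.
    dense = [pts for pts in cells.values() if len(pts) >= noise_threshold]
    dense.sort(key=len, reverse=True)

    # Central points (per-dimension means) of the M most populated cells.
    candidates = [
        tuple(sum(coord) / len(pts) for coord in zip(*pts)) for pts in dense[:m]
    ]
    if len(candidates) < k:
        raise ValueError("fewer than K dense cells; adjust the grid or threshold")

    # Greedy max-min selection (an assumption), seeded with the densest cell.
    chosen = [candidates[0]]
    remaining = candidates[1:]
    while len(chosen) < k:
        nxt = max(remaining, key=lambda c: min(dist(c, s) for s in chosen))
        chosen.append(nxt)
        remaining.remove(nxt)

    kept = [p for pts in dense for p in pts]
    return chosen, kept
```

The returned centers would seed an ordinary K-means loop over `kept`, so the random initialization step is replaced while the iteration itself stays unchanged.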

Table of Contents

Abstract
 1. Introduction
 2. Relevant Methods
  2.1 K-means Clustering Algorithm
  2.2 K-means Algorithm Advantages and Disadvantages
  2.3 Improved method of K-means algorithm
  2.4 Improved k-means Algorithm Parallelization
 3. Experimental Analysis
 4. Conclusion
 Acknowledgements
 References

Author Information

  • Li Ma: Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044; School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044; Key Laboratory of Meteorological Disaster of Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044
  • Lei Gu: Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044; School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044
  • Bo Li: Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044; CMA Research Centre for Strategic Development, Beijing 100081
  • Yue Ma: School of Mathematics and Statistics, Nanjing University of Information Science & Technology, Nanjing 210044
  • Jin Wang: Jiangsu Engineering Center of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing 210044; School of Computer & Software, Nanjing University of Information Science & Technology, Nanjing 210044; Key Laboratory of Meteorological Disaster of Ministry of Education, Nanjing University of Information Science & Technology, Nanjing 210044
