A Pipeline Model to Discover Frequent Itemsets in an Hierarchical Systems

Abstract

English

Like other fields of data processing, modern information systems have integrated the advanced technologies of recent decades. These systems contain implicit data that must be extracted and exploited using data mining techniques. One such technique is association rule mining, which seeks interesting associations or correlations among large amounts of data. It is a two-step process: the first step finds all frequent itemsets, and the second step constructs association rules from these frequent itemsets. The overall performance of association rule mining is determined by the first step, which is therefore the focus of this work. This step is expensive, with high demands on computation and data access. Parallel computing has a natural role to play here, since parallel computers provide scalability. In this paper, we examine the problem of mining association rules among items in large transaction databases using the Apriori algorithm proposed by Agrawal. In this context, we propose a new parallel version of Apriori, a core algorithm of association rule mining. Our objective is an efficient parallel execution time, which requires a delicate balance between program granularity and the communication latency (synchronization overhead) between the different granules. Unlike previous work on the parallelization of specific data mining algorithms, our approach consists in identifying the different granularity levels of parallelism and their impact on performance. We focus on task and data parallelism (a hybrid approach) under distributed memory. In particular, if communication latency is minimal, fine-grain partitioning yields the best performance; this is the case when data parallelism is used. If communication latency is large (as in a loosely coupled system), coarse-grain partitioning is more appropriate. For the target architecture used in this work (distributed-shared memory), load balancing among the nodes becomes a more critical issue for achieving high performance. We have carried out a detailed evaluation of the parallelization techniques and of the impact of combining different types of parallelism (task, data, and pipeline) on the effectiveness of the system.
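The two-step process described in the abstract can be illustrated with a minimal sequential sketch. This is not the parallel version proposed in the paper; it is a textbook-style rendering of level-wise Apriori, and the function names `frequent_itemsets` and `association_rules`, as well as the absolute-count form of `min_support`, are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1 (illustrative): level-wise search for frequent itemsets.

    Returns a dict mapping each frequent itemset (a frozenset) to its
    support count. `min_support` is an absolute count (an assumption).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            # Prune: keep a candidate only if all (k-1)-subsets are frequent.
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


def association_rules(freq, min_confidence):
    """Step 2 (illustrative): rules X -> Y with confidence supp(X u Y) / supp(X)."""
    rules = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself in `freq`.
                confidence = supp / freq[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules
```

In this sketch the counting pass dominates the cost, which is consistent with the abstract's remark that the first step is the expensive one. Because per-partition counts can simply be summed, that pass is a plausible place to apply data parallelism, though how the paper actually partitions the work is not stated here.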

Table of Contents

Abstract
 1. Introduction
 2. Parallel and Distributed Association Rule Mining Algorithms
 3. Parallel Approach to Discover Frequent Itemsets
  3.1. Apriori algorithm
  3.2. First approach: Task parallelism
  3.3. Dependency graph
 4. Data Parallelism
  4.1. Parallelization of the task T1
  4.2. Parallelization of the task T2-1
  4.3. Parallelization of task T2-2
  4.4. Parallelization of task T4
  4.5. Parallelization of task T5
  4.6. Parallelization of task T6-1 and T6-2
 5. Hybrid Approach
  5.1. Hybrid Approach using pipeline model
  5.2. Hybrid Approach without using pipeline model
 6. Experimental Evaluation
  6.1. Experimental Platform
  6.2. Comparative study between the different proposed approaches
 7. Conclusion and Future Works
 References

Author Information

  • Khedija Arour, Department of Computer Science and Mathematics, National Institute of Applied Science and Technology of Tunisia, 1080 Tunis, Tunisia
