IJIBC 12-1-4

PDFindexer: Distributed PDF Indexing system using MapReduce

Abstract

English

Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo deliver sub-second response times, which is made possible by efficient indexing of pages across the entire Web. Indexing becomes challenging as the scale of the collection grows. Parallel techniques, such as the MapReduce framework, can help make large-scale indexing efficient. In this paper we propose PDFindexer, a system for indexing scientific papers in PDF format using the MapReduce programming model. Unlike Web search engines, our target domain is scientific papers, which have a pre-defined structure: title, abstract, sections, and references. The proposed system parses scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. We provide an overview of the system, its components, and the interactions among them. We also discuss issues related to the design of the system and the use of MapReduce in parsing and indexing large document collections.
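
As a rough illustration of the idea in the abstract, the sketch below builds a field-aware inverted index in the map/shuffle/reduce style. It is not the PDFindexer implementation: it runs in a single process instead of on a cluster, it assumes the PDFs have already been parsed into fields, and every name in it (`map_paper`, `shuffle`, `reduce_postings`, `build_index`, the sample `papers`) is hypothetical.

```python
import re
from collections import defaultdict


def map_paper(doc_id, fields):
    """Map step: emit (term, (doc_id, field)) for each token of each field."""
    for field, text in fields.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            yield term, (doc_id, field)


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce runtime would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(term, values):
    """Reduce step: collapse duplicates into a sorted posting list for one term."""
    return term, sorted(set(values))


def build_index(papers):
    """Run map, shuffle, and reduce in-process over already-parsed papers."""
    pairs = (
        pair
        for doc_id, fields in papers.items()
        for pair in map_paper(doc_id, fields)
    )
    return dict(reduce_postings(t, v) for t, v in shuffle(pairs).items())


# Hypothetical parsed papers: doc id -> {field name: text}.
papers = {
    "p1": {"title": "Distributed indexing", "abstract": "Indexing with MapReduce"},
    "p2": {"title": "PDF parsing", "abstract": "Parsing scientific papers"},
}
index = build_index(papers)
```

Keeping the field name in each posting is one way to exploit the pre-defined structure of scientific papers, for example to restrict a query to titles only.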

Table of Contents

Abstract
 1. Introduction
 2. Background and Related Works
  A. Information Retrieval
  B. MapReduce framework
 3. PDFindexer System Proposal
 4. Design Details and Implementation Plan
  A. Preprocessing large-scale PDF articles with MapReduce
  B. Text-indexing of parsed articles with MapReduce
  C. Querying on resulted indices
 5. Conclusion and Future Works
 References

Author Information

  • Aziz Murtazaev, Department of Computer Engineering, Ajou University, Korea
  • Jang-Su Kihm, Department of Computer Engineering, Ajou University, Korea
  • Sangyoon Oh, Department of Computer Engineering, Ajou University, Korea
