PDFindexer : Distributed PDF Indexing system using MapReduce

Aziz Murtazaev; Jang-Su Kihm; Sangyoon Oh

IJIBC 12-1-4


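International Society for Artificial Intelligence (formerly the Korean Society of Internet, Broadcasting and Communication), International Journal of Internet, Broadcasting and Communication, Vol. 4, No. 1, February 2012, pp. 13-17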


Abstract

Indexing converts a raw document collection into an easily searchable representation. Web search engines such as Google and Yahoo provide subsecond response times, which are made possible by efficient indexing of web pages across the entire Web. The indexing process becomes challenging as the scale grows. Parallel techniques, such as the MapReduce framework, can assist in efficient large-scale indexing. In this paper we propose PDFindexer, a system for indexing scientific papers in PDF using the MapReduce programming model. Unlike Web search engines, our target domain is scientific papers, which have a predefined structure, such as title, abstract, sections, and references. Our proposed system parses scientific papers in PDF, recreates their structure, and performs efficient distributed indexing with the MapReduce framework on a cluster of nodes. We provide an overview of the system, its components, and the interactions among them. We discuss some issues related to the design of the system and the use of MapReduce in parsing and indexing a large document collection.
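As a rough illustration of the idea in the abstract, the sketch below builds a field-aware inverted index in the map/shuffle/reduce style. It is not the authors' implementation: it runs in a single Python process, not on a MapReduce cluster, and the function names (`map_paper`, `shuffle`, `reduce_postings`, `build_index`), the field names, and the sample papers are all hypothetical. It also assumes the PDFs have already been parsed into per-field text.

```python
import re
from collections import defaultdict


def map_paper(doc_id, fields):
    """Map step: emit ((term, field), doc_id) once per distinct term in each field."""
    for field, text in fields.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            yield (term, field), doc_id


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(key, doc_ids):
    """Reduce step: turn the grouped document ids into a sorted postings list."""
    return key, sorted(set(doc_ids))


def build_index(papers):
    """Run map, shuffle, and reduce in sequence over already-parsed papers."""
    pairs = (
        pair
        for doc_id, fields in papers.items()
        for pair in map_paper(doc_id, fields)
    )
    return dict(reduce_postings(key, ids) for key, ids in shuffle(pairs).items())


# Hypothetical parsed papers: structure (title, abstract) is assumed to be known.
papers = {
    "p1": {"title": "Distributed PDF indexing", "abstract": "Indexing with MapReduce"},
    "p2": {"title": "MapReduce basics", "abstract": "Parallel indexing of documents"},
}
index = build_index(papers)
```

Keying the postings by `(term, field)` is one way to keep the paper structure in the index, so that a query could be restricted to, say, titles only; other layouts are equally plausible.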

Table of Contents

Abstract
1. Introduction
2. Background and Related Works
  A. Information Retrieval
  B. MapReduce framework
3. PDFindexer System Proposal
4. Design Details and Implementation Plan
  A. Preprocessing large-scale PDF articles with MapReduce
  B. Text-indexing of parsed articles with MapReduce
  C. Querying on resulted indices
5. Conclusion and Future Works
References


Author Information

Aziz Murtazaev, Department of Computer Engineering, Ajou University, Korea
Jang-Su Kihm, Department of Computer Engineering, Ajou University, Korea
Sangyoon Oh, Department of Computer Engineering, Ajou University, Korea
