Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/250807
Title: Efficient Hybrid Distributed Document Clustering Method for Large Datasets
Researcher: Judith J.E
Guide(s): Jayakumari J
Keywords: Engineering and Technology, Computer Science, Computer Science Artificial Intelligence
University: Noorul Islam Centre for Higher Education
Completed Date: 10/09/2015
Abstract: The growing volume of data to be analyzed poses new challenges for data mining methodologies. Conventional data mining techniques such as clustering assume centralized operation on data and are computationally expensive in terms of execution time. Clustering of large datasets has received considerable attention in the past few decades in several application areas, such as document categorization and retrieval.
This thesis deals with improving the performance of clustering techniques for large, high-dimensional, distributed document datasets. The challenges addressed are the initial centroids problem and the dimensionality problem. Both are addressed using the Hadoop MapReduce model for distributed storage and analysis. This methodology supports processing of large document datasets, and the thesis builds on it to develop distributed clustering algorithms that address these challenges. The thesis proposes three methods for distributed clustering: MapReduce K-Means (MR-KMeans) based distributed document clustering, a distributed document clustering method based on MapReduce PSO-KMeans (MR-PKMeans), and a hybrid distributed document clustering method (MR-Hybrid). Intensive evaluations are performed, resulting in optimized and semantically related document clusters with high quality and speedup.
In the MapReduce K-Means (MR-KMeans) based distributed document clustering method, the algorithm is modeled with an efficient similarity measure on the Hadoop framework, with the main objective of improving the clustering quality and speedup of the localized clustering solution. This method starts from random initial centroids and converges to locally optimized clusters. The different stages of the clustering process, namely similarity calculation, assignment of documents to clusters, and recalculation of new cluster centroids, are all based on the MapReduce methodology.
Results on large document datasets show that such a framework with an efficient method of determ
Pagination: 151
URI: http://hdl.handle.net/10603/250807
Appears in Departments: Department of Computer Science and Engineering
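
Illustrative note on the abstract: the three MR-KMeans stages it names (similarity calculation, assignment of documents to clusters, recalculation of centroids) can be sketched as one map/reduce round. This is a minimal single-process sketch, not the thesis implementation. It does not use Hadoop, the function names are hypothetical, and cosine similarity is an assumption, since the abstract does not name its similarity measure.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-weight vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_assign(doc, centroids):
    """Map step: emit (index of most similar centroid, document vector)."""
    best = max(range(len(centroids)), key=lambda k: cosine(doc, centroids[k]))
    return best, doc


def reduce_centroid(docs):
    """Reduce step: new centroid is the component-wise mean of its documents."""
    n = len(docs)
    return [sum(column) / n for column in zip(*docs)]


def kmeans_iteration(docs, centroids):
    """One round: map every document, group by cluster key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        key, value = map_assign(doc, centroids)
        groups[key].append(value)
    # A cluster that received no documents keeps its previous centroid.
    return [
        reduce_centroid(groups[k]) if k in groups else list(centroids[k])
        for k in range(len(centroids))
    ]
```

In a Hadoop job the grouping done here by the dictionary would be handled by the shuffle phase, and the round would be repeated until the centroids stop changing.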

Files in This Item:
File | Description | Size | Format
1 front.pdf | Attached File | 183.2 kB | Adobe PDF
3 bonafide certificate.pdf | | 123.26 kB | Adobe PDF
5 acknowledgements.pdf | | 83.39 kB | Adobe PDF
6 table of contents.pdf | | 107.39 kB | Adobe PDF
7 list of tables.pdf | | 162.36 kB | Adobe PDF
8 list of figures.pdf | | 101.75 kB | Adobe PDF
9 list of abbreviations.pdf | | 87.67 kB | Adobe PDF
chapter iii.pdf | | 521.27 kB | Adobe PDF
chapter ii.pdf | | 234.33 kB | Adobe PDF
chapter i.pdf | | 116.23 kB | Adobe PDF
chapter iv.pdf | | 340.27 kB | Adobe PDF
chapter vii.pdf | | 2.43 MB | Adobe PDF
chapter vi.pdf | | 693.33 kB | Adobe PDF
chapter v.pdf | | 413.65 kB | Adobe PDF
references.pdf | | 106.59 kB | Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
