Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/250807
Title: | Efficient Hybrid Distributed Document Clustering Method for Large Datasets |
Researcher: | Judith J.E |
Guide(s): | Jayakumari J |
Keywords: | Engineering and Technology, Computer Science, Computer Science Artificial Intelligence |
University: | Noorul Islam Centre for Higher Education |
Completed Date: | 10/09/2015 |
Abstract: | The growing volume of data to be analyzed poses new challenges for data mining methodologies. Conventional data mining techniques such as clustering assume centralized operation on data and are computationally expensive in terms of execution time. Clustering of large datasets has received considerable attention in the past few decades in several application areas, such as document categorization and retrieval.
This thesis deals with improving the performance of clustering techniques for large, high-dimensional, distributed document datasets. The challenges addressed are the initial-centroid problem and the dimensionality problem. These challenges are addressed with the emerging Hadoop MapReduce model for distributed storage and analysis. This methodology supports processing of large document datasets, and the thesis addresses the challenges above by developing distributed clustering algorithms based on it. The thesis proposes three methods for distributed clustering: a MapReduce K-Means (MR-KMeans) based distributed document clustering method, a distributed document clustering method based on MapReduce PSO-KMeans (MR-PKMeans), and a hybrid distributed document clustering method (MR-Hybrid). Intensive evaluations are performed, resulting in optimized and semantically related document clusters with high quality and speedup.
In the MapReduce K-Means (MR-KMeans) based distributed document clustering method, the algorithm is modeled with an efficient similarity measure on the Hadoop framework, with the main objective of improving the clustering quality and speedup of the localized clustering solution. This method starts from random initial centroids and converges to locally optimized clusters. The stages of the clustering process (similarity calculation, assignment of documents to clusters, and recalculation of the cluster centroids) are all based on the MapReduce methodology.
Results on large document datasets show that such a framework with an efficient method of determ |
Pagination: | 151 |
URI: | http://hdl.handle.net/10603/250807 |
Appears in Departments: | Department of Computer Science and Engineering |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
1 front.pdf | Attached File | 183.2 kB | Adobe PDF | View/Open |
3 bonafide certificate.pdf | | 123.26 kB | Adobe PDF | View/Open |
5 acknowledgements.pdf | | 83.39 kB | Adobe PDF | View/Open |
6 table of contents.pdf | | 107.39 kB | Adobe PDF | View/Open |
7 list of tables.pdf | | 162.36 kB | Adobe PDF | View/Open |
8 list of figures.pdf | | 101.75 kB | Adobe PDF | View/Open |
9 list of abbreviations.pdf | | 87.67 kB | Adobe PDF | View/Open |
chapter iii.pdf | | 521.27 kB | Adobe PDF | View/Open |
chapter ii.pdf | | 234.33 kB | Adobe PDF | View/Open |
chapter i.pdf | | 116.23 kB | Adobe PDF | View/Open |
chapter iv.pdf | | 340.27 kB | Adobe PDF | View/Open |
chapter vii.pdf | | 2.43 MB | Adobe PDF | View/Open |
chapter vi.pdf | | 693.33 kB | Adobe PDF | View/Open |
chapter v.pdf | | 413.65 kB | Adobe PDF | View/Open |
references.pdf | | 106.59 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).