Efficient Hybrid Distributed Document Clustering Method for Large Datasets

Judith J.E

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/250807

Title:	Efficient Hybrid Distributed Document Clustering Method for Large Datasets
Researcher:	Judith J.E
Guide(s):	Jayakumari J
Keywords:	Engineering and Technology, Computer Science, Computer Science Artificial Intelligence
University:	Noorul Islam Centre for Higher Education
Completed Date:	10/09/2015
Abstract:	The growing volume of data to be analyzed poses new challenges for data mining methodologies. Conventional data mining techniques such as clustering assume centralized operation on data and are computationally expensive in terms of execution time. Clustering of large datasets has received considerable attention in the past few decades in application areas such as document categorization and retrieval.
This thesis deals with improving the performance of clustering techniques for large, high-dimensional, distributed document datasets. The challenges addressed are the initial centroids problem and the dimensionality problem. Both are addressed with the Hadoop-MapReduce model for distributed storage and analysis, which supports processing of large document datasets and on which the proposed distributed clustering algorithms are built. The thesis proposes three methods for distributed clustering: MapReduce KMeans (MR-KMeans) based distributed document clustering, a distributed document clustering method based on MapReduce PSO-KMeans (MR-PKMeans), and a hybrid distributed document clustering method (MR-Hybrid). Intensive evaluations are performed, resulting in optimized and semantically related document clusters with high quality and speedup.
In the MapReduce K-Means (MR-KMeans) based distributed document clustering method, the algorithm is modeled with an efficient similarity measure using the Hadoop framework, with the main objective of improving the clustering quality and speedup of a localized clustering solution. This method starts from random initial centroids and converges to locally optimized clusters. The stages of the clustering process (similarity calculation, assignment of documents to clusters, and recalculation of new cluster centroids) are all based on the MapReduce methodology.
Results on large document datasets show that such a framework with an efficient method of determ
Pagination:	151
URI:	http://hdl.handle.net/10603/250807
Appears in Departments:	Department of Computer Science and Engineering
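
The abstract names three MR-KMeans stages: similarity calculation, assignment of documents to clusters, and recalculation of cluster centroids. The sketch below shows how one such round could look in a map/shuffle/reduce shape. It is an illustration only, not the thesis's implementation: it runs in plain Python rather than on Hadoop, it assumes cosine similarity (the abstract does not say which similarity measure is used), and the function names map_assign, reduce_centroid and kmeans_iteration are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def map_assign(doc, centroids):
    """Map step: emit (index of the most similar centroid, document vector)."""
    best = max(range(len(centroids)), key=lambda i: cosine(doc, centroids[i]))
    return best, doc


def reduce_centroid(vectors):
    """Reduce step: the new centroid is the component-wise mean of its members."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans_iteration(docs, centroids):
    """One round: map every document, group by cluster key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        key, value = map_assign(doc, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # Assumption: a cluster that receives no documents keeps its old centroid.
    return [
        reduce_centroid(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
```

In a real Hadoop job the map and reduce functions would run on separate nodes and the current centroids would be distributed to every mapper; here a dictionary plays the role of the shuffle so the data flow is visible in a few lines. Repeating kmeans_iteration until the centroids stop changing gives the locally optimized clusters the abstract refers to.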

Files in This Item:

File	Description	Size	Format
1 front.pdf	Attached File	183.2 kB	Adobe PDF
3 bonafide certificate.pdf		123.26 kB	Adobe PDF
5 acknowledgements.pdf		83.39 kB	Adobe PDF
6 table of contents.pdf		107.39 kB	Adobe PDF
7 list of tables.pdf		162.36 kB	Adobe PDF
8 list of figures.pdf		101.75 kB	Adobe PDF
9 list of abbreviations.pdf		87.67 kB	Adobe PDF
chapter iii.pdf		521.27 kB	Adobe PDF
chapter ii.pdf		234.33 kB	Adobe PDF
chapter i.pdf		116.23 kB	Adobe PDF
chapter iv.pdf		340.27 kB	Adobe PDF
chapter vii.pdf		2.43 MB	Adobe PDF
chapter vi.pdf		693.33 kB	Adobe PDF
chapter v.pdf		413.65 kB	Adobe PDF
references.pdf		106.59 kB	Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Shodhganga : a reservoir of Indian theses @ INFLIBNET