New methodologies for improving the effectiveness of text document clustering

Lakshmi R

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/310744

Title:	New methodologies for improving the effectiveness of text document clustering
Researcher:	Lakshmi R
Guide(s):	Baskar S
Keywords:	Engineering and Technology; Engineering; Engineering Electrical and Electronic; text document clustering
University:	Anna University
Completed Date:	2019
Abstract:	Document categorization is a major challenge in information retrieval and text mining because document collections keep growing day by day. This thesis presents five new methodologies for improving the effectiveness of text document clustering: two document representation models, one method for efficient centroid selection, and two similarity measures. Document representation plays a vital role in the processing of text document collections. The traditional document representation method, the vector space model, uses a bag-of-words to represent a document as a vector of its term (or word) frequencies. In the traditional Term Frequency-Inverse Document Frequency (TF-IDF) vector space model, the term weighting scheme measures the importance of a term in a document collection. This thesis puts forth two novel weighting schemes for vector space models to represent text documents, namely i) a term-weighting scheme for document representation based on Term Frequency-Ranking of Term Frequency (TF-RTF) and ii) a term-weighting scheme for document representation based on Term Frequency-Ranking of Fuzzy logic with Semantic relationship of Terms (TF-RFST). In the TF-RTF document representation model, each term in a document is ranked by its frequency in that document; the rank gives the term's priority, and these priorities are used to represent the document. In the TF-RFST document representation model, each term is represented by its own frequency together with the frequencies of its semantically related terms. Hence, the ranking of each term is based on the combined frequencies of the term and its semantically related terms under a specific weighting scheme. The fuzzy C-means clustering algorithm is used to find the groups of semantically related terms in each document of the collection.
Pagination:	xxii, 142p.
URI:	http://hdl.handle.net/10603/310744
Appears in Departments:	Faculty of Information and Communication Engineering
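
Illustrative note on the abstract: the record does not give the thesis's TF-RTF or TF-RFST formulas, so the following is only a minimal sketch of the two ingredients the abstract names. It shows the classic TF-IDF weight and a per-document ranking of terms by frequency. The function names `tf_idf` and `frequency_ranks`, the use of dense ranking for ties, and the toy documents are assumptions made for illustration, not the author's method.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: raw term count times log(N / document frequency)."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n / df[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


def frequency_ranks(doc):
    """Rank the terms of one document by frequency (rank 1 = most frequent).

    Assumption: ties share a rank (dense ranking); the thesis may differ.
    """
    counts = Counter(doc)
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of = {freq: i + 1 for i, freq in enumerate(distinct)}
    return {term: rank_of[freq] for term, freq in counts.items()}


# Toy, already-tokenized documents (hypothetical data).
docs = [
    "cluster text cluster document".split(),
    "text mining document retrieval".split(),
    "fuzzy cluster cluster cluster text".split(),
]
weights = tf_idf(docs)
ranks = frequency_ranks(docs[2])
```

In this toy collection, "text" occurs in every document, so its TF-IDF weight is zero everywhere, while the frequency ranking of the third document still places it at a definite rank within that document. How the thesis combines term frequency with such ranks is not stated in this record.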

Files in This Item:

File	Description	Size	Format
01_title.pdf	Attached File	24.63 kB	Adobe PDF
02_certificates.pdf		423.24 kB	Adobe PDF
03_abstracts.pdf		83.95 kB	Adobe PDF
04_acknowledgements.pdf		5.61 kB	Adobe PDF
05_contents.pdf		114.34 kB	Adobe PDF
06_listofabbreviations.pdf		77.76 kB	Adobe PDF
07_chapter1.pdf		243.03 kB	Adobe PDF
08_chapter2.pdf		127.32 kB	Adobe PDF
09_chapter3.pdf		746.37 kB	Adobe PDF
10_chapter4.pdf		549.28 kB	Adobe PDF
11_chapter5.pdf		1.24 MB	Adobe PDF
12_chapter6.pdf		941.37 kB	Adobe PDF
13_conclusion.pdf		46.29 kB	Adobe PDF
14_references.pdf		129.57 kB	Adobe PDF
15_listofpublications.pdf		107.59 kB	Adobe PDF
80_recommendation.pdf		156.58 kB	Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Shodhganga : a reservoir of Indian theses @ INFLIBNET