Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/310744
Title: New methodologies for improving the effectiveness of text document clustering
Researcher: Lakshmi R
Guide(s): Baskar S
Keywords: Engineering and Technology; Engineering; Engineering Electrical and Electronic; text document; clustering
University: Anna University
Completed Date: 2019
Abstract: Document categorization is a major challenge in information retrieval and text mining because document collections grow day by day. This thesis proposes five new methodologies for improving the effectiveness of text document clustering: two document representation models, one method for efficient centroid selection, and two similarity measures. Document representation plays a vital role in processing text document collections. The traditional document representation method, the vector space model, uses a bag of words to represent a document as a vector of its term (or word) frequencies. In the traditional Term Frequency-Inverse Document Frequency (TF-IDF) vector space model, the term-weighting scheme measures the importance of a term in a document collection. This thesis puts forth two novel term-weighting schemes for vector space models to represent text documents: i) a term-weighting scheme based on Term Frequency-Ranking of Term Frequency (TF-RTF) and ii) a term-weighting scheme based on Term Frequency-Ranking of Fuzzy logic with Semantic relationship of Terms (TF-RFST). In the TF-RTF document representation model, each term in a document is ranked by its frequency in the document; the rank gives the term's priority in the document, and these priorities are used for document representation. In the TF-RFST document representation model, each term is represented by its own frequency and the frequencies of its semantically related terms. Hence, the ranking of each term is based on the combined frequencies of the term and its semantically related terms, with a specific weighting scheme. The fuzzy C-means clustering algorithm is used to find the groups of semantically related terms in each document of the collection.
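The two building blocks named in the abstract, classical TF-IDF weighting and a per-document ranking of terms by frequency, can be illustrated with a minimal Python sketch. It is not the thesis's TF-RTF or TF-RFST formula, which the abstract does not state. The function names, the toy corpus, the unsmoothed log(N/df) form of IDF, and the dense-rank handling of ties are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classical TF-IDF: raw term count times log(N / document frequency).

    Assumption: unsmoothed IDF; other variants add smoothing or normalisation.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


def frequency_ranks(doc):
    """Rank each term by its frequency within one document (1 = most frequent).

    Assumption: tied terms share a rank (dense ranking). How the thesis turns
    ranks into TF-RTF weights is not given in the abstract and is not shown.
    """
    tf = Counter(doc)
    distinct_counts = sorted(set(tf.values()), reverse=True)
    rank_of = {count: r for r, count in enumerate(distinct_counts, start=1)}
    return {t: rank_of[c] for t, c in tf.items()}


# Hypothetical toy corpus, already tokenised.
docs = [
    "cluster text cluster document".split(),
    "text mining document retrieval".split(),
    "fuzzy cluster cluster cluster text".split(),
]

weights = tf_idf(docs)
ranks = frequency_ranks(docs[0])
```

In this toy corpus, "text" occurs in every document, so its TF-IDF weight is zero everywhere, while its frequency rank within each document still depends only on that document's own counts.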
Pagination: xxii, 142p.
URI: http://hdl.handle.net/10603/310744
Appears in Departments: Faculty of Information and Communication Engineering

Files in This Item:
File                        Description    Size       Format
01_title.pdf                Attached File  24.63 kB   Adobe PDF
02_certificates.pdf                        423.24 kB  Adobe PDF
03_abstracts.pdf                           83.95 kB   Adobe PDF
04_acknowledgements.pdf                    5.61 kB    Adobe PDF
05_contents.pdf                            114.34 kB  Adobe PDF
06_listofabbreviations.pdf                 77.76 kB   Adobe PDF
07_chapter1.pdf                            243.03 kB  Adobe PDF
08_chapter2.pdf                            127.32 kB  Adobe PDF
09_chapter3.pdf                            746.37 kB  Adobe PDF
10_chapter4.pdf                            549.28 kB  Adobe PDF
11_chapter5.pdf                            1.24 MB    Adobe PDF
12_chapter6.pdf                            941.37 kB  Adobe PDF
13_conclusion.pdf                          46.29 kB   Adobe PDF
14_references.pdf                          129.57 kB  Adobe PDF
15_listofpublications.pdf                  107.59 kB  Adobe PDF
80_recommendation.pdf                      156.58 kB  Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).