Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/310744
Title: | New methodologies for improving the effectiveness of text document clustering |
Researcher: | Lakshmi R |
Guide(s): | Baskar S |
Keywords: | Engineering and Technology; Engineering; Engineering Electrical and Electronic; text document clustering |
University: | Anna University |
Completed Date: | 2019 |
Abstract: | Document categorization is a big challenge in information retrieval and text mining because document collections keep growing day by day. This thesis proposes five new methodologies for improving the effectiveness of text document clustering. Of the five, two are document representation models, one is an efficient centroid selection method, and two are similarity measures. Document representation plays a vital role in the processing of text document collections. The traditional document representation method, the vector space model, uses a bag-of-words to represent a document as a vector of its term (or word) frequencies. In the traditional Term Frequency-Inverse Document Frequency (TF-IDF) vector space model, the term weighting scheme measures the importance of a term in a document collection. This thesis puts forth two novel weighting schemes for vector space models to represent text documents, namely i) a term-weighting scheme for document representation based on Term Frequency-Ranking of Term Frequency (TF-RTF) and ii) a term-weighting scheme for document representation based on Term Frequency-Ranking of Fuzzy logic with Semantic relationship of Terms (TF-RFST). The ranking of each term in a document is based on its frequency in the document. This ranking gives each term a priority in the document, and the TF-RTF document representation model uses these priorities to represent the document. In the TF-RFST document representation model, each term is represented based on its frequency and the frequency of its semantically related terms. Hence, the ranking of each term is based on the combined frequencies of the term and its semantically related terms with a specific weighting scheme. The fuzzy C-means clustering algorithm has been used to find the semantically related term groups of each document in the document collection. |
Pagination: | xxii, 142p. |
URI: | http://hdl.handle.net/10603/310744 |
Appears in Departments: | Faculty of Information and Communication Engineering |
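The abstract contrasts the traditional TF-IDF weighting with a ranking of terms by their frequency within a document. The sketch below illustrates only those two ingredients: the textbook TF-IDF weight (raw term frequency times log(N/df)) and a per-document frequency ranking. This record does not state the TF-RTF or TF-RFST formulas, so how ranks are turned into final weights is not implemented here. The function names `tf_idf` and `frequency_ranks`, the dense-ranking convention for ties, and the pre-tokenized input are assumptions made for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Textbook TF-IDF for a list of tokenized documents.

    Returns one dict per document mapping term -> tf * log(N / df),
    where tf is the raw count in that document, N the number of
    documents, and df the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(tokens).items()}
        for tokens in docs
    ]


def frequency_ranks(tokens):
    """Rank the terms of one document by frequency (assumed convention).

    Rank 1 is the most frequent term; terms with equal frequency share
    a rank (dense ranking). This is an illustrative choice, not the
    ranking rule defined in the thesis.
    """
    counts = Counter(tokens)
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of_count = {count: i + 1 for i, count in enumerate(distinct)}
    return {term: rank_of_count[count] for term, count in counts.items()}


if __name__ == "__main__":
    corpus = [
        ["cluster", "text", "cluster", "document"],
        ["text", "mining"],
    ]
    print(tf_idf(corpus))
    print(frequency_ranks(corpus[0]))
```

In this toy corpus, "text" occurs in every document, so its TF-IDF weight is zero, while the frequency ranking depends only on counts inside a single document.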
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
01_title.pdf | Attached File | 24.63 kB | Adobe PDF | View/Open |
02_certificates.pdf | | 423.24 kB | Adobe PDF | View/Open |
03_abstracts.pdf | | 83.95 kB | Adobe PDF | View/Open |
04_acknowledgements.pdf | | 5.61 kB | Adobe PDF | View/Open |
05_contents.pdf | | 114.34 kB | Adobe PDF | View/Open |
06_listofabbreviations.pdf | | 77.76 kB | Adobe PDF | View/Open |
07_chapter1.pdf | | 243.03 kB | Adobe PDF | View/Open |
08_chapter2.pdf | | 127.32 kB | Adobe PDF | View/Open |
09_chapter3.pdf | | 746.37 kB | Adobe PDF | View/Open |
10_chapter4.pdf | | 549.28 kB | Adobe PDF | View/Open |
11_chapter5.pdf | | 1.24 MB | Adobe PDF | View/Open |
12_chapter6.pdf | | 941.37 kB | Adobe PDF | View/Open |
13_conclusion.pdf | | 46.29 kB | Adobe PDF | View/Open |
14_references.pdf | | 129.57 kB | Adobe PDF | View/Open |
15_listofpublications.pdf | | 107.59 kB | Adobe PDF | View/Open |
80_recommendation.pdf | | 156.58 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).