Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/69683
Title: | Document clustering using a fuzzy representation of clusters |
Researcher: | Thaoroijam, Kabita |
Guide(s): | Mahanta, Anjana Kakoti |
Keywords: | Algorithm, Clustering, Complexity, Document, Fuzzy, Pearson, Similarity |
University: | Gauhati University |
Completed Date: | 31/12/2009 |
Abstract: | In recent decades, the volume of text databases has grown rapidly due to the increasing amount of information available in electronic form, such as the WWW, emails, newsgroup messages, Internet news feeds, digital libraries, etc. Clustering can help organize a text collection for efficient browsing and searching, which has made clustering a highly active research area. Document clustering is a subset of the larger field of data clustering, which borrows concepts from the fields of Information Retrieval (IR), Natural Language Processing (NLP), and Machine Learning (ML), among others. In this thesis, we propose a new document clustering algorithm that uses the concepts of fuzzy sets. The proposed algorithm is agglomerative: at any given stage there are small clusters, and the decision at the current stage is to merge the incoming document with the cluster that satisfies a user-specified threshold. The clusters obtained are represented as fuzzy sets over a finite universal set, which provides a compact representation of clusters. A similarity measure based on the fuzzy representation of the clusters is defined. The algorithm requires just one pass through the dataset, and only the compact representations of the clusters are kept in memory at any given time. Our algorithm is incremental and can deal with the dynamic nature of real-world data. Arbitrarily large datasets cannot fit in memory. Several clustering algorithms proposed for large datasets follow a two-phase approach, and we likewise propose a two-phase approach to the problem of clustering large datasets. In the first phase, a single pass over the database is used to produce an in-memory summary of the dataset. In the second phase, the in-memory summary obtained in the first phase is merged based on the concepts of neighbors and links. |
URI: | http://hdl.handle.net/10603/69683 |
Appears in Departments: | Department of Computer Science and Application |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
01_title page.pdf | Attached File | 30.4 kB | Adobe PDF | View/Open |
02_certificate.pdf | Attached File | 23.23 kB | Adobe PDF | View/Open |
03_declaration.pdf | Attached File | 14.83 kB | Adobe PDF | View/Open |
04_content.pdf | Attached File | 66.13 kB | Adobe PDF | View/Open |
05_acknowledgement.pdf | Attached File | 28.32 kB | Adobe PDF | View/Open |
06_abstract.pdf | Attached File | 46.82 kB | Adobe PDF | View/Open |
07_list of tables.pdf | Attached File | 11.84 kB | Adobe PDF | View/Open |
08_list of figures.pdf | Attached File | 9.99 kB | Adobe PDF | View/Open |
09_list of abbreviation.pdf | Attached File | 18.54 kB | Adobe PDF | View/Open |
10_chapter 1.pdf | Attached File | 364.43 kB | Adobe PDF | View/Open |
11_chapter 2.pdf | Attached File | 933.76 kB | Adobe PDF | View/Open |
12_chapter 3.pdf | Attached File | 252.2 kB | Adobe PDF | View/Open |
13_chapter 4.pdf | Attached File | 363.42 kB | Adobe PDF | View/Open |
14_chapter 5.pdf | Attached File | 467.56 kB | Adobe PDF | View/Open |
15_conclusions and further works.pdf | Attached File | 114.1 kB | Adobe PDF | View/Open |
16_appendix a.pdf | Attached File | 99.35 kB | Adobe PDF | View/Open |
17_bibliography.pdf | Attached File | 325.36 kB | Adobe PDF | View/Open |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).