Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/69683
Title: Document clustering using a fuzzy representation of clusters
Researcher: Thaoroijam, Kabita
Guide(s): Mahanta, Anjana Kakoti
Keywords: Algorithm; Clustering; Complexity; Document; Fuzzy; Pearson; Similarity
University: Gauhati University
Completed Date: 31/12/2009
Abstract: In recent decades, the volume of text databases has grown rapidly because of the increasing amount of information available in electronic form, such as the WWW, emails, newsgroup messages, Internet news feeds, and digital libraries. Clustering can help organize a text collection for efficient browsing and searching, and this has been a driving force in making clustering a highly active research area. Document clustering is a subset of the larger field of data clustering, which borrows concepts from Information Retrieval (IR), Natural Language Processing (NLP), and Machine Learning (ML), among others. In this thesis, we propose a new document clustering algorithm that uses the concepts of fuzzy sets. The proposed algorithm is agglomerative: at any given stage there are small clusters, and the decision at that stage is to merge the incoming document with a cluster that satisfies a user-specified threshold. The clusters obtained are represented as fuzzy sets over a finite universal set, which provides a compact representation of the clusters. A similarity measure based on this fuzzy representation is defined. The algorithm requires just one pass through the dataset, and only the compact representations of the clusters are kept in memory at any given time. The algorithm is incremental and can deal with the dynamic nature of real-world data. Arbitrarily large datasets cannot fit in memory, and several clustering algorithms that follow a two-phase approach have been proposed for them. We also propose a two-phase approach to the problem of clustering large datasets. In the first phase, a single pass over the database produces an in-memory summary of the dataset. In the second phase, the in-memory summary obtained in the first phase is merged based on the concepts of neighbors and links.
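Illustrative sketch (not taken from the thesis): the abstract describes a single-pass, threshold-based procedure that keeps only compact fuzzy-set summaries of clusters in memory. The Python below is a minimal sketch of that general idea under stated assumptions. This record does not give the thesis's membership function or similarity measure, so both are stand-ins here: a term's membership in a cluster is assumed to be the fraction of the cluster's documents that contain it, and a document's similarity to a cluster is assumed to be the mean membership of its terms. The names FuzzyCluster, similarity, and single_pass_cluster are hypothetical.

```python
from collections import Counter


class FuzzyCluster:
    """Compact cluster summary: a fuzzy set over the vocabulary.

    Assumption: membership(term) = fraction of the cluster's documents
    that contain the term. The thesis may define membership differently.
    """

    def __init__(self):
        self.size = 0                 # number of documents merged so far
        self.term_counts = Counter()  # term -> number of documents containing it

    def add(self, terms):
        self.size += 1
        self.term_counts.update(set(terms))

    def membership(self, term):
        return self.term_counts[term] / self.size if self.size else 0.0


def similarity(terms, cluster):
    """Assumed measure: mean membership of the document's distinct terms."""
    terms = set(terms)
    if not terms:
        return 0.0
    return sum(cluster.membership(t) for t in terms) / len(terms)


def single_pass_cluster(documents, threshold):
    """Read each document once; merge it into the most similar cluster
    if that similarity reaches the threshold, else start a new cluster."""
    clusters, labels = [], []
    for terms in documents:
        best_index, best_score = -1, 0.0
        for i, cluster in enumerate(clusters):
            score = similarity(terms, cluster)
            if score > best_score:
                best_index, best_score = i, score
        if best_index < 0 or best_score < threshold:
            clusters.append(FuzzyCluster())
            best_index = len(clusters) - 1
        clusters[best_index].add(terms)
        labels.append(best_index)
    return clusters, labels


docs = [
    ["fuzzy", "cluster"],
    ["fuzzy", "cluster", "set"],
    ["stock", "market"],
    ["market", "price"],
]
clusters, labels = single_pass_cluster(docs, threshold=0.5)
print(labels)
```

Only the per-cluster term counts are retained between documents, which mirrors the abstract's point that the original documents need not stay in memory; the threshold value of 0.5 is arbitrary and chosen only for the toy data.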
Pagination: 
URI: http://hdl.handle.net/10603/69683
Appears in Departments: Department of Computer Science and Application

Files in This Item:
File                                    Description    Size       Format
01_title page.pdf                       Attached File  30.4 kB    Adobe PDF
02_certificate.pdf                                     23.23 kB   Adobe PDF
03_declaration.pdf                                     14.83 kB   Adobe PDF
04_content.pdf                                         66.13 kB   Adobe PDF
05_acknowledgement.pdf                                 28.32 kB   Adobe PDF
06_abstract.pdf                                        46.82 kB   Adobe PDF
07_list of tables.pdf                                  11.84 kB   Adobe PDF
08_list of figures.pdf                                 9.99 kB    Adobe PDF
09_list of abbreviation.pdf                            18.54 kB   Adobe PDF
10_chapter 1.pdf                                       364.43 kB  Adobe PDF
11_chapter 2.pdf                                       933.76 kB  Adobe PDF
12_chapter 3.pdf                                       252.2 kB   Adobe PDF
13_chapter 4.pdf                                       363.42 kB  Adobe PDF
14_chapter 5.pdf                                       467.56 kB  Adobe PDF
15_conclusions and further works.pdf                   114.1 kB   Adobe PDF
16_appendix a.pdf                                      99.35 kB   Adobe PDF
17_bibliography.pdf                                    325.36 kB  Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
