Document clustering using a fuzzy representation of clusters

Thaoroijam, Kabita

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/69683

Title:	Document clustering using a fuzzy representation of clusters
Researcher:	Thaoroijam, Kabita
Guide(s):	Mahanta, Anjana Kakoti
Keywords:	Algorithm Clustering Complexity Document Fuzzy Pearson Similarity
University:	Gauhati University
Completed Date:	31/12/2009
Abstract:	In recent decades, the volume of text databases has grown rapidly owing to the increasing amount of information available in electronic form, such as the WWW, emails, newsgroup messages, Internet news feeds, and digital libraries. Clustering can help organize a text collection for efficient browsing and searching, which has made clustering a highly active research area. Document clustering is a subset of the larger field of data clustering, which borrows concepts from the fields of Information Retrieval (IR), Natural Language Processing (NLP), and Machine Learning (ML), among others. In this thesis, we propose a new document clustering algorithm that uses the concepts of fuzzy sets. The proposed algorithm is agglomerative: at any given stage there are small clusters, and the decision at the current stage is to merge the incoming document with the cluster that satisfies a user-specified threshold. The clusters obtained are represented as fuzzy sets over a finite universal set, which provides a compact representation of clusters. A similarity measure based on the fuzzy representation of the clusters is defined. The algorithm requires just one pass through the dataset, and only the compact representations of the clusters are kept in memory at any given time. Our algorithm is incremental and can deal with the dynamic nature of real-world data. Arbitrarily large datasets cannot fit in memory. Several clustering algorithms that follow a two-phase approach have been proposed for large datasets. We propose a two-phase approach to the clustering problem for large datasets. In the first phase, a single pass over the database is used to produce an in-memory summary of the data set. In the second phase, the in-memory summary obtained in the first phase is merged based on the concepts of neighbors and links.
Pagination:
URI:	http://hdl.handle.net/10603/69683
Appears in Departments:	Department of Computer Science and Application
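
Illustrative sketch (not taken from the thesis): the single-pass, threshold-based procedure outlined in the abstract might look roughly like the Python below. The record does not give the thesis's membership function or similarity measure, so both are assumptions here: a term's membership degree is taken to be the fraction of a cluster's documents that contain it, and similarity is the mean membership of a document's distinct terms. Joining the best-matching cluster that meets the threshold is also an assumption. The names FuzzyCluster, similarity, and single_pass_cluster are hypothetical.

```python
from collections import Counter


class FuzzyCluster:
    """Compact cluster summary: per-term document counts plus cluster size."""

    def __init__(self):
        self.term_counts = Counter()
        self.size = 0

    def membership(self, term):
        # Assumed membership degree in [0, 1]: fraction of the cluster's
        # documents that contain the term.
        return self.term_counts[term] / self.size if self.size else 0.0

    def add(self, terms):
        # Count each distinct term once per document.
        self.term_counts.update(set(terms))
        self.size += 1


def similarity(terms, cluster):
    # Assumed measure: mean membership of the document's distinct terms.
    distinct = set(terms)
    if not distinct:
        return 0.0
    return sum(cluster.membership(t) for t in distinct) / len(distinct)


def single_pass_cluster(documents, threshold):
    """Assign each document, in arrival order, to the most similar cluster
    whose similarity meets the threshold; otherwise start a new cluster."""
    clusters, labels = [], []
    for terms in documents:
        best_idx, best_sim = None, 0.0
        for i, cluster in enumerate(clusters):
            sim = similarity(terms, cluster)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is None or best_sim < threshold:
            clusters.append(FuzzyCluster())
            best_idx = len(clusters) - 1
        clusters[best_idx].add(terms)
        labels.append(best_idx)
    return labels, clusters


docs = [
    ["fuzzy", "cluster"],
    ["fuzzy", "cluster", "set"],
    ["soccer", "goal"],
    ["goal", "soccer", "match"],
]
labels, clusters = single_pass_cluster(docs, threshold=0.5)
print(labels)
```

Only the cluster summaries are held in memory, and each document is read once, which mirrors the one-pass, incremental behaviour the abstract describes; the second, link-based merging phase is not sketched here.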

Files in This Item:

File	Description	Size	Format
01_title page.pdf	Attached File	30.4 kB	Adobe PDF
02_certificate.pdf		23.23 kB	Adobe PDF
03_declaration.pdf		14.83 kB	Adobe PDF
04_content.pdf		66.13 kB	Adobe PDF
05_acknowledgement.pdf		28.32 kB	Adobe PDF
06_abstract.pdf		46.82 kB	Adobe PDF
07_list of tables.pdf		11.84 kB	Adobe PDF
08_list of figures.pdf		9.99 kB	Adobe PDF
09_list of abbreviation.pdf		18.54 kB	Adobe PDF
10_chapter 1.pdf		364.43 kB	Adobe PDF
11_chapter 2.pdf		933.76 kB	Adobe PDF
12_chapter 3.pdf		252.2 kB	Adobe PDF
13_chapter 4.pdf		363.42 kB	Adobe PDF
14_chapter 5.pdf		467.56 kB	Adobe PDF
15_conclusions and further works.pdf		114.1 kB	Adobe PDF
16_appendix a.pdf		99.35 kB	Adobe PDF
17_bibliography.pdf		325.36 kB	Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).


Shodhganga : a reservoir of Indian theses @ INFLIBNET