Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/425060
Title: Efficient Text Clustering Techniques for Big Datasets
Researcher: Mehta, Vivek
Guide(s): Bawa, Seema and Singh, Jasmeet
Keywords: Big Datasets
Computer Science
Computer Science Interdisciplinary Applications
Document clustering
Engineering and Technology
High Dimensionality
Unsupervised Learning
Word Embeddings
University: Thapar Institute of Engineering and Technology
Completed Date: 2021
Abstract: Clustering is regarded as one of the most important tools for data analysis, especially when label information is not available. It partitions a collection of data points into groups such that the points within each group are as similar as possible. A big dataset is, in general, characterized by several complexities, including high dimensionality. In textual datasets specifically, high dimensionality poses a great challenge for clustering as well as for other text mining tasks. In a textual dataset, the number of unique words across the whole corpus (set of documents) becomes the dimensionality of the dataset. Hence, the number of dimensions can range from tens of thousands to a few million for a dataset containing some thousands of documents. In addition, the matrix representation of such datasets becomes very sparse (containing a large number of zeros). These challenges make traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, unsuitable for such high-dimensional and sparse data; in some cases, they fail to perform clustering at all. Another important challenge with textual datasets is incorporating the semantics (meaning) of the text while forming clusters. Several semantic-based text clustering techniques in the literature consider semantics and, to some extent, attempt to reduce the high dimensionality problem. Still, there remains a crucial need for text clustering techniques that can scale to the high dimensionality of large textual datasets. This thesis proposes text clustering techniques that attempt to address these challenges simultaneously. The first proposed technique, named Stamantic Clustering, is based on lexical chains (groups of semantically related words) and WordNet (a lexical database for English).
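Illustration: the abstract's point that the vocabulary size becomes the dimensionality, and that the resulting matrix is mostly zeros, can be seen on a toy corpus. The sketch below is only an illustration under stated assumptions, not the thesis's method: the three sample documents, the whitespace tokenization, and the helper names `build_term_matrix` and `sparsity` are all hypothetical and chosen for brevity.

```python
# Toy corpus: three short "documents" (illustrative data, not from the thesis).
docs = [
    "clustering groups similar documents",
    "text clustering faces high dimensionality",
    "sparse matrices contain many zeros",
]


def build_term_matrix(documents):
    """Return (vocabulary, matrix) for a simple bag-of-words count model.

    Each row is one document; each column is one unique word in the corpus,
    so the number of columns equals the dimensionality of the dataset.
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    column = {word: i for i, word in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            row[column[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of entries in the matrix that are zero."""
    zeros = sum(row.count(0) for row in matrix)
    total = sum(len(row) for row in matrix)
    return zeros / total


vocabulary, matrix = build_term_matrix(docs)
print("dimensions (unique words):", len(vocabulary))
print("sparsity (fraction of zeros):", round(sparsity(matrix), 3))
```

Even with three documents, each row uses only a handful of the 13 vocabulary columns; with thousands of documents the vocabulary grows far faster than the length of any single document, which is why the matrix becomes both very wide and very sparse.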
Pagination: xiv, 115p.
URI: http://hdl.handle.net/10603/425060
Appears in Departments: Department of Computer Science and Engineering



Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
