Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/425060
Full metadata record
DC Field | Value | Language
dc.date.accessioned | 2022-12-13T06:28:19Z | -
dc.date.available | 2022-12-13T06:28:19Z | -
dc.identifier.uri | http://hdl.handle.net/10603/425060 | -
dc.description.abstract | Clustering is regarded as one of the most important tools for data analysis, especially when label information is not available. It partitions a collection of data points into groups such that the points within each group are as similar to one another as possible. A big dataset, in general, is characterized by several complexities, including high dimensionality. In textual datasets specifically, high dimensionality poses a great challenge for clustering as well as for other text mining tasks. In a textual dataset, the number of unique words across the whole corpus (set of documents) becomes the dimensionality of the dataset. Hence, the number of dimensions can range from tens of thousands to a few million for a dataset containing some thousands of documents. In addition, the matrix representation of such datasets becomes very sparse (containing a large number of zeros). These challenges make traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, unsuitable for such high-dimensional and sparse data. In some cases, they even fail to perform clustering. Another important challenge for textual datasets is to include the semantics (meaning) of the text while forming clusters. Several semantic-based text clustering techniques have been defined in the literature; they consider semantics and, to some extent, attempt to reduce the high dimensionality problem. Still, there is a crucial need for text clustering techniques that can scale to the high dimensionality of large textual datasets. In this thesis, text clustering techniques are proposed that attempt to solve the aforementioned challenges simultaneously. The first proposed technique is named Stamantic Clustering, which is based on lexical chains (groups of semantically related words) and WordNet (a lexical database for English). | -
dc.format.extent | xiv, 115p. | -
dc.language | English | -
dc.rights | university | -
dc.title | Efficient Text Clustering Techniques for Big Datasets | -
dc.creator.researcher | Mehta, Vivek | -
dc.subject.keyword | Big Datasets | -
dc.subject.keyword | Computer Science | -
dc.subject.keyword | Computer Science Interdisciplinary Applications | -
dc.subject.keyword | Document clustering | -
dc.subject.keyword | Engineering and Technology | -
dc.subject.keyword | High Dimensionality | -
dc.subject.keyword | Unsupervised Learning | -
dc.subject.keyword | Word Embeddings | -
dc.contributor.guide | Bawa, Seema and Singh, Jasmeet | -
dc.publisher.place | Patiala | -
dc.publisher.university | Thapar Institute of Engineering and Technology | -
dc.publisher.institution | Department of Computer Science and Engineering | -
dc.date.completed | 2021 | -
dc.date.awarded | 2021 | -
dc.format.accompanyingmaterial | None | -
dc.source.university | University | -
dc.type.degree | Ph.D. | -
Appears in Departments: Department of Computer Science and Engineering
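
Illustrative note on the abstract: the claim that the number of unique words becomes the dimensionality, and that the resulting matrix is mostly zeros, can be seen on a toy example. The sketch below is not taken from the thesis; the three documents, the whitespace tokenization, and the helper names `build_term_document_matrix` and `sparsity` are all assumptions made for illustration, using a plain bag-of-words count representation.

```python
# Toy corpus: hypothetical documents, not data from the thesis.
docs = [
    "clustering groups similar documents",
    "text clustering needs semantics",
    "sparse matrices contain many zeros",
]


def build_term_document_matrix(documents):
    """Return (vocabulary, matrix) for a simple bag-of-words count model.

    Each row is one document; each column is one unique word, so the
    number of columns equals the vocabulary size (the dimensionality).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for word in tokens:
            row[index[word]] += 1
        matrix.append(row)
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of matrix entries that are zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


vocabulary, matrix = build_term_document_matrix(docs)
print("dimensions:", len(vocabulary))
print("sparsity:", round(sparsity(matrix), 3))
```

Even with three short documents, most entries are zero because each document uses only a small part of the shared vocabulary; with thousands of documents the vocabulary grows much faster than the length of any single document, which is the effect the abstract describes.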



Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
