Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/343259
Title: A novel approach for duplicate elimination and effective topic modeling for document clustering
Researcher: Uma R
Guide(s): Latha B
Keywords: Engineering and Technology
Computer Science
Computer Science Information Systems
Document Clustering
Duplicate Elimination
Information Retrieval
Sub Topic Model
University: Anna University
Completed Date: 2020
Abstract: Information available on the web is increasing at a fast pace; within the last two years there has been explosive growth in internet information. A large share of the information available on the web is textual. Textual information plays a vital part in information retrieval (IR) and is probably the most useful kind of information. Web search is becoming dominant because of the richness of the information available and the convenience of obtaining it. Web search is rooted in IR, a field of study that helps users find the required information in a large corpus of documents. Documents on the web are called web pages. Relevancy and efficiency are the ultimate issues in web search. Web pages are semi-structured in nature: the content of a page is organized and presented in multiple structured blocks. Some blocks contain vital information and others do not. Actively detecting the main content blocks of a web page is useful in web search because the terms found in those blocks are more important. Users face great difficulty in identifying relevant information, and existing approaches need to improve their accuracy in terms of relevancy. Information retrieval is a way to separate relevant data from irrelevant data. Documents on the web are available in different formats. Conventional information retrieval methods operate on clean text; if there is noise in the data, it has to be cleaned for efficient retrieval. This research work takes the initiative to increase retrieval accuracy and relevancy and to improve retrieval performance for text documents. To attain these goals, the search space has to be reduced and the underlying semantics need to be identified.
Pagination: xviii, 151p.
URI: http://hdl.handle.net/10603/343259
Appears in Departments: Faculty of Information and Communication Engineering

Files in This Item:
File                         Description    Size       Format
01_title.pdf                 Attached File  24.12 kB   Adobe PDF
02_certificates.pdf                         563.93 kB  Adobe PDF
03_abstracts.pdf                            14.14 kB   Adobe PDF
04_acknowledgements.pdf                     456.88 kB  Adobe PDF
05_contents.pdf                             15.32 kB   Adobe PDF
06_listoftables.pdf                         10.04 kB   Adobe PDF
07_listoffigures.pdf                        16.59 kB   Adobe PDF
08_listofabbreviations.pdf                  12.33 kB   Adobe PDF
09_chapter1.pdf                             434.78 kB  Adobe PDF
10_chapter2.pdf                             466.36 kB  Adobe PDF
11_chapter3.pdf                             511.47 kB  Adobe PDF
12_chapter4.pdf                             1.73 MB    Adobe PDF
13_chapter5.pdf                             1.13 MB    Adobe PDF
14_conclusion.pdf                           22.45 kB   Adobe PDF
15_references.pdf                           557.88 kB  Adobe PDF
16_listofpublications.pdf                   326.01 kB  Adobe PDF
80_recommendation.pdf                       82.65 kB   Adobe PDF


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
