Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/354103
Title: Web Based Content Extraction and Retrieval Using Natural Language
Researcher: Sarada Devi Chillakuru
Guide(s): Kumanan T
Keywords: Computer Science
Computer Science Interdisciplinary Applications
Engineering and Technology
University: Meenakshi Academy of Higher Education and Research
Completed Date: 2019
Abstract: The Internet has become a key communication and information medium for various organizations. It provides access to a great variety of information stored in different parts of the world. In recent years, the growth of the World Wide Web has exceeded all expectations. Today, several billion HTML documents, pictures and other multimedia files are available via the Internet, and the number is still rising. Mining Web data is one of the most challenging tasks for data mining and data management researchers, because the Web holds huge volumes of heterogeneous, loosely structured data and users can easily be overwhelmed by it. To address these problems, the first phase presents a distributed framework for crawling web documents and running NLP tasks in parallel. NLP tasks such as text annotation and feature extraction are computationally demanding, so they benefit from this parallel architecture. An HTML parser is used for cleaning pages and producing well-formed XHTML, and MapReduce is used to extract information from the web. However, the HTML parser has a few significant drawbacks, such as its static nature, its inability to render content in an aesthetically pleasing way, its well-known compatibility issues and its overall complexity. To overcome these issues, a DOM tree parser is used in the second phase. The DOM (Document Object Model) is an application programming interface for HTML and XML documents, in which the web page is parsed into a DOM tree. This method uses two indicators to describe the characteristics of web pages, text density and hyperlink density, and applies fuzzy c-means clustering to measure the similarity of contents. POS tagging is also used to capture the meaning of the content; based on the tags, the content of the web page is extracted using MapReduce.
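
A minimal sketch of the two page-level indicators named in the abstract, text density and hyperlink density, computed per node after parsing a page into a DOM tree. This is not the thesis implementation: it assumes BeautifulSoup (bs4) for DOM parsing, and the traversal, formulas and interpretation are illustrative only.

    from bs4 import BeautifulSoup

    def node_densities(html: str):
        """Return (tag name, text density, hyperlink density) for every DOM node."""
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for node in soup.find_all(True):  # every element node in the DOM tree
            text = node.get_text(separator=" ", strip=True)
            text_len = len(text)
            tag_count = len(node.find_all(True)) + 1  # descendant tags plus the node itself
            link_text_len = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))

            text_density = text_len / tag_count                           # characters of text per tag
            link_density = link_text_len / text_len if text_len else 0.0  # share of text inside <a> links

            results.append((node.name, text_density, link_density))
        return results

Nodes with high text density and low hyperlink density are candidate main-content blocks; in the thesis these indicators feed a fuzzy c-means clustering step that groups similar content, with POS tagging applied before the MapReduce extraction stage.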
Pagination: xiii 127
URI: http://hdl.handle.net/10603/354103
Appears in Departments: Department of Engineering

Files in This Item:
File                              Description      Size        Format
01_title.pdf                      Attached File    409.02 kB   Adobe PDF
02_certificate.pdf                                 120.23 kB   Adobe PDF
03_declaration.pdf                                 120.73 kB   Adobe PDF
04_chapter 1.pdf                                   799.61 kB   Adobe PDF
05_chapter 2.pdf                                   475.37 kB   Adobe PDF
06_chapter 3.pdf                                   1.7 MB      Adobe PDF
07_chapter 4.pdf                                   1.34 MB     Adobe PDF
08_chapter 5.pdf                                   353.86 kB   Adobe PDF
09_bibilography.pdf                                739.97 kB   Adobe PDF
10_annexure.pdf                                    378.51 kB   Adobe PDF
11_content.pdf                                     381.74 kB   Adobe PDF
12_list of graph and table.pdf                     367.39 kB   Adobe PDF
80_recommendation.pdf                              412.41 kB   Adobe PDF


Items in Shodhganga are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0).
