Web Based Content Extraction and Retrieval Using Natural Language

Sarada Devi Chillakuru

Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/354103

Title:	Web Based Content Extraction and Retrieval Using Natural Language
Researcher:	Sarada Devi Chillakuru
Guide(s):	Kumanan T
Keywords:	Computer Science; Computer Science Interdisciplinary Applications; Engineering and Technology
University:	Meenakshi Academy of Higher Education and Research
Completed Date:	2019
Abstract:	The Internet has become a key communication and information medium for organizations, giving access to a great variety of information stored in different parts of the world. In recent years, the growth of the World Wide Web has exceeded all expectations: several billion HTML documents, pictures and other multimedia files are available via the Internet, and the number is still rising. Mining Web data is one of the most challenging tasks for data mining and data management researchers because the data on the Web is huge, heterogeneous and loosely structured, and users can easily be overwhelmed by it. To address these problems, the first phase of this work presents a discrete framework for crawling web documents and running NLP tasks in parallel. NLP can be used for text annotation and initial feature extraction from the application domain; because these tasks have high computational requirements, they can benefit from parallel architectures. An HTML parser is also used to clean the pages and produce well-formed XHTML, and MapReduce is used to extract information from the web. However, the HTML parser has a few significant drawbacks, such as its static nature, its inability to render content in an aesthetically pleasing way, its well-known compatibility issues and its overall complexity. To overcome these issues, the second phase uses a DOM tree parser. The DOM (Document Object Model) is an application programming interface for HTML and XML documents, and the web page is parsed into a DOM tree. Two indicators, text density and hyperlink density, describe the characteristics of web pages, and fuzzy c-means clustering is used to measure the similarity of contents. POS tagging is also used to capture the meaning of the content, and based on the tags, the content of the web page is extracted using MapReduce.
Pagination:	xiii 127
URI:	http://hdl.handle.net/10603/354103
Appears in Departments:	Department of Engineering
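
The two indicators named in the abstract, text density and hyperlink density, can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so the definitions here are assumptions: text density is taken as text characters per tag in a subtree, and hyperlink density as the share of text characters that sit inside <a> elements. The names Node, TreeBuilder, stats, block_densities and SAMPLE are hypothetical, and the tree is built with Python's standard html.parser rather than a full DOM implementation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM-like tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.text_chars += len(data.strip())


def stats(node, inside_link=False):
    """Return (text_chars, link_chars, tag_count) for the subtree at node."""
    inside_link = inside_link or node.tag == "a"
    chars = node.text_chars
    link_chars = node.text_chars if inside_link else 0
    tags = 1
    for child in node.children:
        c, l, t = stats(child, inside_link)
        chars += c
        link_chars += l
        tags += t
    return chars, link_chars, tags


def block_densities(html):
    """Return (tag, text_density, link_density) for each top-level element.

    Assumed definitions (not taken from the thesis):
      text_density = text characters / number of tags in the subtree
      link_density = characters inside <a> elements / text characters
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []
    for child in builder.root.children:
        chars, link_chars, tags = stats(child)
        text_density = chars / tags
        link_density = link_chars / chars if chars else 0.0
        rows.append((child.tag, text_density, link_density))
    return rows


# Hypothetical page: a navigation block followed by a content block.
SAMPLE = ('<div id="nav"><a href="/a">Home</a><a href="/b">News</a></div>'
          '<div id="main"><p>This is the main article text of the page.</p></div>')

for tag, text_density, link_density in block_densities(SAMPLE):
    print(tag, round(text_density, 2), round(link_density, 2))
```

Under these assumed definitions, the navigation block scores low on text density and high on hyperlink density, while the content block shows the opposite pattern. Pairs of values like these are the kind of features a clustering step such as fuzzy c-means could group.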

Files in This Item:

File	Description	Size	Format
01_title.pdf	Attached File	409.02 kB	Adobe PDF
02_certificate.pdf		120.23 kB	Adobe PDF
03_declaration.pdf		120.73 kB	Adobe PDF
04_chapter 1.pdf		799.61 kB	Adobe PDF
05_chapter 2.pdf		475.37 kB	Adobe PDF
06_chapter 3.pdf		1.7 MB	Adobe PDF
07_chapter 4.pdf		1.34 MB	Adobe PDF
08_chapter 5.pdf		353.86 kB	Adobe PDF
09_bibilography.pdf		739.97 kB	Adobe PDF
10_annexure.pdf		378.51 kB	Adobe PDF
11_content.pdf		381.74 kB	Adobe PDF
12_list of graph and table.pdf		367.39 kB	Adobe PDF
80_recommendation.pdf		412.41 kB	Adobe PDF

Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Shodhganga : a reservoir of Indian theses @ INFLIBNET