Please use this identifier to cite or link to this item:
http://hdl.handle.net/10603/354103
Title: | Web Based Content Extraction and Retrieval Using Natural Language |
Researcher: | Sarada Devi Chillakuru |
Guide(s): | Kumanan T |
Keywords: | Computer Science; Computer Science Interdisciplinary Applications; Engineering and Technology |
University: | Meenakshi Academy of Higher Education and Research |
Completed Date: | 2019 |
Abstract: | The Internet has become a key communication and information medium for various organizations. It provides access to a great variety of information stored in different parts of the world. In recent years, the growth of the World Wide Web has exceeded all expectations. Today, several billion HTML documents, pictures and other multimedia files are available via the Internet, and the number is still rising. Mining Web data is one of the most challenging tasks for data mining and data management scholars, because the Web holds a huge volume of heterogeneous, loosely structured data and users can easily be overwhelmed by it. To address these problems, the first phase of this work presents a discrete framework for running NLP tasks in parallel and for crawling web documents. NLP is used for text explanation and initial feature extraction from the request area, tasks with high computational demands that can therefore benefit from a parallel architecture. An HTML parser is also used to clean the documents and produce well-formed XHTML. In this work, MapReduce is used to extract information from the web. However, the HTML parser has a few significant drawbacks, such as its static nature, its inability to render content in an aesthetically pleasing way, its well-known compatibility issues and its overall complexity. To overcome these issues, a DOM tree parser is used in the second phase. DOM (Document Object Model) is the application programming interface for HTML and XML, and the web page is parsed into a DOM tree. This method uses two indicators to describe the characteristics of web pages, text density and hyperlink density, and uses fuzzy c-means clustering to measure the similarity of contents. POS tagging is also used to determine the meaning of the content. Based on the tagging, the content of the web page is extracted using MapReduce. |
Pagination: | xiii 127 |
URI: | http://hdl.handle.net/10603/354103 |
Appears in Departments: | Department of Engineering |
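The abstract names text density and hyperlink density as the two indicators computed over the DOM tree, but it does not define them. The following is a minimal sketch under stated assumptions, not the thesis's implementation: text density is taken as text characters per tag in a subtree, and hyperlink density as the share of those characters that sit inside `<a>` elements. The names `Node`, `TreeBuilder`, `subtree_stats` and `block_densities` are hypothetical and introduced only for illustration; the parser is Python's standard `html.parser`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper type)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_chars = 0  # characters of text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text_chars += len(data.strip())


def subtree_stats(node, inside_link=False):
    """Return (text_chars, link_chars, tag_count) for the subtree rooted at node."""
    inside_link = inside_link or node.tag == "a"
    chars = node.text_chars
    link_chars = node.text_chars if inside_link else 0
    tags = 1
    for child in node.children:
        c, l, t = subtree_stats(child, inside_link)
        chars += c
        link_chars += l
        tags += t
    return chars, link_chars, tags


def block_densities(html):
    """Return [(tag, text_density, hyperlink_density)] for each top-level element."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    results = []
    for block in builder.root.children:
        chars, link_chars, tags = subtree_stats(block)
        text_density = chars / tags                      # assumed definition
        link_density = link_chars / chars if chars else 0.0  # assumed definition
        results.append((block.tag, text_density, link_density))
    return results


if __name__ == "__main__":
    page = ('<div><a href="/a">Home</a><a href="/b">News</a></div>'
            '<div><p>This paragraph holds the main article text.</p></div>')
    for tag, td, ld in block_densities(page):
        print(tag, round(td, 2), round(ld, 2))
```

Under these assumed definitions, a navigation block tends to show low text density and high hyperlink density, while a content block shows the opposite. Per-block values of this kind could then serve as the features passed to a clustering step such as the fuzzy c-means method the abstract mentions.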
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
01_title.pdf | Attached File | 409.02 kB | Adobe PDF |
02_certificate.pdf | | 120.23 kB | Adobe PDF |
03_declaration.pdf | | 120.73 kB | Adobe PDF |
04_chapter 1.pdf | | 799.61 kB | Adobe PDF |
05_chapter 2.pdf | | 475.37 kB | Adobe PDF |
06_chapter 3.pdf | | 1.7 MB | Adobe PDF |
07_chapter 4.pdf | | 1.34 MB | Adobe PDF |
08_chapter 5.pdf | | 353.86 kB | Adobe PDF |
09_bibilography.pdf | | 739.97 kB | Adobe PDF |
10_annexure.pdf | | 378.51 kB | Adobe PDF |
11_content.pdf | | 381.74 kB | Adobe PDF |
12_list of graph and table.pdf | | 367.39 kB | Adobe PDF |
80_recommendation.pdf | | 412.41 kB | Adobe PDF |
Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).