Please use this identifier to cite or link to this item: http://hdl.handle.net/10603/92173
Title: Design and Implementation of Parallel Hidden Web Crawler
Researcher: Sonali Gupta
Guide(s): Dr. Komal Kumar Bhatia
Keywords: Web Crawler
University: YMCA University of Science and Technology
Completed Date: 2016
Abstract: The World Wide Web (WWW) is the largest repository of information, covering data from almost all areas known to mankind, and is the most frequently and publicly accessed source of information. The information on the WWW comprises hypertext markup language (HTML) documents interconnected through hyperlinks. The Surface Web, or Publicly Indexable Web (PIW), includes the content that can be accessed purely by following the hyperlink structure and can therefore be crawled and indexed by popular search engines. The Hidden Web, on the other hand, refers to content that is stored in Web databases and distributed through dynamically created web pages. These dynamic web pages are generated from the results retrieved in response to queries specified at the interface offered by the underlying web database.

Crawling the contents of the Hidden Web is a very challenging problem, mainly because of its scale and the restricted search interfaces offered by the Web databases. To address the issue of scale, this work proposes a parallel architecture for a Hidden Web crawler, which scales better than a single-process crawler architecture. The proposed crawler is also designed to automatically extract and integrate the search environment by modelling the search forms and filling them in to retrieve the Hidden Web contents from databases in different domains such as Books, Travel, and Auto. However, when multiple instances of the crawler run in parallel, the same web document might be downloaded multiple times, as one instance of the crawler may not be aware that another has already downloaded the page. It is therefore very important to minimize such duplicate downloads to save network bandwidth and increase the crawler's effectiveness; the parallel processes must be coordinated to minimize overlap. Yet coordination between individual crawling processes requires communication, which itself consumes network bandwidth. So, an important objective is to minimize overlap among the parallel crawling processes while keeping this coordination overhead low.
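One common way to reduce overlap among parallel crawler instances without heavy inter-process communication is static partitioning of the URL space, for example by hashing each URL to the process responsible for it. The sketch below is illustrative only and is not taken from the thesis; the function names, the number of crawler processes, and the example URL are all assumptions.

    # Illustrative sketch (not the thesis implementation): assign each URL to
    # exactly one of several parallel crawler processes by hashing, so that a
    # page is downloaded only once and no coordination messages are needed.
    import hashlib

    NUM_CRAWLERS = 4  # assumed number of parallel crawler processes

    def assign_crawler(url: str, num_crawlers: int = NUM_CRAWLERS) -> int:
        """Map a URL to the index of the crawler process that owns it."""
        digest = hashlib.md5(url.encode("utf-8")).hexdigest()
        return int(digest, 16) % num_crawlers

    def should_download(url: str, my_id: int) -> bool:
        """A process downloads a URL only if the URL hashes to it,
        avoiding duplicate downloads without exchanging messages."""
        return assign_crawler(url) == my_id

    if __name__ == "__main__":
        # Hypothetical dynamically generated result page from a Books database.
        url = "http://example.com/books/search?author=smith"
        print(assign_crawler(url), should_download(url, 2))

The trade-off such a scheme illustrates is the one raised in the abstract: stricter coordination reduces overlap but consumes bandwidth, whereas a static partition needs no messages at run time at the cost of less flexible load balancing.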
URI: http://hdl.handle.net/10603/92173
Appears in Departments:Department of Computer Engineering

Files in This Item:
File: table of contents.docx
Description: Attached File
Size: 11.39 kB
Format: Microsoft Word XML


Items in Shodhganga are licensed under Creative Commons Licence Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).