Crawling the Hidden Web

Roma, Italy, 2001 | Sriram Raghavan, Hector Garcia-Molina
This paper presents a method for crawling the hidden Web: content that conventional crawlers cannot reach and that standard search engines therefore do not index. Current crawlers cover only the publicly indexable Web, that is, pages reachable by following hypertext links. A significant portion of Web content, however, sits behind search forms or requires authorization or registration to access. The authors introduce a prototype crawler called HiWE (Hidden Web Exposer) that crawls and extracts content from hidden Web databases. They also propose a new technique called LITE (Layout-based Information Extraction Technique) for automatically extracting semantic information from search forms and response pages.
HiWE is built on a generic operational model of a hidden Web crawler, whose components include an internal form representation, a task-specific database, a matching function, and response analysis. The task-specific database is organized as a finite set of concepts or categories, each associated with one or more labels. The matching function compares the labels found on a form with the labels in the database to compute a set of candidate value assignments for the form's fields.
The authors also introduce performance metrics for hidden Web crawlers, including submission efficiency, the ratio of successful form submissions to total submissions. They discuss the main challenges of designing a hidden Web crawler: it must automatically parse and process form-based search interfaces, and it must supply input to those forms in the form of search queries. The paper describes experiments conducted to test and validate the techniques, and the reported results indicate that HiWE can effectively crawl and extract content from hidden Web databases.
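To make the role of the matching function concrete, the following is a minimal sketch rather than HiWE's actual implementation. It assumes a small, hypothetical task-specific database (`TASK_DB`) keyed by concept, and it uses the similarity ratio from Python's `difflib` as a stand-in for whatever label-matching method the paper uses. The names `normalize`, `best_concept`, and `candidate_assignments`, and the 0.7 threshold, are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical task-specific database: concept -> labels and candidate values.
TASK_DB = {
    "company": {"labels": ["company name", "organization"],
                "values": ["IBM", "Intel"]},
    "topic": {"labels": ["subject", "topic", "keywords"],
              "values": ["semiconductors"]},
}


def normalize(label):
    """Lowercase a label and replace punctuation with single spaces."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " "
                      for ch in label.lower())
    return " ".join(cleaned.split())


def best_concept(form_label, db, threshold=0.7):
    """Return the concept whose label best matches form_label, or None.

    Assumption: difflib's similarity ratio stands in for the paper's
    label-matching method; the threshold value is arbitrary.
    """
    target = normalize(form_label)
    best, best_score = None, 0.0
    for concept, entry in db.items():
        for db_label in entry["labels"]:
            score = SequenceMatcher(None, target, normalize(db_label)).ratio()
            if score > best_score:
                best, best_score = concept, score
    return best if best_score >= threshold else None


def candidate_assignments(form_labels, db, threshold=0.7):
    """Build every combination of candidate values across the form's fields."""
    per_field = []
    for field in form_labels:
        concept = best_concept(field, db, threshold)
        if concept is None:
            return []  # one unmatched field means the form cannot be filled
        per_field.append([(field, value) for value in db[concept]["values"]])
    return [dict(combo) for combo in product(*per_field)]


assignments = candidate_assignments(["Company Name:", "Topic"], TASK_DB)
for assignment in assignments:
    print(assignment)
```

In this sketch a form is skipped entirely if any of its fields has no sufficiently similar database label; a real crawler might instead fall back to default values or partial submissions.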
The authors conclude that their approach is effective for crawling the hidden Web and that their operational model sets the stage for designing a variety of hidden Web crawlers.
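The submission efficiency metric mentioned in the summary is a simple ratio, and a short sketch shows how it might be computed from a crawl log. The `Submission` record, its `num_results` field, and the rule that a submission counts as successful when its response page yields at least one result are assumptions made for illustration, not definitions taken from the summary.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    """Hypothetical log record for one form submission."""
    form_url: str
    num_results: int  # results extracted from the response page


def submission_efficiency(submissions):
    """Ratio of successful form submissions to total submissions.

    Assumption: a submission is successful if its response page
    contained at least one result.
    """
    total = len(submissions)
    if total == 0:
        return 0.0  # no submissions made; report zero rather than divide by zero
    successful = sum(1 for s in submissions if s.num_results > 0)
    return successful / total


log = [
    Submission("http://example.com/search", 12),
    Submission("http://example.com/search", 0),
    Submission("http://example.com/search", 3),
    Submission("http://example.com/search", 7),
]
print(submission_efficiency(log))
```

A value close to 1.0 would suggest that most of the crawler's value assignments produce useful response pages, while a low value would suggest the matching step is generating many queries that return nothing.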