Understanding Crawling the Hidden Web

The paper "Crawling the Hidden Web" by Sriram Raghavan and Hector Garcia-Molina addresses the challenge of extracting content from the hidden web: pages that sit behind search forms or require authorization. Traditional crawlers retrieve content only from the publicly indexable web (PIW), so they miss the large portion of the web that is "hidden" in searchable databases. The authors propose a task-specific, human-assisted approach to crawling the hidden web that selectively extracts content relevant to a particular application or task.
They introduce HiWE (Hidden Web Exposer), a prototype crawler that is built on a generic operational model and uses a new technique called LITE (Layout-based Information Extraction Technique) to automatically extract semantic information from search forms and response pages. The paper also describes the design of HiWE, including its architecture, form representation, task-specific database, matching function, and response analysis. Experiments demonstrate the feasibility and effectiveness of the proposed techniques, showing high submission efficiency and successful form processing. The authors conclude by noting room for further improvement in HiWE, particularly in handling dependencies between form elements and in filling out forms partially.
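
To make the roles of the form representation, task-specific database, and matching function more concrete, the following is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not HiWE's actual implementation: the names FormElement, normalize, label_similarity, and match_form, the dictionary-based database, the use of difflib for approximate label matching, and the 0.8 threshold are all hypothetical. The sketch also assumes that a form is skipped when any element cannot be matched, which loosely reflects the summary's note that partial form filling is left as future work.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional


@dataclass
class FormElement:
    """One form element: its visible label and, if finite, its domain.

    Hypothetical structure for illustration; not taken from the paper.
    """
    label: str
    domain: Optional[List[str]] = None  # e.g. select-box options; None for free text


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def label_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_form(
    elements: List[FormElement],
    task_db: Dict[str, List[str]],
    threshold: float = 0.8,  # assumed cutoff, chosen only for this sketch
) -> Optional[Dict[str, List[str]]]:
    """Assign candidate values to every element, or return None if any fails.

    Finite-domain elements use their own domain values. Free-text elements
    take the values of the most similar label in the task-specific database,
    provided the similarity reaches the threshold.
    """
    assignment: Dict[str, List[str]] = {}
    for element in elements:
        if element.domain is not None:
            assignment[element.label] = list(element.domain)
            continue
        best_label, best_score = None, 0.0
        for db_label in task_db:
            score = label_similarity(element.label, db_label)
            if score > best_score:
                best_label, best_score = db_label, score
        if best_label is None or best_score < threshold:
            return None  # assumption: no partial form filling
        assignment[element.label] = list(task_db[best_label])
    return assignment
```

In this sketch, a form with a free-text element labeled "Company Name:" and a select box labeled "State" would be matched against a database entry such as "company name", while a form whose only element is labeled "Zip code" would be skipped because no database label is similar enough.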

Crawling the Hidden Web

Roma, Italy, 2001 | Sriram Raghavan, Hector Garcia-Molina