Wrapper Induction for Information Extraction

This paper introduces wrapper induction, a method for automatically constructing wrappers that extract relational data from Internet resources. The authors propose HLRT (head-left-right-tail), a wrapper class that is efficiently learnable yet expressive enough to handle 48% of a surveyed sample of Internet resources. An HLRT wrapper uses head, left, right, and tail delimiters to mark the text to extract. The learning algorithm iterates over all possible choices for the delimiters until it finds a wrapper consistent with the labeled examples. PAC analysis bounds the sample complexity of learning, and the system is shown to degrade gracefully with imperfect labeling knowledge. The authors describe how to compose the labeling oracle from heuristic knowledge. Because the recognizers that identify instances of particular attributes on a page may be imperfect, a corroboration algorithm combines their output into a label array, resolving ambiguities by considering the possible locations of instances; the PAC model is extended to accommodate missing and ambiguous attributes. An empirical evaluation shows that HLRT wrapper induction is practical even for high error rates. The authors also describe WIEN, a wrapper induction environment in which users label example pages and learn a wrapper for the resource. The paper concludes that wrapper induction is a new technique for automatically constructing wrappers and names three contributions: formalizing the wrapper construction problem as induction, defining the HLRT bias, and showing how to use heuristic knowledge to compose the algorithm's oracle. It also discusses related work and future research directions.
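To make the head-left-right-tail idea concrete, the following is a minimal sketch of how an HLRT wrapper might be executed on a page. It is not the authors' implementation: the function name `exec_hlrt`, its parameters, and the sample page are assumptions chosen for illustration. The sketch skips past the head delimiter, then repeatedly extracts one value per attribute between that attribute's left and right delimiters, stopping once the tail delimiter comes before the next tuple.

```python
from typing import List, Tuple


def exec_hlrt(page: str, head: str, tail: str,
              delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract tuples from `page` using an HLRT-style wrapper (illustrative sketch).

    `delimiters` holds one (left, right) pair per attribute, in page order.
    """
    # Skip past the first occurrence of the head delimiter.
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)

    first_left = delimiters[0][0]
    rows: List[Tuple[str, ...]] = []
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple begins before the tail delimiter.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            begin = page.find(left, pos)
            if begin == -1:
                return rows  # malformed page: give up on the partial tuple
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return rows
            row.append(page[begin:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Hypothetical sample page listing (country, code) pairs.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)

RESULT = exec_hlrt(PAGE, head="<P>", tail="<HR>",
                   delimiters=[("<B>", "</B>"), ("<I>", "</I>")])
print(RESULT)
```

In this sample, the head delimiter keeps the bold page heading from being mistaken for a country name, and the tail delimiter keeps the bold footer text out of the results. Learning a wrapper then amounts to searching for delimiter strings under which this procedure reproduces the labeled examples.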

Wrapper Induction for Information Extraction

1997 | Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos