Wrapper Induction for Information Extraction

1997 | Nicholas Kushmerick, Daniel S. Weld, Robert Doorenbos
The paper introduces wrapper induction, a method for automatically constructing wrappers that extract relational data from Internet resources. A wrapper is a procedure that translates a resource's query responses into relational form. Wrappers are typically hand-coded, which is tedious and error-prone. Wrapper induction instead learns a wrapper by generalizing from example query responses. The authors identify HLRT (head-left-right-tail) wrappers as a class that is efficiently learnable yet expressive enough to handle 48% of the Internet resources they surveyed. HLRT wrappers are designed for tabular layouts and use delimiters to locate the information to extract. The authors use PAC analysis to bound the sample complexity of learning, and show that the system degrades gracefully when its labeling knowledge is imperfect. The paper also discusses how heuristics can be composed to form a labeling oracle, and presents empirical evaluations showing the feasibility and robustness of the approach.
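To make the delimiter idea concrete, the following is a minimal sketch of how an HLRT wrapper might be executed once it has been learned. It is not the authors' code: the function name `exec_hlrt`, its parameters, and the sample page are all hypothetical, and the sketch assumes that the head delimiter marks the end of the page header, the tail delimiter marks the start of the footer, and each attribute has one left and one right delimiter.

```python
from typing import List, Tuple


def exec_hlrt(page: str, head: str, tail: str,
              delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Extract rows from `page` using head/tail and per-attribute delimiters.

    Hypothetical helper for illustration; `delimiters` holds one
    (left, right) pair per attribute, in column order.
    """
    # Skip the page header: scanning starts just after the head delimiter.
    start = page.find(head)
    if start == -1:
        return []
    pos = start + len(head)

    rows: List[Tuple[str, ...]] = []
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further row begins before the tail delimiter.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            begin = page.find(left, pos)
            if begin == -1:
                return rows  # malformed row: return what was extracted so far
            begin += len(left)
            end = page.find(right, begin)
            if end == -1:
                return rows
            row.append(page[begin:end])
            pos = end + len(right)
        rows.append(tuple(row))
    return rows


# Illustrative page: the header and footer both contain <B>...</B>, which is
# why left/right delimiters alone would not be enough here.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = exec_hlrt(page, head="<P>", tail="<HR>",
                 delimiters=[("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

In this sketch the head delimiter `<P>` keeps the bold page title from being read as a country name, and the tail delimiter `<HR>` keeps the bold `End` marker from starting a spurious row. Learning a wrapper then amounts to choosing these delimiter strings from labeled example pages.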