15th European Conference on Artificial Intelligence
July 21-26, 2002, Lyon, France
Benjamin Habegger, Mohamed Quafafou
Numerous sources of data are available on the web, for instance product catalogs, multiple directories, and conference and event sites. Extracting information from the content of these sources is a challenging problem, since they are heterogeneous and dynamic. This paper presents a new method for extracting wrappers and relations from the web using both page encoding and context generalization. Its starting point is a training set of instances of the relation the user wishes to extract. Multiple patterns are then extracted by considering the occurrences of the input instances in the data source. The generalization of these patterns allows us to identify new occurrences of the relation in the same data source. The main features of this method are its genericity and its robustness to the diversity of sources. Its efficiency is shown by experimental results on different sources, such as search engines, shopping sites, product catalogs, and paper listings.
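The pipeline outlined in the abstract (seed instances, patterns taken from their occurrences, generalization, then extraction of new occurrences) can be illustrated with a minimal sketch. This is not the authors' algorithm: the page encoding step is omitted, the generalization used here (the longest shared context on each side of an occurrence) is a simplifying assumption, and the names `extract_patterns`, `generalize`, and `apply_patterns` as well as the sample page are hypothetical.

```python
import re
from collections import defaultdict
from os.path import commonprefix


def extract_patterns(page, seeds, width=8):
    """Record the (prefix, middle, suffix) context of every seed occurrence."""
    patterns = set()
    for a, b in seeds:
        # Look for the two fields of a seed pair on the same line.
        for m in re.finditer(re.escape(a) + r"(.*?)" + re.escape(b), page):
            prefix = page[max(0, m.start() - width):m.start()]
            suffix = page[m.end():m.end() + width]
            patterns.add((prefix, m.group(1), suffix))
    return patterns


def generalize(patterns):
    """Assumed generalization: keep only the context shared by all
    patterns that have the same middle string."""
    by_middle = defaultdict(list)
    for prefix, middle, suffix in patterns:
        by_middle[middle].append((prefix, suffix))
    generalized = []
    for middle, contexts in by_middle.items():
        prefixes = [p for p, _ in contexts]
        suffixes = [s for _, s in contexts]
        # Longest common ending of the prefixes (compare reversed strings).
        left = commonprefix([p[::-1] for p in prefixes])[::-1]
        # Longest common beginning of the suffixes.
        right = commonprefix(suffixes)
        generalized.append((left, middle, right))
    return generalized


def apply_patterns(page, generalized):
    """Use each generalized pattern as a regular expression over the page."""
    found = set()
    for left, middle, right in generalized:
        regex = (re.escape(left) + r"(.+?)" + re.escape(middle)
                 + r"(.+?)" + re.escape(right))
        found.update(re.findall(regex, page))
    return found


# Hypothetical data source and training instances.
page = (
    "<li><b>Dune</b> by <i>Frank Herbert</i></li>\n"
    "<li><b>Neuromancer</b> by <i>William Gibson</i></li>\n"
    "<li><b>Solaris</b> by <i>Stanislaw Lem</i></li>\n"
)
seeds = [("Dune", "Frank Herbert"), ("Neuromancer", "William Gibson")]

found = apply_patterns(page, generalize(extract_patterns(page, seeds)))
print(sorted(found))
```

On this toy page, the two seed pairs yield a single generalized pattern, which also matches the third, unseen pair. A real source would need more careful handling of whitespace, markup variation, and overly general (empty) contexts.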
Keywords: Information Extraction, Wrapper Induction, Relation Discovery, Machine Learning
Citation: Benjamin Habegger, Mohamed Quafafou: Multi-Pattern Wrappers for Relation Extraction from the Web. In F. van Harmelen (ed.): ECAI2002, Proceedings of the 15th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2002, pp. 395-399.