Georgios Petasis, Vangelis Karkaletsis, Claire Grover, Benjamin Hachey, Maria-Teresa Pazienza, Michele Vindigni, Jose Coch
Identifying relevant web sites and web pages and extracting information from them is an interesting but complex task. Most of the information on the web today is in the form of HTML documents, which are designed for presentation rather than for machine understanding and reasoning. The extraction task becomes even harder in a multilingual context, where web pages in different languages need to be analysed. Most existing systems need to be manually configured for new domains, a process that requires substantial effort and time. This paper presents an adaptive, multilingual named entity recognition and classification (NERC) technology, which can be easily customised to new domains and to new languages. Our evaluation results demonstrate the viability of our approach.
Keywords: information extraction, named entity recognition, machine learning, multilinguality
Citation: Georgios Petasis, Vangelis Karkaletsis, Claire Grover, Benjamin Hachey, Maria-Teresa Pazienza, Michele Vindigni, Jose Coch: Adaptive, Multilingual Named Entity Recognition in Web Pages. In R. López de Mántaras and L. Saitta (eds.): ECAI 2004, Proceedings of the 16th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2004, pp. 1073-1074.