A comparative analysis of grammar extraction from text corpora

Alessandro Mazzei, Vincenzo Lombardo

This paper addresses the issue of what type of grammar knowledge is extracted from corpora. In particular, the paper analyzes two lexicalized tree adjoining grammars (LTAGs) extracted from two types of corpora: a collection of newspaper articles and a collection of laws from the civil code. To compare the two extracted grammars, we have implemented a coverage test in which the grammar extracted from one corpus is applied to the other corpus. The civil code grammar covers 66% of the newspaper corpus, while the newspaper grammar covers only 46% of the civil code corpus, revealing a wider coverage for the civil code grammar. An explanation of these results relies on a deeper analysis of the grammar rules: the newspaper grammar is larger than the civil code grammar in terms of the number of different rules, but only a reduced number of these rules is representative of the language, while the others are more specific to some context; the civil code grammar is smaller in terms of different rules, but these are more representative of the language and are more uniformly distributed in the corpus sentences. The conclusion is that the grammar knowledge from the civil code can be more easily exported to tasks concerning other types of corpora. To validate this conclusion, we have implemented a test based on the results of a rule-based parser, which has confirmed the greater generality of the civil code grammar.
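
The cross-corpus coverage test described above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the paper: each sentence is represented as the list of rule (elementary tree) templates needed to derive it, and a sentence counts as covered when all of its templates belong to the grammar extracted from the other corpus. The function names `extract_grammar` and `coverage`, as well as the toy corpora, are hypothetical and do not reproduce the reported figures.

```python
from typing import List, Set

Sentence = List[str]  # assumption: a sentence is the list of rule templates it uses


def extract_grammar(corpus: List[Sentence]) -> Set[str]:
    """Collect the distinct rule templates that occur anywhere in a corpus."""
    return {rule for sentence in corpus for rule in sentence}


def coverage(grammar: Set[str], corpus: List[Sentence]) -> float:
    """Fraction of sentences whose rule templates all belong to the grammar.

    Assumption: sentence-level coverage; the paper may measure it differently.
    """
    if not corpus:
        return 0.0
    covered = sum(1 for sentence in corpus if set(sentence) <= grammar)
    return covered / len(corpus)


# Toy corpora with made-up template names (illustration only).
corpus_a = [["alpha_NP", "alpha_V"], ["alpha_NP", "beta_Adj"]]
corpus_b = [["alpha_NP", "alpha_V"], ["alpha_NP", "beta_PP"], ["alpha_V"]]

grammar_a = extract_grammar(corpus_a)
grammar_b = extract_grammar(corpus_b)

# Apply each grammar to the corpus it was NOT extracted from.
a_on_b = coverage(grammar_a, corpus_b)
b_on_a = coverage(grammar_b, corpus_a)
print(f"grammar A on corpus B: {a_on_b:.2f}")
print(f"grammar B on corpus A: {b_on_a:.2f}")
```

In this sketch the two coverage values are generally asymmetric, which is the kind of difference the comparison between the civil code grammar and the newspaper grammar relies on.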

Keywords: Natural language processing, Grammar extraction, Parsing and Coverage

Citation: Alessandro Mazzei, Vincenzo Lombardo: A comparative analysis of grammar extraction from text corpora. In R. López de Mántaras and L. Saitta (eds.): ECAI 2004, Proceedings of the 16th European Conference on Artificial Intelligence, IOS Press, Amsterdam, 2004, pp. 601-605.

