NEC Data File

The Named Entity Classification module requires three configuration files that share the same path and basename and differ only in their suffixes: .rgf, .lex, and .abm. Only the basename is given as a parameter at instantiation time; the file extensions are added automatically.
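The naming convention above can be illustrated with a minimal sketch that expands a basename into the three expected file names and reports which ones are missing. This is not FreeLing code: the helper names nec_data_files and missing_nec_files are hypothetical and are introduced here only for illustration.

```python
from pathlib import Path

# Suffixes named in this section; the module appends them to the basename.
NEC_SUFFIXES = (".rgf", ".lex", ".abm")


def nec_data_files(basename):
    """Return the three paths derived from a shared basename (hypothetical helper)."""
    # Plain string concatenation is used so that a basename which already
    # contains dots (e.g. "nec.v2") is left untouched.
    return [Path(str(basename) + suffix) for suffix in NEC_SUFFIXES]


def missing_nec_files(basename):
    """Return the expected files that do not exist on disk (hypothetical helper)."""
    return [path for path in nec_data_files(basename) if not path.is_file()]
```

A check of this kind, run before instantiating the module, may help to catch a mistyped basename early. For example, missing_nec_files("data/nec") (a made-up path) would list whichever of data/nec.rgf, data/nec.lex, and data/nec.abm are absent.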

The .abm file contains an AdaBoost model based on shallow Decision Trees (see [CMP03] for details). You don't need to understand its contents unless you want to work on the code of the AdaBoost classifier.

The .lex file is a dictionary that assigns a number to each symbolic feature used in the AdaBoost model. You don't need to understand this file either, unless you are a Machine Learning student or the like.

Both the .abm and .lex files may be generated from an annotated corpus using the training programs in the libomlet package
(see http://www.lsi.upc.edu/~nlp/omlet+fries).

The important file in the set is the .rgf file. It defines the context features that must be extracted for each named entity. The feature extraction language is that of [RCSY04], with some useful extensions.

If you need to know more about this (e.g. to develop an NE classifier for your language), please contact the FreeLing authors.

Lluís Padró 2010-09-02