The machine-learning-based NER module requires a different configuration file. It consists of the following sections:
<RGF>
contains one line with the path to the RGF file of the model. This file is the definition of the features that will be taken into account for NER. These features are processed by libfries.
<RGF> ner.rgf </RGF>
<AdaBoostModel>
contains one line with the path to the model file learnt with AdaBoost. These models are learnt and used by libomlet.
<AdaBoostModel> ner.abm </AdaBoostModel>
<Lexicon>
contains one line with the path to the lexicon file of the learnt model. The lexicon is used to translate string-encoded features generated by libfries to integer-encoded features needed by libomlet. The lexicon file is generated by libfries at training time.
<Lexicon> ner.lex </Lexicon>
<Classes>
contains only one line with the classes of the model and their translation to B, I, O tags.
<Classes> 0 B 1 I 2 O </Classes>
<InitialProb>
Contains the probabilities of seeing each class at the beginning of a sentence. These probabilities are necessary for the Viterbi algorithm used to annotate NEs in a sentence.
<InitialProb> B 0.200072 I 0.0 O 0.799928 </InitialProb>
<TransitionProb>
Contains the transition probabilities for each class to each other class, used by the Viterbi algorithm.
<TransitionProb> B B 0.00829346 B I 0.395481 B O 0.596225 I B 0.0053865 I I 0.479818 I O 0.514795 O B 0.0758838 O I 0.0 O O 0.924116 </TransitionProb>
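To illustrate how these two sections are used, the following is a minimal sketch of Viterbi decoding over B, I, O tags. It is not the module's actual implementation: the probabilities are copied from the examples above, while the function name `viterbi` and the per-word `emissions` scores (a stand-in for whatever the AdaBoost classifier produces) are assumptions made for this example.

```python
import math

# Values copied from the <InitialProb> and <TransitionProb> examples above.
INITIAL = {"B": 0.200072, "I": 0.0, "O": 0.799928}
TRANSITION = {
    ("B", "B"): 0.00829346, ("B", "I"): 0.395481, ("B", "O"): 0.596225,
    ("I", "B"): 0.0053865, ("I", "I"): 0.479818, ("I", "O"): 0.514795,
    ("O", "B"): 0.0758838, ("O", "I"): 0.0, ("O", "O"): 0.924116,
}
TAGS = ("B", "I", "O")


def log(p):
    """Log-probability; a zero probability maps to -inf so that path is never preferred."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(emissions):
    """Return the most probable B/I/O sequence for one sentence.

    `emissions` holds one dict per word, mapping each tag to a hypothetical
    per-word probability (an assumed stand-in for the classifier output).
    """
    if not emissions:
        return []
    # best[t][tag] = (log-score of the best path ending in `tag` at word t, previous tag)
    best = [{tag: (log(INITIAL[tag]) + log(emissions[0][tag]), None) for tag in TAGS}]
    for scores in emissions[1:]:
        column = {}
        for tag in TAGS:
            # Pick the predecessor that maximizes the score of reaching `tag`.
            prev = max(TAGS, key=lambda p: best[-1][p][0] + log(TRANSITION[(p, tag)]))
            score = best[-1][prev][0] + log(TRANSITION[(prev, tag)]) + log(scores[tag])
            column[tag] = (score, prev)
        best.append(column)
    # Backtrack from the best final tag to recover the whole sequence.
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical scores for a three-word sentence.
sentence_scores = [
    {"B": 0.7, "I": 0.2, "O": 0.1},
    {"B": 0.3, "I": 0.6, "O": 0.1},
    {"B": 0.05, "I": 0.05, "O": 0.9},
]
print(viterbi(sentence_scores))
```

Note that the zero entries in the example values (I at the beginning of a sentence, and the O to I transition) mean that an I tag can only follow a B or another I, whatever the per-word scores are.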
<TitleLimit>
contains only one line with an integer
value stating the length beyond which a sentence written entirely in uppercase will be considered a title and not a proper
noun. Example:
<TitleLimit> 3 </TitleLimit>
If TitleLimit=0 (the default), title detection is deactivated (i.e., all-uppercase sentences are always marked as named entities).
The idea of this heuristic is that newspaper titles are usually written in uppercase, and tend to have at least two or three words, while named entities written in this way tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.
For instance, if TitleLimit=3, the sentence
FREELING ENTERS NASDAC UNDER CLOSE OBSERVATION OF MARKET ANALYSTS
will not be recognized as a named entity, and will have its words analyzed independently. On the other hand, the sentence IBM INC., having fewer than 3 words, will be considered a proper noun.
Obviously this heuristic is not 100% accurate, but in some cases (e.g. if you are analyzing newspapers) it may be preferable to the default behaviour (which is not 100% accurate, either).
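The heuristic can be sketched in a few lines. This is only an illustration, not the module's code: the function name `is_title` is hypothetical, and the sketch assumes that length is counted in whitespace-separated words and that a sentence with exactly TitleLimit words already counts as a title (the text only states that fewer words yield a proper noun).

```python
def is_title(words, title_limit):
    """Return True if an all-uppercase sentence should be treated as a title.

    Assumptions: length is measured in words, and a sentence with exactly
    `title_limit` words is already considered a title.
    """
    if title_limit == 0:
        # Default setting: title detection is deactivated.
        return False
    # Ignore tokens without letters (e.g. punctuation) when checking the case.
    alpha = [w for w in words if any(c.isalpha() for c in w)]
    all_upper = bool(alpha) and all(w.isupper() for w in alpha)
    return all_upper and len(words) >= title_limit


headline = "FREELING ENTERS NASDAC UNDER CLOSE OBSERVATION OF MARKET ANALYSTS".split()
print(is_title(headline, 3))            # long all-uppercase sentence
print(is_title("IBM INC.".split(), 3))  # short all-uppercase sentence
```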
<SplitMultiwords>
contains only one line with either yes or no. If SplitMultiwords is activated, Named Entities will still be recognized, but they will not be treated as a unit with only one Part-of-Speech tag for the whole compound. Each word gets its own Part-of-Speech tag instead: capitalized words get the tag specified in NE_Tag, while the Part-of-Speech tags of non-capitalized words inside a Named Entity (typically, prepositions and articles) are left untouched.
<SplitMultiwords> no </SplitMultiwords>
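As a rough illustration of the file layout described in this section, the sketch below reads the sections into plain Python values. It is not the parser used by the library: the names `SECTION` and `parse_ner_config` are hypothetical, and the sketch assumes that each section is written as an opening tag, a whitespace-separated body, and a matching closing tag, and that paths contain no spaces.

```python
import re

# Matches "<Name> body </Name>", with the body on the same line or on following lines.
SECTION = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)


def parse_ner_config(text):
    """Parse the sections described above into a dict (illustrative only)."""
    raw = {name: body.split() for name, body in SECTION.findall(text)}
    config = {}
    for name in ("RGF", "AdaBoostModel", "Lexicon"):
        if name in raw:
            config[name] = raw[name][0]  # a single file path
    if "Classes" in raw:
        c = raw["Classes"]  # pairs: class number, B/I/O tag
        config["Classes"] = {int(c[i]): c[i + 1] for i in range(0, len(c), 2)}
    if "InitialProb" in raw:
        p = raw["InitialProb"]  # pairs: tag, probability
        config["InitialProb"] = {p[i]: float(p[i + 1]) for i in range(0, len(p), 2)}
    if "TransitionProb" in raw:
        t = raw["TransitionProb"]  # triples: from-tag, to-tag, probability
        config["TransitionProb"] = {
            (t[i], t[i + 1]): float(t[i + 2]) for i in range(0, len(t), 3)
        }
    if "TitleLimit" in raw:
        config["TitleLimit"] = int(raw["TitleLimit"][0])
    if "SplitMultiwords" in raw:
        config["SplitMultiwords"] = raw["SplitMultiwords"][0] == "yes"
    return config


example = """
<Classes> 0 B 1 I 2 O </Classes>
<InitialProb> B 0.200072 I 0.0 O 0.799928 </InitialProb>
<TitleLimit> 3 </TitleLimit>
<SplitMultiwords> no </SplitMultiwords>
"""
print(parse_ner_config(example))
```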
Lluís Padró 2010-09-02