The file that controls the behaviour of the simple NE recognizer consists of the following sections:
<FunctionWords>
lists the function words that can be
embedded inside a proper noun (e.g. prepositions and articles such
as those in ``Banco de España'' or ``Foundation for the Eradication
of Poverty''). For instance:
<FunctionWords> el la los las de del para </FunctionWords>
<SpecialPunct>
lists the PoS tags (according to
the punctuation tags definition file, section 3.5) after
which a capitalized word may just indicate the beginning of a
sentence or clause and not necessarily a named entity. Typical cases are
colon, open parenthesis, dot, hyphen, etc.
<SpecialPunct> Fpa Fp Fd Fg </SpecialPunct>
<NE_Tag>
contains only one line with the PoS tag that
will be assigned to the recognized entities. If the NE classifier is
going to be used later, it will have to be informed of this tag at
creation time.
<NE_Tag> NP00000 </NE_Tag>
<Ignore>
contains a list of forms (lowercased)
or PoS tags (uppercased) that are not to be considered a named
entity even when they appear capitalized in the middle of a
sentence. For instance, the word ``Spanish'' in the sentence
``He started studying Spanish two years ago'' is not a named
entity. If the words in the list appear with other capitalized
words, they are considered to form a named entity (e.g. ``An
announcement of the Spanish Bank of Commerce was issued
yesterday''). The same distinction applies to the word ``I'' in
the sentences ``whatever you say, I don't believe'' and ``That was the death of Henry I''.
Each word or tag is followed by a 0 or a 1 indicating whether
the ignore condition is strict (0: non-strict, 1: strict).
The entries marked as non-strict will have the
behaviour described above. The entries marked as strict will
never be considered named entities or NE parts.
For instance, the following <Ignore>
section states that
the word ``I'' is not to be considered a proper noun (``whatever you say,
I don't believe'') unless some of its neighbouring words are (``That was the death of Henry I''). It also states that any word
with the RB tag, and any of the listed language names, must
never be considered as possible NEs.
<Ignore> i 0 RB 1 english 1 dutch 1 spanish 1 </Ignore>
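The strict and non-strict behaviour can be sketched in Python as follows. This is only an illustration of the rule described above, not the recognizer's actual implementation: the function name may_be_ne and the has_capitalized_neighbour flag are hypothetical, and the sketch assumes that a strict entry wins when both the form and the tag of a word are listed.

```python
# Hypothetical model of the <Ignore> section: form or tag -> strict flag.
IGNORE = {"i": 0, "RB": 1, "english": 1, "dutch": 1, "spanish": 1}

def may_be_ne(form, tag, has_capitalized_neighbour):
    """Return True if a capitalized mid-sentence word may be (part of) an NE."""
    # Forms are listed lowercased, PoS tags uppercased.
    flags = [IGNORE[key] for key in (form.lower(), tag) if key in IGNORE]
    if not flags:
        return True       # not listed: no ignore condition applies
    if max(flags) == 1:
        return False      # strict: never an NE, nor part of one
    # Non-strict: acceptable only next to other capitalized words.
    return has_capitalized_neighbour
```

With this model, ``I'' in ``whatever you say, I don't believe'' is rejected, while ``I'' in ``Henry I'' is accepted because its neighbour is capitalized.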
<Names>
contains a list of lemmas that may be names, even if they conflict with some of the heuristic criteria used by the NE recognizer. This is useful when they appear capitalized at the beginning of a sentence. For instance, the Basque name Miren (Mary) or the nickname Pelé may appear at the beginning of a Spanish sentence. Since both of them are verbal forms in Spanish, they would not otherwise be considered candidates to form named entities.
Including the form in the <Names>
section causes the NE choice to be added to the possible tags of the form, giving the tagger the chance to decide whether it is actually a verb or a proper noun.
<Names> miren pelé zapatero china </Names>
<RE_NounAdj>
<RE_Closed>
and <RE_DateNumPunct>
allow modifying the default regular expressions for PAROLE Part-of-Speech tags. These regular expressions are used by the NER to determine whether a sentence-beginning word has some tag that is Noun or Adj, or any tag that is a closed category, or one of date/punctuation/number. The default is to check against PAROLE tags; thus, the recognizer will fail to identify these categories if your dictionary uses another tagset, unless you specify the right patterns to look for.
For instance, if our dictionary uses Penn-Treebank-like tags, we should define:
<RE_NounAdj> ^(NN$|NNS|JJ) </RE_NounAdj>
<RE_Closed> ^(D|IN|C) </RE_Closed>
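To see which tags these two patterns accept, they can be tried out in Python. This is a minimal sketch that assumes the recognizer's regular expression engine treats these simple patterns the same way as Python's re module; the helper names is_noun_or_adj and is_closed are hypothetical.

```python
import re

# The two patterns from the Penn-Treebank-like example above.
RE_NOUN_ADJ = re.compile(r"^(NN$|NNS|JJ)")
RE_CLOSED = re.compile(r"^(D|IN|C)")

def is_noun_or_adj(tag):
    """True if the tag is matched by the RE_NounAdj pattern."""
    return RE_NOUN_ADJ.search(tag) is not None

def is_closed(tag):
    """True if the tag is matched by the RE_Closed pattern."""
    return RE_CLOSED.search(tag) is not None
```

Note that the $ in NN$ keeps the proper noun tags NNP and NNPS out of the noun/adjective class, while JJ acts as a prefix and therefore also accepts JJR and JJS.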
<TitleLimit>
contains only one line with an integer
value stating the length beyond which a sentence written entirely in uppercase will be considered a title and not a proper
noun. Example:
<TitleLimit> 3 </TitleLimit>
If TitleLimit=0
(the default), title detection is
deactivated (i.e., all-uppercase sentences are always marked as
named entities).
The idea of this heuristic is that newspaper titles are usually written in uppercase, and tend to have at least two or three words, while named entities written in this way tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.
For instance, if TitleLimit=3
the sentence
FREELING ENTERS NASDAC UNDER CLOSE OBSERVATION OF MARKET ANALYSTS
will not be recognized as a named entity, and will have its words analyzed
independently. On the other hand, the sentence IBM INC., having fewer than
3 words, will be considered a proper noun.
Obviously this heuristic is not 100% accurate, but in some cases (e.g. if you are analyzing newspapers) it may be preferable to the default behaviour (which is not 100% accurate, either).
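The heuristic can be sketched in Python as follows. This is an illustration under stated assumptions, not the recognizer's actual code: the function name is_title is hypothetical, words are taken to be whitespace-separated, and ``beyond'' is read as strictly more words than TitleLimit.

```python
def is_title(sentence, title_limit):
    """Hypothetical check: is an all-uppercase sentence a title, not an NE?"""
    if title_limit <= 0:
        return False      # 0 (the default) deactivates title detection
    if not sentence.isupper():
        return False      # the heuristic only concerns all-uppercase sentences
    # Assumption: a title has strictly more words than the limit.
    return len(sentence.split()) > title_limit
```

With title_limit=3 the newspaper headline above is treated as a title, while IBM INC. is not and remains a candidate named entity.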
<SplitMultiwords>
contains only one line with either yes
or no. If SplitMultiwords
is activated, named entities will still be recognized, but they will not be treated as a unit with only one Part-of-Speech tag for the whole compound. Each word gets its own Part-of-Speech tag instead:
capitalized words get the tag specified in NE_Tag, while the Part-of-Speech tags of non-capitalized words inside a named entity (typically, prepositions and articles) are left untouched.
<SplitMultiwords> no </SplitMultiwords>
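For experimenting with a file of this shape outside the recognizer, the sections can be read with a few lines of Python. This is only a sketch: the function name read_sections is hypothetical, and it assumes that every section is delimited by matching <Name> and </Name> markers and that its entries are separated by whitespace.

```python
import re

# A section is an opening <Name>, its body, and the matching </Name>.
SECTION = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def read_sections(text):
    """Map each section name to the list of whitespace-separated entries."""
    return {name: body.split() for name, body in SECTION.findall(text)}

config = """
<FunctionWords> el la los las de del para </FunctionWords>
<NE_Tag> NP00000 </NE_Tag>
<Ignore> i 0 RB 1 english 1 </Ignore>
<TitleLimit> 3 </TitleLimit>
<SplitMultiwords> no </SplitMultiwords>
"""

sections = read_sections(config)
# <Ignore> entries come in pairs: a form or tag, then its strict flag.
pairs = sections["Ignore"]
ignore = dict(zip(pairs[0::2], map(int, pairs[1::2])))
title_limit = int(sections["TitleLimit"][0])
```

Keeping the parsed entries as plain lists leaves the interpretation of each section (pairs for Ignore, a single integer for TitleLimit) to the caller.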
Lluís Padró 2010-09-02