The tokenizer rules file is divided into three sections: <Macros>, <RegExps>, and <Abbreviations>. Each section is closed by its corresponding tag: </Macros>, </RegExps>, and </Abbreviations>, respectively.
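As an illustration of this layout, the following minimal Python sketch splits a rules file into its three sections. It is not FreeLing's own loader: the function name split_sections and the sample file content are hypothetical, and the sketch assumes each opening and closing tag sits alone on its own line.

```python
import re

# Hypothetical file content: the section names come from the text above,
# the entries are only illustrative.
RULES_TEXT = r"""<Macros>
ALPHA [A-Za-z]
</Macros>
<RegExps>
*ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
</RegExps>
<Abbreviations>
etc.
mrs.
</Abbreviations>
"""

def split_sections(text):
    """Return a dict mapping each section name to its non-empty lines."""
    sections = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        opening = re.fullmatch(r"<(\w+)>", line)
        closing = re.fullmatch(r"</(\w+)>", line)
        if opening:
            # Start collecting lines for a new section.
            current = opening.group(1)
            sections[current] = []
        elif closing:
            # Closing tag: stop collecting until the next opening tag.
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections

sections = split_sections(RULES_TEXT)
```

After the call, sections holds one list of lines per section, keyed by "Macros", "RegExps" and "Abbreviations".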
The <Macros> section allows the user to define regexp macros that will be used later in the rules. Each macro is defined by a name and a Perl regexp.
E.g. ALPHA [A-Za-z]
The <RegExps> section defines the tokenization rules. Previously defined macros may be referred to by their name in curly brackets.
E.g. *ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
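To make the macro mechanism concrete, here is a small sketch of how a macro reference in curly brackets could be expanded before the rule is compiled. The helper expand_macros is hypothetical and is not part of FreeLing. It also uses Python's re module rather than Perl regexps, which is an assumption that happens to hold for this particular pattern.

```python
import re

# Macro table as it could be read from the <Macros> section.
macros = {"ALPHA": "[A-Za-z]"}

def expand_macros(pattern, macros):
    """Replace each {NAME} with its macro body; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        return macros.get(name, match.group(0))
    # Only identifiers in braces are treated as macro references,
    # so regexp quantifiers such as {2,3} are not touched.
    return re.sub(r"\{([A-Za-z_]\w*)\}", substitute, pattern)

rule_regexp = r"(({ALPHA}+\.)+)(?!\.\.)"
expanded = expand_macros(rule_regexp, macros)

# The expanded pattern matches a dotted sequence at the start of the text.
match = re.match(expanded, "U.S.A. is")
```

Restricting the substitution to identifier-like names is a design choice of this sketch, made so that ordinary regexp repetition counts in braces survive the expansion.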
Rules are regular expressions, and are applied in the order of definition. The first rule matching the beginning of the line is applied, a token is built, and the rest of the rules are ignored. The process is repeated until the line has been completely processed.
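The first-match strategy described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the three rules and their names are hypothetical, blanks between tokens are assumed to be skipped, and the sketch ignores the other fields of a rule as well as the abbreviation check.

```python
import re

# Hypothetical rules as (name, compiled regexp) pairs, in order of definition.
RULES = [
    ("NUMBER", re.compile(r"[0-9]+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("OTHER", re.compile(r"\S")),
]

def tokenize(line, rules):
    """Apply the first rule matching at the current position, then repeat."""
    tokens = []
    pos = 0
    while pos < len(line):
        # Assumption: whitespace separates tokens and is never a token itself.
        if line[pos].isspace():
            pos += 1
            continue
        for name, regexp in rules:
            match = regexp.match(line, pos)
            if match and match.end() > pos:
                # First matching rule wins; the remaining rules are ignored.
                tokens.append((name, match.group(0)))
                pos = match.end()
                break
        else:
            # No rule matched: consume one character so the loop always advances.
            tokens.append(("UNMATCHED", line[pos]))
            pos += 1
    return tokens

tokens = tokenize("Room 42, please.", RULES)
```

Because rules are tried in order, placing a catch-all rule such as OTHER last ensures that more specific rules get the first chance to match.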
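The format of each rule is: a rule name, the index of the substring to form the token with (0 in the example above, meaning the whole match), and the regexp to match against the input line. If the rule name starts with a *, the regexp only produces a token if the match is found in the abbreviation list (<Abbreviations> section).
Apart from that, the rule name is only for informative/readability purposes.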
The <Abbreviations> section defines common abbreviations (one per line) that must not be separated from their following dot (e.g. etc., mrs.). They must be lowercased, even if they are expected to appear uppercased in the text.
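The lowercasing requirement suggests that candidates are normalized before being looked up in the list. The sketch below illustrates that idea. The function name is_known_abbreviation is hypothetical, and lowercasing the candidate before the lookup is an assumption about how the list is consulted, not a description of FreeLing's code.

```python
# Abbreviations as they could be listed in the <Abbreviations> section.
ABBREVIATIONS = {"etc.", "mrs."}

def is_known_abbreviation(candidate, abbreviations=ABBREVIATIONS):
    """Check a candidate token against the list, ignoring its original case."""
    # Assumption: the text is lowercased before the lookup, which is why
    # the entries themselves must be stored in lowercase.
    return candidate.lower() in abbreviations
```

With this lookup, "Mrs." and "ETC." in the text are both recognized, whereas an entry written as "Mrs." in the list would never match.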
Lluís Padró 2010-09-02