Tokenizer Rules File

The tokenizer rules file is divided into three sections: <Macros>, <RegExps>, and <Abbreviations>. Each section is closed by the corresponding </Macros>, </RegExps>, or </Abbreviations> tag.

The <Macros> section allows the user to define regexp macros that will be used later in the rules. Macros are defined with a name and a Perl regexp.
E.g. ALPHA [A-Za-z]

The <RegExps> section defines the tokenization rules. Previously defined macros may be referred to with their name in curly brackets.
E.g. *ABREVIATIONS1 0 (({ALPHA}+\.)+)(?!\.\.)
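To illustrate how a macro reference might be resolved, here is a minimal Python sketch. It is not the tokenizer's actual implementation: the function name expand_macros and the dictionary layout are assumptions made for this example, and macro references are assumed to be written as {NAME}.

```python
import re

# Hypothetical macro table, mirroring the <Macros> example (ALPHA [A-Za-z]).
macros = {"ALPHA": "[A-Za-z]"}

def expand_macros(pattern, macros):
    """Replace each {NAME} that names a defined macro with its regexp."""
    def substitute(match):
        name = match.group(1)
        # Unknown names and plain quantifiers such as {4} are left untouched.
        return macros.get(name, match.group(0))
    return re.sub(r"\{(\w+)\}", substitute, pattern)

expanded = expand_macros(r"(({ALPHA}+\.)+)(?!\.\.)", macros)
print(expanded)
```

Looking names up in a dictionary and falling back to the original text keeps ordinary regexp quantifiers such as {2,} intact, since only defined macro names are substituted.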

Rules are regular expressions, and are applied in the order of definition. The first rule matching the beginning of the line is applied, a token is built, and the rest of the rules are ignored. The process is repeated until the line has been completely processed.
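The first-match procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the rule names and patterns are hypothetical, whitespace is assumed to separate tokens, and the second field of each rule is ignored.

```python
import re

# Hypothetical ordered rule list: (name, compiled regexp).
# These are illustrative rules, not the ones shipped with the tokenizer.
rules = [
    ("NUMBER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
    ("OTHER", re.compile(r"\S")),
]

def tokenize(line, rules):
    """Apply the rules in order at the current position until the line is consumed."""
    tokens = []
    pos = 0
    while pos < len(line):
        # Assumption of this sketch: whitespace between tokens is skipped.
        if line[pos].isspace():
            pos += 1
            continue
        for name, regex in rules:
            m = regex.match(line, pos)  # anchored at pos, like "beginning of the line"
            if m and m.end() > pos:
                tokens.append((name, m.group(0)))
                pos = m.end()
                break  # first matching rule wins; later rules are ignored
        else:
            # No rule matched: emit one character so the loop always advances.
            tokens.append(("UNMATCHED", line[pos]))
            pos += 1
    return tokens

print(tokenize("room 42, please", rules))
```

Because the rules are tried in definition order, placing a more specific rule before a more general one determines which of them builds the token.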

The format of each rule is:

The <Abbreviations> section defines common abbreviations (one per line) that must not be separated from their following dot (e.g. etc., mrs.). They must be written in lowercase, even if they are expected to appear in uppercase in the text.
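The lowercase requirement suggests a case-insensitive lookup. A minimal Python sketch of such a check follows. The function name split_trailing_dot and the use of a set are assumptions for illustration, and the real tokenizer may perform this test differently.

```python
# Hypothetical abbreviation list, as it might be read from the <Abbreviations> section.
abbreviations = {"etc.", "mrs."}

def split_trailing_dot(candidate, abbreviations):
    """Keep the final dot attached only if the lowercased form is a known abbreviation."""
    if (
        len(candidate) > 1
        and candidate.endswith(".")
        and candidate.lower() not in abbreviations
    ):
        # Not an abbreviation: the dot becomes a token of its own.
        return [candidate[:-1], "."]
    return [candidate]

print(split_trailing_dot("Mrs.", abbreviations))
print(split_trailing_dot("end.", abbreviations))
```

Lowercasing the candidate before the lookup is what lets a single lowercase entry such as mrs. cover Mrs. and MRS. as well.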

Lluís Padró 2010-09-02