Affixation Rules File

The submodule of the dictionary handler that deals with affixes requires a set of affixation rules.

The file consists of two (optional) sections: <Suffixes> and <Prefixes>. The first one contains suffixation rules, and the second, prefixation rules. They may appear in any order.

Both kinds of rules have the same format and differ only in whether the affix is checked at the beginning or at the end of the word.

Each rule must be written on its own line and has 10 fields:

  1. Affix to erase from the word form (e.g. crucecita - cecita = cru)
  2. Affix (* for empty string) to add to the resulting root to rebuild the lemma that must be searched in the dictionary (e.g. cru + z = cruz)
  3. Condition on the parole tag of the dictionary entry found (e.g. cruz is NCFS). The condition is a Perl regular expression
  4. Parole tag for suffixed word (* = keep tag in dictionary entry)
  5. Check lemma adding accents
  6. Enclitic suffix (special accent behaviour in Spanish)
  7. Prevent later modules (e.g. probabilities) from assigning additional tags to the word
  8. Lemma to assign: Any combination of: F, R, L, A, or a string literal separated with a + sign. For instance: R+A, A+L, R+mente, etc.

    F stands for the original form (before affix removal, e.g. crucecitas), R stands for the root found in the dictionary (after affix removal and root reconstruction, e.g. cruces), L stands for the lemma in the matching dictionary entry (e.g. cruz), and A stands for the affix that the rule removed.

  9. Try the affix always, not only for unknown words.

  10. Retokenization info, explained below (- for none)
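
The field list above can be sketched as a small parser. This is only an illustration in Python, not the dictionary handler's actual code: the AffixRule class, its attribute names, and parse_rule are hypothetical, and the sketch assumes that fields are whitespace-separated, that the flag fields (5, 6, 7 and 9) are written as 0 or 1, and that alternatives in field 2 are separated by |, as in the examples in this section.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AffixRule:
    """Hypothetical container for the 10 fields of one rule line."""
    affix: str                # field 1: affix to erase from the word form
    additions: List[str]      # field 2: alternatives to add ("" stands for *)
    tag_condition: str        # field 3: regular expression on the dictionary tag
    output_tag: str           # field 4: tag for the affixed word (* = keep)
    check_accents: bool       # field 5
    enclitic: bool            # field 6
    block_later_tags: bool    # field 7
    lemma_spec: List[str]     # field 8: F, R, L, A or literals, split on +
    always: bool              # field 9
    retok: Optional[str]      # field 10: None when the field is "-"


def parse_rule(line: str) -> AffixRule:
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    return AffixRule(
        affix=fields[0],
        additions=["" if alt == "*" else alt for alt in fields[1].split("|")],
        tag_condition=fields[2],
        output_tag=fields[3],
        check_accents=fields[4] == "1",   # assumption: flags are 0 or 1
        enclitic=fields[5] == "1",
        block_later_tags=fields[6] == "1",
        lemma_spec=fields[7].split("+"),
        always=fields[8] == "1",
        retok=None if fields[9] == "-" else fields[9],
    )


rule = parse_rule("les * ^V * 0 1 0 L 1 $$+les:$$+PP")
print(rule.affix, rule.additions, rule.always, rule.retok)
```

Keeping field 2 as a list of alternatives makes it straightforward to try each reconstructed root in turn.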

For example, a prefix rule:
anti * ^NC AQ0CN0 0 0 1 A+L 0 -

This prefix rule states that anti should be removed from the beginning of the word, nothing (*) should be added, and the resulting root should be found in the dictionary with an NC PoS tag. If that is satisfied, the word receives the AQ0CN0 tag, and its lemma is the affix (anti) plus the lemma of the dictionary entry. For instance, the word antimisiles would match this rule: misiles would be found in the dictionary with lemma misil and PoS NCMP000. The word is then assigned the lemma antimisil (A+L = anti+misil) and the tag AQ0CN0.
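
The walk-through above can be reproduced with a minimal Python sketch. It is not the analyzer's implementation: apply_prefix_rule and TOY_DICT are hypothetical, the dictionary holds only the single entry mentioned in the text, and the sketch assumes that a non-empty field 2 would be prepended to the root (the example rule uses *, so that branch is not exercised here).

```python
import re

# Toy dictionary: word form -> list of (lemma, tag). Illustrative data only.
TOY_DICT = {"misiles": [("misil", "NCMP000")]}


def apply_prefix_rule(form, rule_line, dictionary):
    """Return the (lemma, tag) analyses that the rule yields for `form`."""
    fields = rule_line.split()
    affix, addition, condition, out_tag = fields[0], fields[1], fields[2], fields[3]
    lemma_spec = fields[7]

    if not form.startswith(affix):
        return []
    root = form[len(affix):]
    if addition != "*":
        root = addition + root  # assumption: added string goes in front

    analyses = []
    for lemma, tag in dictionary.get(root, []):
        # Field 3 is a regular expression checked against the entry's tag.
        if not re.search(condition, tag):
            continue
        # Field 8: replace F, R, L, A by their values, keep literals as-is.
        pieces = {"F": form, "R": root, "L": lemma, "A": affix}
        new_lemma = "".join(pieces.get(p, p) for p in lemma_spec.split("+"))
        # Field 4: * keeps the tag of the dictionary entry.
        new_tag = tag if out_tag == "*" else out_tag
        analyses.append((new_lemma, new_tag))
    return analyses


print(apply_prefix_rule("antimisiles", "anti * ^NC AQ0CN0 0 0 1 A+L 0 -", TOY_DICT))
# [('antimisil', 'AQ0CN0')]
```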

For example, suffix rules:
cecita z|za ^NCFS NCFS00A 0 0 1 L 0 -
les * ^V * 0 1 0 L 1 $$+les:$$+PP

The first suffix rule above (cecita) is applied only to unknown words, to see whether a valid feminine singular noun is obtained when substituting the suffix cecita with z or za. This is the case of crucecita (diminutive of cruz). If such a base form is found, the original word is analyzed as a diminutive suffixed form. No retokenization is performed.
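
The substitution step of the cecita rule can be sketched as follows. The function names and TOY_DICT are hypothetical, and the full tag NCFS000 for cruz is an assumption made for illustration (the text only states that cruz is NCFS).

```python
import re

# Toy dictionary: word form -> list of (lemma, tag). Illustrative data only.
TOY_DICT = {"cruz": [("cruz", "NCFS000")]}


def candidate_roots(form, affix, additions):
    """Strip `affix` from the end of `form` and try each alternative of field 2."""
    if not form.endswith(affix):
        return []
    stem = form[: len(form) - len(affix)]
    return [stem + ("" if alt == "*" else alt) for alt in additions.split("|")]


def matching_roots(form, affix, additions, condition, dictionary):
    """Keep only the candidate roots that have an entry whose tag meets field 3."""
    return [
        root
        for root in candidate_roots(form, affix, additions)
        if any(re.search(condition, tag) for _lemma, tag in dictionary.get(root, []))
    ]


print(candidate_roots("crucecita", "cecita", "z|za"))
# ['cruz', 'cruza']
print(matching_roots("crucecita", "cecita", "z|za", "^NCFS", TOY_DICT))
# ['cruz']
```

Only cruz survives the dictionary check, so crucecita would receive the tag NCFS00A and the lemma cruz (field 8 is L).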

The second rule (les) applies to all words and checks whether a valid verb form is obtained when removing the suffix les. This is the case of words such as viles (which may mean "I saw them", but is also the plural of the adjective vil). In this case, the retokenization info states that if the verb tag is eventually selected for this word, it may be retokenized into two words: the base verb form (referred to as $$, vi in the example) plus the word les. The tags for these new words are given after the colon: the base form must keep its PoS tag (this is what the second $$ means), and the second word may take any tag starting with PP that it has in the dictionary.

So the word viles would obtain its adjective analysis from the dictionary, plus its verb + clitic pronoun analysis from the suffix rule:
viles vil AQ0CP0 ver VMIS1S0

The second analysis will carry the retokenization information, so if eventually the PoS tagger selects the VMI analysis (and the TaggerRetokenize option is set), the word will be retokenized into:

   vi ver VMIS1S0
   les ellos PP3CPD00
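
How the retokenization info $$+les:$$+PP leads to the two words above can be sketched in Python. This is a reading of the description in this section, not the tagger's code: retokenize and TOY_DICT are hypothetical, and the sketch assumes that the part before the colon lists the new forms, the part after it lists their tags in the same order, and that a tag other than $$ is matched as a prefix against the dictionary tags of that form.

```python
# Toy dictionary: word form -> list of (lemma, tag). Illustrative data only.
TOY_DICT = {"les": [("ellos", "PP3CPD00")]}


def retokenize(base_form, base_lemma, base_tag, retok, dictionary):
    """Expand retokenization info into a list of (form, analyses) pairs."""
    forms_spec, tags_spec = retok.split(":")
    tokens = []
    for form_part, tag_part in zip(forms_spec.split("+"), tags_spec.split("+")):
        # $$ on the form side stands for the base form (word minus affix).
        form = base_form if form_part == "$$" else form_part
        if tag_part == "$$":
            # $$ on the tag side: the base form keeps its selected analysis.
            analyses = [(base_lemma, base_tag)]
        else:
            # Otherwise keep any dictionary analysis whose tag starts with it.
            analyses = [
                (lemma, tag)
                for lemma, tag in dictionary.get(form, [])
                if tag.startswith(tag_part)
            ]
        tokens.append((form, analyses))
    return tokens


for form, analyses in retokenize("vi", "ver", "VMIS1S0", "$$+les:$$+PP", TOY_DICT):
    print(form, analyses)
# vi [('ver', 'VMIS1S0')]
# les [('ellos', 'PP3CPD00')]
```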

Lluís Padró 2010-09-02