Format for .src dictionary files

Each line in the file holds one word form followed by its list of lemma-PoS pairs. That is, each line has the format: form lemma1 PoS1 lemma2 PoS2 ...
E.g.:
casa casa NCFS000 casar VMIP3S0 casar VMM02S0
backs back NNS back VBZ
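
As an illustration of the format above, a line can be read into a form plus a list of (lemma, PoS) pairs. This is only a minimal sketch: parse_entry is a hypothetical helper, not part of any dictionary tool, and it assumes the line is a regular entry rather than a contraction.

```python
def parse_entry(line):
    """Split one dictionary line into its form and a list of (lemma, PoS) pairs."""
    fields = line.split()
    form, rest = fields[0], fields[1:]
    # After the form, the remaining fields must come in lemma/PoS pairs.
    if not rest or len(rest) % 2 != 0:
        raise ValueError(f"malformed entry: {line!r}")
    analyses = list(zip(rest[0::2], rest[1::2]))
    return form, analyses


form, analyses = parse_entry("casa casa NCFS000 casar VMIP3S0 casar VMM02S0")
print(form)
for lemma, pos in analyses:
    print("  ", lemma, pos)
```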

Whitespace acts as the field separator, so make sure there is no extra whitespace between fields or at the end of the line.
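
A quick way to catch stray whitespace before using a file is to check each line against a strict pattern. The sketch below assumes that fields are separated by exactly one space character (the text above does not say whether tabs are accepted), and has_clean_whitespace is a hypothetical name.

```python
import re

# One or more non-whitespace fields, separated by exactly one space each,
# with nothing before the first field or after the last one.
CLEAN_LINE = re.compile(r"\S+(?: \S+)*")


def has_clean_whitespace(line):
    """Return True if the line (without its newline) has no extra whitespace."""
    return CLEAN_LINE.fullmatch(line) is not None


for sample in ["backs back NNS back VBZ", "backs  back NNS", "backs back NNS "]:
    print(repr(sample), has_clean_whitespace(sample))
```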

Lines for words that are contractions may use an alternative format if the contraction is to be split. The format is: form form1+form2+... PoS1+PoS2+...
For instance:
del de+el SPS+DA

This line states that whenever the form del is found, it is replaced with two words: de and el. Each of the two new word forms is looked up in the dictionary and assigned any tag matching its corresponding tag in the third field. So, de will be assigned all tags starting with SPS that its entry has in the dictionary, and el will get any tag starting with DA.
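
The lookup described above can be sketched as follows. The dictionary contents are illustrative only, and split_contraction is a hypothetical helper; the sketch assumes a contraction line always has exactly three fields.

```python
# Toy dictionary: form -> list of (lemma, PoS) analyses. Entries are illustrative.
DICTIONARY = {
    "de": [("de", "SPS00"), ("de", "NCFS000")],
    "el": [("el", "DA0MS0")],
}


def split_contraction(line, dictionary):
    """Expand a contraction line into its parts, keeping only matching analyses."""
    form, parts, tags = line.split()
    result = []
    # Pair each new form with the tag prefix in the same position.
    for part, prefix in zip(parts.split("+"), tags.split("+")):
        kept = [(lemma, pos) for lemma, pos in dictionary.get(part, [])
                if pos.startswith(prefix)]
        result.append((part, kept))
    return result


for part, kept in split_contraction("del de+el SPS+DA", DICTIONARY):
    print(part, kept)
```

Here de keeps only its analysis whose tag starts with SPS, and el keeps the one starting with DA.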

Note that a contraction cannot be split in two different ways corresponding to different forms (e.g. he's = he+is | he+has), so only one combination of forms and one combination of tags may appear in the dictionary.

Nevertheless, a set of tags may be specified for a given form, e.g.:
he'd he+'d PRP+VB/MD

This will produce two words: he with its PRP analysis, and 'd with its analyses matching either of the two given tags (i.e. have_VBZ and would_MD). Note that this will work only if the form 'd is found in the dictionary with those possible analyses.

If all tags for one of the new forms are to be used, a wildcard may be written as a tag. E.g.:
pal para+el SPS+*

This will replace pal with two words: para with only its SPS analysis, and el with all its possible tags.
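
The three kinds of tag specification described above (a single tag, a set of tags separated by "/", and the "*" wildcard) can be handled by one matching function. This is a sketch under assumptions: tag_matches and select_analyses are hypothetical names, the sample analyses are illustrative, and each alternative in a set is assumed to match as a prefix, as single tags do.

```python
def tag_matches(pos, spec):
    """Check one PoS tag against one '+'-separated component of the tag field."""
    if spec == "*":
        return True  # wildcard: every analysis is kept
    # Assumption: each '/'-separated alternative is matched as a prefix.
    return any(pos.startswith(alt) for alt in spec.split("/"))


def select_analyses(analyses, spec):
    """Keep the (lemma, PoS) pairs whose tag satisfies the specification."""
    return [(lemma, pos) for lemma, pos in analyses if tag_matches(pos, spec)]


# Illustrative analyses, not taken from a real dictionary.
el = [("el", "DA0MS0"), ("el", "NCMS000")]
print(select_analyses(el, "DA"))      # only tags starting with DA
print(select_analyses(el, "*"))       # wildcard keeps everything
print(tag_matches("MD", "VB/MD"))     # set of tags: either alternative may match
```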

Lluís Padró 2010-09-02