Part-of-Speech Tagger Module
There are two different modules able to perform PoS tagging. The application should decide which method is to be used, and instantiate the right class.
The first PoS tagger is the hmm_tagger class, which is a classical trigram Markovian tagger, following [Bra00].
The second module, named relax_tagger, is a hybrid system capable of integrating statistical and hand-coded knowledge, following [Pad98].
The hmm_tagger module is somewhat faster than relax_tagger, but the latter allows you to add manual constraints to the model. The hmm_tagger API is the following:
class hmm_tagger: public POS_tagger {
  public:
    /// Constructor
    hmm_tagger(const std::string &, const std::string &, bool, unsigned int);
    /// disambiguate given sentences
    void analyze(std::list<sentence> &);
};
The hmm_tagger constructor receives the following parameters:
- The language code: Used to determine if the language uses an EAGLES tagset, and to properly shorten the PoS tags in that case.
- The HMM file, which contains the model parameters. The format of the file is described below. This file can be generated from a tagged corpus using the script src/utilities/TRAIN provided in the FreeLing package. See the comments in the script file to find out which format the corpus is expected to have.
- A boolean stating whether words that carry retokenization information (e.g. set by the dictionary or affix handling modules) must be retokenized (that is, split into two or more words) after the tagging.
- An integer stating whether and when the tagger must select only one analysis in case of ambiguity. Possible values are: FORCE_NONE (or 0): no selection is forced, and words that are still ambiguous after tagging remain ambiguous. FORCE_TAGGER (or 1): force selection immediately after tagging, and before retokenization. FORCE_RETOK (or 2): force selection after retokenization.
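As a rough illustration of how the four constructor arguments fit together, the sketch below uses a stand-in class with the same constructor signature as hmm_tagger, so that it compiles without FreeLing installed. The stand-in class, its query methods, the helper function, the language code "es" and the file name "tagger.dat" are all assumptions made for the example. Only the numeric values of the FORCE_* constants are taken from the list above.

```cpp
#include <string>

// Numeric values as listed above. In a real program these constants
// come from the FreeLing headers, not from a local enum.
enum : unsigned int { FORCE_NONE = 0, FORCE_TAGGER = 1, FORCE_RETOK = 2 };

// Stand-in (NOT the FreeLing class): it has the same constructor
// signature as hmm_tagger, but it only records its arguments.
class hmm_tagger_standin {
  public:
    hmm_tagger_standin(const std::string &lang, const std::string &hmm_file,
                       bool retok, unsigned int force)
        : lang_(lang), hmm_file_(hmm_file), retok_(retok), force_(force) {}

    // True when a single analysis is forced at some point.
    bool forces_selection() const { return force_ != FORCE_NONE; }
    // True when the selection happens before words are split.
    bool selects_before_retok() const { return force_ == FORCE_TAGGER; }
    // True when words carrying retokenization information are split.
    bool retokenizes() const { return retok_; }

  private:
    std::string lang_;      // language code
    std::string hmm_file_;  // path to the HMM model file
    bool retok_;            // retokenize after tagging?
    unsigned int force_;    // FORCE_NONE, FORCE_TAGGER or FORCE_RETOK
};

// Hypothetical helper: one plausible combination of settings.
hmm_tagger_standin make_example_tagger() {
    return hmm_tagger_standin("es", "tagger.dat", true, FORCE_TAGGER);
}
```

With the real class, the same four arguments would be passed to hmm_tagger in the same order, and analyze would then be called on a std::list<sentence>.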
The relax_tagger module can be tuned with hand-written constraints, but is about twice as slow as hmm_tagger. Its API is the following:
class relax_tagger : public POS_tagger {
  public:
    /// Constructor, given the constraint file and config parameters
    relax_tagger(const std::string &, int, double, double, bool, unsigned int);
    /// disambiguate sentences
    void analyze(std::list<sentence> &);
};
The relax_tagger constructor receives the following parameters:
- The constraint file. The format of the file is described below. This file can be generated from a tagged corpus using the script src/utilities/TRAIN provided in the FreeLing package. See the comments in the script file to find out which format the corpus is expected to have.
- An integer stating the maximum number of iterations to wait for
convergence before stopping the disambiguation algorithm.
- A real number representing the scale factor of the constraint weights.
- A real number representing the threshold under which any changes
will be considered too small. Used to detect convergence.
- A boolean stating whether words that carry retokenization information (e.g. set by the dictionary or affix handling modules) must be retokenized (that is, split into two or more words) after the tagging.
- An integer stating whether and when the tagger must select only one analysis in case of ambiguity. Possible values are: FORCE_NONE (or 0): no selection is forced, and words that are still ambiguous after tagging remain ambiguous. FORCE_TAGGER (or 1): force selection immediately after tagging, and before retokenization. FORCE_RETOK (or 2): force selection after retokenization.
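Since both taggers derive from POS_tagger, an application can decide at run time which one to instantiate and then use it through the base class. The sketch below shows that pattern with stand-in classes that copy only the two constructor signatures given above, so it compiles without FreeLing. All class names ending in _standin, the make_tagger helper, the file names and the numeric argument values are placeholders chosen for the example, not recommended settings.

```cpp
#include <memory>
#include <string>

// Stand-ins (NOT the FreeLing classes): they mirror the class hierarchy
// and the constructor signatures shown above, and nothing else.
class POS_tagger_standin {
  public:
    virtual ~POS_tagger_standin() = default;
    virtual std::string name() const = 0;
};

class hmm_standin : public POS_tagger_standin {
  public:
    // language code, HMM file, retokenize flag, force-selection mode
    hmm_standin(const std::string &, const std::string &, bool, unsigned int) {}
    std::string name() const override { return "hmm"; }
};

class relax_standin : public POS_tagger_standin {
  public:
    // constraint file, max iterations, scale factor, threshold,
    // retokenize flag, force-selection mode
    relax_standin(const std::string &, int, double, double, bool, unsigned int) {}
    std::string name() const override { return "relax"; }
};

// Hypothetical helper: pick the relaxation tagger only when manual
// constraints are needed, otherwise prefer the faster HMM tagger.
std::unique_ptr<POS_tagger_standin> make_tagger(bool need_manual_constraints) {
    const unsigned int force_tagger = 1;  // FORCE_TAGGER, per the list above
    if (need_manual_constraints) {
        return std::unique_ptr<POS_tagger_standin>(
            new relax_standin("constr.dat", 500, 670.0, 0.001, true, force_tagger));
    }
    return std::unique_ptr<POS_tagger_standin>(
        new hmm_standin("es", "tagger.dat", true, force_tagger));
}
```

The rest of the application then only needs a pointer to the base class, so switching taggers does not affect the code that calls analyze.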
The iteration limit, scale factor, and threshold parameters are specific to the relaxation labelling algorithm. Refer to [Pad98] for details.
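To give an intuition of what these three parameters control, the following sketch runs a generic, textbook-style relaxation labelling update on the candidate tags of a single word. It is not FreeLing's implementation: the function name is made up, the support values are held fixed (in the real algorithm they are recomputed from the constraints and the neighbouring words at every iteration), and the assumption that the scale factor divides the raw constraint support is only one plausible reading of the parameter description above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative relaxation update for the candidate tags of ONE word.
//   weights   current weight of each candidate tag (expected to sum to 1)
//   support   raw constraint support for each tag (fixed here for simplicity)
//   max_iter  maximum number of iterations before giving up on convergence
//   scale     scale factor applied to the raw support (assumed to divide it)
//   epsilon   changes smaller than this count as convergence
std::vector<double> relax_one_word(std::vector<double> weights,
                                   const std::vector<double> &support,
                                   int max_iter, double scale, double epsilon) {
    for (int it = 0; it < max_iter; ++it) {
        std::vector<double> next(weights.size());
        double total = 0.0;
        for (std::size_t j = 0; j < weights.size(); ++j) {
            // Scale the support and clamp it to [-1, 1] so that
            // the updated weight can never become negative.
            double s = std::max(-1.0, std::min(1.0, support[j] / scale));
            next[j] = weights[j] * (1.0 + s);
            total += next[j];
        }
        if (total <= 0.0) break;  // degenerate case: keep the current weights

        // Normalize and measure the largest change in this iteration.
        double max_change = 0.0;
        for (std::size_t j = 0; j < next.size(); ++j) {
            next[j] /= total;
            max_change = std::max(max_change, std::fabs(next[j] - weights[j]));
        }
        weights = next;
        if (max_change < epsilon) break;  // converged
    }
    return weights;
}
```

In this sketch, a larger scale factor makes each step smaller, a smaller threshold demands more iterations before stopping, and the iteration limit bounds the running time when the weights keep changing.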
Lluís Padró
2010-09-02