Number Detection Module

The number detection module is language dependent: it recognizes numerical expressions (e.g., 1,220.54 or two-hundred sixty-five) and assigns them a normalized value as lemma.

The module is basically a finite-state automaton that recognizes valid numerical expressions. The structure of the automaton and the actions that compute the actual numerical value are different for each language, so each supported language has its own implementation of the automaton.
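
To illustrate the idea of an automaton whose transitions carry value-computing actions, here is a minimal sketch for English number words from 1 to 999. It is not the module's actual implementation: the state names, the word tables, and the function parse_number_words are assumptions made up for this example.

```cpp
#include <algorithm>
#include <map>
#include <sstream>
#include <string>

// Toy automaton for English number words from 1 to 999.
// State names and word tables are illustrative assumptions.
enum State { START, UNIT, HUNDRED, TENS, END };

// Returns true and stores the computed value when `text` is a valid
// expression; on failure the content of `value` is unspecified.
bool parse_number_words(std::string text, long &value) {
  static const std::map<std::string, int> units = {
      {"one", 1}, {"two", 2}, {"three", 3}, {"four", 4}, {"five", 5},
      {"six", 6}, {"seven", 7}, {"eight", 8}, {"nine", 9}};
  static const std::map<std::string, int> teens = {
      {"ten", 10}, {"eleven", 11}, {"twelve", 12}, {"thirteen", 13},
      {"fourteen", 14}, {"fifteen", 15}, {"sixteen", 16},
      {"seventeen", 17}, {"eighteen", 18}, {"nineteen", 19}};
  static const std::map<std::string, int> tens = {
      {"twenty", 20}, {"thirty", 30}, {"forty", 40}, {"fifty", 50},
      {"sixty", 60}, {"seventy", 70}, {"eighty", 80}, {"ninety", 90}};

  // Treat hyphens as word separators: "sixty-five" -> "sixty five".
  std::replace(text.begin(), text.end(), '-', ' ');
  std::istringstream in(text);
  std::string w;
  State st = START;
  value = 0;

  while (in >> w) {
    const bool is_unit = units.count(w) > 0;
    const bool is_teen = teens.count(w) > 0;
    const bool is_ten = tens.count(w) > 0;
    switch (st) {
      case START:  // nothing read yet
        if (is_unit) { value = units.at(w); st = UNIT; }
        else if (is_teen) { value = teens.at(w); st = END; }
        else if (is_ten) { value = tens.at(w); st = TENS; }
        else return false;
        break;
      case UNIT:  // a leading unit may only be followed by "hundred"
        if (w == "hundred") { value *= 100; st = HUNDRED; }
        else return false;
        break;
      case HUNDRED:  // after "<unit> hundred"
        if (is_unit) { value += units.at(w); st = END; }
        else if (is_teen) { value += teens.at(w); st = END; }
        else if (is_ten) { value += tens.at(w); st = TENS; }
        else return false;
        break;
      case TENS:  // after "twenty" .. "ninety"
        if (is_unit) { value += units.at(w); st = END; }
        else return false;
        break;
      case END:  // no word may follow a complete expression
        return false;
    }
  }
  return st != START;  // every state except START is accepting
}
```

With this sketch, parse_number_words("two-hundred sixty-five", v) returns true and sets v to 265, while "hundred two" is rejected. A different language would need different word tables and a different transition structure, which is why such an automaton cannot simply be shared across languages.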

For languages that do not have a specific automaton implemented, a generic module is used to recognize number-like expressions that contain numerical digits.

For the reasons described so far, there are no options and no configuration file to provide when the class is instantiated. The API of the class is:

  
class numbers {
   public:
      /// Constructor: receives the language code, and the decimal 
      /// and thousand point symbols
      numbers(const std::string &, const std::string &, const std::string &); 

      /// Detect number expressions in given sentence
      void annotate(sentence &);
};

The constructor expects three parameters, in this order: the language code, the decimal point symbol, and the thousands separator symbol.

The last two parameters are needed because in some Latin languages the comma is used as the decimal separator and the dot as the thousands mark, while in languages like English it is the other way round. These parameters make it possible to specify which character is expected in each of these roles. They will usually be comma and dot, but any character could be used.
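
The effect of the two separator parameters can be sketched with a small standalone function. This is only an illustration under stated assumptions: normalize_number is a hypothetical helper, not part of the numbers class, and the output format (no thousands mark, dot as decimal point) is chosen for the example rather than taken from the library.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: rewrites a digit-based number so that it has no
// thousands separator and uses '.' as decimal point, given the symbols
// used in the input. Returns an empty string if the token contains an
// unexpected character. Digit group sizes are not validated.
std::string normalize_number(const std::string &token,
                             const std::string &decimal,
                             const std::string &thousand) {
  std::string out;
  bool seen_decimal = false;
  std::size_t i = 0;
  while (i < token.size()) {
    if (std::isdigit(static_cast<unsigned char>(token[i]))) {
      out += token[i];
      ++i;
    } else if (!seen_decimal && !decimal.empty() &&
               token.compare(i, decimal.size(), decimal) == 0) {
      // First decimal symbol: emit '.' and stop accepting separators.
      out += '.';
      seen_decimal = true;
      i += decimal.size();
    } else if (!seen_decimal && !thousand.empty() &&
               token.compare(i, thousand.size(), thousand) == 0) {
      // Thousands symbol before the decimal part: drop it.
      i += thousand.size();
    } else {
      return "";
    }
  }
  return out;
}
```

Under these assumptions, normalize_number("1.220,54", ",", ".") and normalize_number("1,220.54", ".", ",") both return "1220.54": the same text is read correctly only when the symbols passed match the convention of the language being processed.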

Lluís Padró 2010-09-02