The first module in the chain is the tokenizer. It converts plain text into a list of word objects, according to a set of tokenization rules.
Tokenization rules are regular expressions that are matched against the beginning of the text line being processed. The first matching rule is used to extract the token, the matching substring is deleted from the line, and the process is repeated until the line is empty.
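As an illustration only (not the module's actual implementation), the first-match-extract-repeat loop can be sketched as follows; the rule list, the anchored matching with std::regex, and the fallback behaviour for unmatched characters are simplifying assumptions:

#include <list>
#include <regex>
#include <string>

// Simplified sketch: each rule is a regular expression anchored at the start
// of the remaining text; the first rule that matches produces a token.
std::list<std::string> tokenize_line(const std::string &line,
                                     const std::list<std::regex> &rules) {
  std::list<std::string> tokens;
  std::string rest = line;
  while (!rest.empty()) {
    bool matched = false;
    for (const auto &rule : rules) {
      std::smatch m;
      // match_continuous forces the match to start at the beginning of 'rest'
      if (std::regex_search(rest, m, rule,
                            std::regex_constants::match_continuous)
          && m.length() > 0) {
        tokens.push_back(m.str());     // extract the token
        rest.erase(0, m.length());     // delete the matching substring from the line
        matched = true;
        break;                         // restart with the first rule
      }
    }
    if (!matched) rest.erase(0, 1);    // no rule matched: drop one character and continue
  }
  return tokens;
}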
The API of the tokenizer module is the following:
class tokenizer {
  public:
    /// Constructor, receives the name of the file with tokenization rules
    tokenizer(const std::string &);
    /// tokenize string with default options
    std::list<word> tokenize(const std::string &);
    /// tokenize string with default options, accumulating byte-offset of words
    std::list<word> tokenize(const std::string &, unsigned long &);
};
That is, once created, the tokenizer module receives plain text in a string, tokenizes it, and returns a list of word objects corresponding to the created tokens.
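A minimal usage sketch is shown below. The rules file name "tokenizer.dat", the library header, and the get_form() accessor on the word class are illustrative assumptions, not confirmed by this section:

#include <iostream>
#include <list>
#include <string>
// #include the library header that declares the tokenizer and word classes

int main() {
  // Create the tokenizer from a tokenization rules file (path is illustrative)
  tokenizer tk("tokenizer.dat");

  // Tokenize a plain-text string into a list of word objects
  std::list<word> ws = tk.tokenize("Hello world, this is a sentence.");

  // get_form() is assumed here to return the token's text
  for (const word &w : ws)
    std::cout << w.get_form() << std::endl;

  // Variant accumulating the byte offset of each word
  unsigned long offset = 0;
  std::list<word> ws2 = tk.tokenize("Another line of text.", offset);

  return 0;
}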