Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
#include <tokenizer.h>
Public Member Functions

tokenizer (const std::string &)
    Constructor.
void tokenize (const std::string &, std::list< word > &)
    Tokenize string.
std::list< word > tokenize (const std::string &)
    Tokenize string, return result as list.
void tokenize (const std::string &, unsigned long &, std::list< word > &)
    Tokenize string, tracking offset.
std::list< word > tokenize (const std::string &, unsigned long &)
    Tokenize string, tracking offset, return result as list.

Private Attributes

std::set< std::string > abrevs
    Abbreviations set (Dr., Mrs., etc.; the trailing period is not separated).
std::list< std::pair< std::string, RegEx > > rules
    Tokenization rules.
std::map< std::string, int > matches
    Substrings to convert into tokens in each rule.
Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
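
A minimal usage sketch follows. The configuration file name and the word::get_form() accessor are assumptions not documented on this page; the exact headers required may depend on the installation.

#include <tokenizer.h>
#include <iostream>
#include <list>
#include <string>

int main() {
  // Load abbreviations and tokenization rules from a (hypothetical) file.
  tokenizer tk("tokenizer.dat");

  // Split a sentence into word objects and print one token per line.
  std::list<word> ws = tk.tokenize("Dr. Smith arrived at five.");
  for (std::list<word>::const_iterator w = ws.begin(); w != ws.end(); ++w)
    std::cout << w->get_form() << std::endl;  // get_form() assumed from the word class
  return 0;
}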
tokenizer::tokenizer (const std::string & TokFile)

Constructor.

Create a tokenizer, using the abbreviation and patterns file indicated in the given options.

References abrevs, ERROR_CRASH, matches, rules, and TRACE.
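
For illustration, construction only needs the path to a rules file; the file names below are hypothetical placeholders. The References line above suggests that problems with the file are reported through ERROR_CRASH rather than a C++ exception, so it is worth checking the path beforehand.

// Hypothetical per-language configuration files.
tokenizer tk_en("tokenizer-en.dat");   // English abbreviation and rule set
tokenizer tk_es("tokenizer-es.dat");   // Spanish abbreviation and rule set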
std::list<word> tokenizer::tokenize (const std::string &, unsigned long &)

Tokenize string, tracking offset, return result as list.
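
A sketch of how the offset parameter can carry a running character position across successive calls, so that each returned word is positioned relative to the whole document rather than to its own fragment. The fragment texts and configuration file name are made up for illustration.

tokenizer tk("tokenizer.dat");          // hypothetical configuration file
unsigned long offset = 0;
std::list<word> first  = tk.tokenize("First sentence. ", offset);
std::list<word> second = tk.tokenize("Second sentence.", offset);
// offset has been advanced by each call, so tokens in 'second' are
// positioned past the end of the first fragment.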
void tokenizer::tokenize (const std::string &, unsigned long &, std::list< word > &)

Tokenize string, tracking offset.
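
The output-parameter variant fills a caller-owned list instead of returning one. Whether an existing list is cleared or appended to is not stated on this page, so this sketch uses a fresh list for the call.

std::list<word> ws;
unsigned long offset = 0;
tk.tokenize("A fragment of running text.", offset, ws);
// 'ws' now holds the tokens; 'offset' points past the fragment.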
std::list<word> tokenizer::tokenize (const std::string &)

Tokenize string, return result as list.
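
The simplest overload, useful when character offsets are not needed. The behaviour for abbreviations follows from the class description: entries in the abbreviations set keep their trailing period.

std::list<word> ws = tk.tokenize("Mrs. Jones met Dr. Smith.");
// "Mrs." and "Dr." stay as single tokens, assuming both appear in the
// abbreviations file loaded by the constructor.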
void tokenizer::tokenize (const std::string &, std::list< word > &)

Tokenize string.
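
The same operation in output-parameter form, for callers that prefer to own the destination list:

std::list<word> ws;
tk.tokenize("Mrs. Jones met Dr. Smith.", ws);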
std::set<std::string> tokenizer::abrevs [private]

Abbreviations set (Dr., Mrs., etc.; the trailing period is not separated).
Referenced by tokenizer().
std::map<std::string,int> tokenizer::matches [private]

Substrings to convert into tokens in each rule.
Referenced by tokenizer().
std::list<std::pair<std::string,RegEx> > tokenizer::rules [private]

Tokenization rules.
Referenced by tokenizer().
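
To make the relationship between these three members concrete, here is a heavily simplified, standalone sketch of how a rule list and a matches map could drive tokenization. It is not the class's actual code: RegEx is replaced by std::regex for self-containment, the function name apply_rules is invented, and the convention that a count of 0 means "the whole match is one token" is an assumption.

#include <list>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> apply_rules(
    const std::list<std::pair<std::string, std::regex> >& rules,
    const std::map<std::string, int>& matches,
    const std::string& text) {
  std::vector<std::string> tokens;
  std::string::const_iterator p = text.begin();
  while (p != text.end()) {
    std::smatch m;
    bool hit = false;
    // Try each rule in order; the first one matching at the current
    // position decides how the next characters are tokenized.
    for (std::list<std::pair<std::string, std::regex> >::const_iterator r =
             rules.begin(); r != rules.end() && !hit; ++r) {
      if (std::regex_search(p, text.end(), m, r->second,
                            std::regex_constants::match_continuous)) {
        // 'matches' records how many sub-expressions of the rule become
        // tokens; 0 (or no entry) is taken here to mean "the whole match".
        std::map<std::string, int>::const_iterator n = matches.find(r->first);
        int subs = (n == matches.end()) ? 0 : n->second;
        if (subs == 0) {
          tokens.push_back(m.str(0));
        } else {
          for (int i = 1; i <= subs; ++i) tokens.push_back(m.str(i));
        }
        p = m[0].second;   // continue right after the matched text
        hit = true;
      }
    }
    if (!hit) ++p;         // skip characters no rule accounts for
  }
  return tokens;
}

A real rule list would, for example, pair a rule name with a regular expression for URLs or numbers, and the matches entry for that name would say how many of its sub-expressions become tokens; the exact file format that feeds these members is defined by the configuration file mentioned in the class description.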