tokenizer Class Reference

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.

#include <tokenizer.h>

Public Member Functions

 tokenizer (const std::string &)
 Constructor.
void tokenize (const std::string &, std::list< word > &)
 tokenize string
std::list< word > tokenize (const std::string &)
 tokenize string, return result as list
void tokenize (const std::string &, unsigned long &, std::list< word > &)
 tokenize string, tracking offset
std::list< word > tokenize (const std::string &, unsigned long &)
 tokenize string, tracking offset, return result as list

Private Attributes

std::set< std::string > abrevs
 abbreviations set (Dr., Mrs., etc.; the trailing period is not split off as a separate token)
std::list< std::pair< std::string, RegEx > > rules
 tokenization rules
std::map< std::string, int > matches
 substrings to convert into tokens in each rule

Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects according to a set of tokenization rules read from a configuration file.
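
The following is a minimal, self-contained sketch of the general technique described above: an ordered list of (name, pattern) rules is tried at the current position, and an abbreviation set decides whether a trailing period stays attached to the token. It is not FreeLing's implementation. It uses std::regex instead of the RegEx class, plain strings instead of word objects, and the names RuleList and sketch_tokenize are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <list>
#include <regex>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the rule list: (rule name, pattern) pairs, tried in order.
using RuleList = std::vector<std::pair<std::string, std::regex>>;

// Split `text` into token strings. Whitespace is skipped; at each position
// the first rule that matches there produces the next token.
std::list<std::string> sketch_tokenize(const std::string& text,
                                       const RuleList& rules,
                                       const std::set<std::string>& abbrevs) {
    std::list<std::string> out;
    auto it = text.cbegin();
    while (it != text.cend()) {
        if (std::isspace(static_cast<unsigned char>(*it))) { ++it; continue; }
        bool matched = false;
        for (const auto& rule : rules) {
            std::smatch m;
            // match_continuous anchors the match at the current position.
            if (std::regex_search(it, text.cend(), m, rule.second,
                                  std::regex_constants::match_continuous)
                && m.length(0) > 0) {
                assert(m.position(0) == 0);
                std::string tok = m.str(0);
                auto next = it + m.length(0);
                // Keep the period attached when "tok." is a known abbreviation.
                if (next != text.cend() && *next == '.' && abbrevs.count(tok + ".")) {
                    tok += '.';
                    ++next;
                }
                out.push_back(tok);
                it = next;
                matched = true;
                break;
            }
        }
        // No rule matched: emit the single character so the loop always advances.
        if (!matched) { out.push_back(std::string(1, *it)); ++it; }
    }
    return out;
}
```

With three illustrative rules (a word pattern `[A-Za-z]+`, a number pattern `[0-9]+`, and a catch-all `.`) and the abbreviation set {"Dr.", "Mrs."}, the input "Dr. Smith arrived." yields the tokens "Dr.", "Smith", "arrived", and ".". Rule order matters in this sketch, because the first matching rule wins.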


Constructor & Destructor Documentation

tokenizer::tokenizer ( const std::string &  TokFile  ) 

Constructor.

Create a tokenizer, using the abbreviations and patterns file given by TokFile.

References abrevs, ERROR_CRASH, matches, rules, and TRACE.


Member Function Documentation

std::list<word> tokenizer::tokenize ( const std::string &  ,
unsigned long &   
)

tokenize string, tracking offset, return result as list

void tokenizer::tokenize ( const std::string &  ,
unsigned long &  ,
std::list< word > &   
)

tokenize string, tracking offset
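
As an illustration of what offset tracking can look like, here is a small sketch that assumes the unsigned long & argument is a running character offset carried across successive calls, so that token positions refer to the whole input rather than to the current chunk. That reading is an assumption, not something this page states. The sketch splits on whitespace only, and SpanToken and sketch_tokenize_offset are hypothetical names, not part of the documented class.

```cpp
#include <cctype>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical token record: text plus [start, finish) positions in the whole input.
struct SpanToken {
    std::string form;
    unsigned long start;
    unsigned long finish;
};

// Append the whitespace-separated tokens of `chunk` to `out`, shifting their
// positions by `offset`, then advance `offset` past the chunk.
void sketch_tokenize_offset(const std::string& chunk,
                            unsigned long& offset,
                            std::list<SpanToken>& out) {
    std::size_t i = 0;
    while (i < chunk.size()) {
        while (i < chunk.size() && std::isspace(static_cast<unsigned char>(chunk[i]))) ++i;
        const std::size_t begin = i;
        while (i < chunk.size() && !std::isspace(static_cast<unsigned char>(chunk[i]))) ++i;
        if (i > begin) {
            out.push_back(SpanToken{chunk.substr(begin, i - begin),
                                    offset + static_cast<unsigned long>(begin),
                                    offset + static_cast<unsigned long>(i)});
        }
    }
    // Assumption of this sketch: the offset advances by the chunk length only;
    // a caller that strips line terminators would add them back itself.
    offset += static_cast<unsigned long>(chunk.size());
}

// Convenience overload returning the list, mirroring the two call shapes above.
std::list<SpanToken> sketch_tokenize_offset(const std::string& chunk,
                                            unsigned long& offset) {
    std::list<SpanToken> out;
    sketch_tokenize_offset(chunk, offset, out);
    return out;
}
```

Calling the sketch first on "Hello world" and then on " again" with the same offset variable places "again" at positions 12 to 17, which is where it sits in the concatenated text.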

std::list<word> tokenizer::tokenize ( const std::string &   ) 

tokenize string, return result as list

void tokenizer::tokenize ( const std::string &  ,
std::list< word > &   
)

tokenize string


Member Data Documentation

std::set<std::string> tokenizer::abrevs [private]

abbreviations set (Dr., Mrs., etc.; the trailing period is not split off as a separate token)

Referenced by tokenizer().

std::map<std::string,int> tokenizer::matches [private]

substrings to convert into tokens in each rule

Referenced by tokenizer().

std::list<std::pair<std::string,RegEx> > tokenizer::rules [private]

tokenization rules

Referenced by tokenizer().


The documentation for this class was generated from the following files:
Generated on Tue Jul 27 16:29:34 2010 for FreeLing by  doxygen 1.6.3