This section presents the options that can be given to the analyzer program (and thus also to the analyzer_server program and to the analyze script). All options can be written in the configuration file as well as on the command line. The latter always takes precedence over the former.
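The precedence rule above can be illustrated with a minimal Python sketch. It is not FreeLing code: it assumes the configuration file consists of simple Key=value lines, and the helper names parse_config and effective_options are hypothetical.

```python
def parse_config(text):
    """Parse Key=value lines into a dict (assumed file syntax)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def effective_options(config_options, command_line_options):
    """Command-line values override configuration-file values."""
    merged = dict(config_options)
    merged.update(command_line_options)
    return merged


config_text = "Lang=es\nTraceLevel=0\n"
from_file = parse_config(config_text)
from_cli = {"TraceLevel": "3"}  # as if "-l 3" had been given
result = effective_options(from_file, from_cli)
print(result)
```

Here Lang is taken from the file, while TraceLevel takes the command-line value.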
Command line | Configuration file |
-h , --help |
N/A |
Prints to stdout a help screen with valid options and exits.
Command line | Configuration file |
-f <filename> |
N/A |
Specify configuration file to use (default: analyzer.cfg).
Command line | Configuration file |
-l <int> , --tlevel <int> |
TraceLevel=<int> |
Set the trace level (0 = no trace, higher values = more trace), for debugging purposes.
This works only if the library was compiled with tracing information, using ./configure --enable-traces. Note that code compiled with tracing information is slower than code compiled without it, even when traces are not active.
Command line | Configuration file |
-m <mask> , --tmod <mask> |
TraceModule=<mask> |
Specify modules to trace. Each module is identified by a hexadecimal flag. Flags may be OR-ed together to specify the set of modules to be traced.
Valid masks are:
Module | Mask |
Splitter | 0x00000001 |
Tokenizer | 0x00000002 |
Morphological analyzer | 0x00000004 |
Options management | 0x00000008 |
Number detection | 0x00000010 |
Date identification | 0x00000020 |
Punctuation detection | 0x00000040 |
Dictionary search | 0x00000080 |
Suffixation rules | 0x00000100 |
Multiword detection | 0x00000200 |
Named entity detection | 0x00000400 |
Probability assignment | 0x00000800 |
Quantities detection | 0x00001000 |
Named entity classification | 0x00002000 |
Automata (abstract) | 0x00004000 |
PoS Tagger (abstract) | 0x00008000 |
HMM tagger | 0x00010000 |
Relaxation labelling | 0x00020000 |
RL tagger | 0x00040000 |
RL tagger constr. grammar | 0x00080000 |
Sense annotation | 0x00100000 |
Chart parser | 0x00200000 |
Parser grammar | 0x00400000 |
Dependency parser | 0x00800000 |
Coreference resolution | 0x01000000 |
Utilities | 0x02000000 |
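As an illustration of how flags combine, the following Python sketch ORs three masks taken from the table above. The dictionary and the function name combine_masks are hypothetical helpers, not part of FreeLing.

```python
# Masks copied from the table above (only a subset is listed here).
TRACE_MASKS = {
    "Tokenizer": 0x00000002,
    "Dictionary search": 0x00000080,
    "HMM tagger": 0x00010000,
}


def combine_masks(module_names):
    """OR together the flags of the requested modules."""
    mask = 0
    for name in module_names:
        mask |= TRACE_MASKS[name]
    return mask


mask = combine_masks(["Tokenizer", "Dictionary search", "HMM tagger"])
print(f"0x{mask:08x}")  # prints 0x00010082
```

The resulting value is the mask that selects exactly those three modules.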
Command line | Configuration file |
--lang <language> |
Lang=<language> |
Language of input text (es: Spanish, ca: Catalan, en: English, cy: Welsh, it: Italian, gl: Galician, pt: Portuguese).
Other languages may be added to the library. See chapter 6 for details.
Command line | Configuration file |
--flush , --noflush |
AlwaysFlush=(yes|y|on|no|n|off) |
When inactive (the most usual choice), the sentence splitter buffers lines until a sentence marker is found, and then outputs a complete sentence. When active, the splitter never buffers any token and treats each newline as a sentence end, thus processing each line as an independent sentence.
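The difference between the two modes can be sketched in Python as follows. This is a toy model, not the FreeLing splitter: it assumes pre-tokenized input and a hypothetical fixed set of sentence markers, whereas the real splitter is driven by its own options file.

```python
def split_sentences(lines, always_flush, markers=(".", "?", "!")):
    """Toy splitter: returns (complete sentences, tokens still buffered)."""
    sentences = []
    buffer = []
    for line in lines:
        for token in line.split():
            buffer.append(token)
            # Buffering mode: a sentence ends only at a marker token.
            if not always_flush and token in markers:
                sentences.append(buffer)
                buffer = []
        # Flush mode: every newline ends the current sentence.
        if always_flush and buffer:
            sentences.append(buffer)
            buffer = []
    return sentences, buffer


lines = ["El gato come", "pescado . El perro", "duerme ."]
buffered, pending = split_sentences(lines, always_flush=False)
flushed, _ = split_sentences(lines, always_flush=True)
print(len(buffered), len(flushed))  # prints 2 3
```

With buffering, the three input lines yield two sentences; with flushing, each line becomes its own sentence.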
Command line | Configuration file |
--inpf <string> |
InputFormat=<string> |
Format of input data (plain, token, splitted, morfo, tagged, sense).
Command line | Configuration file |
--outf <string> |
OutputFormat=<string> |
Format of output data (token, splitted, morfo, tagged, parsed, dep).
Command line | Configuration file |
--train |
N/A |
When this option (only available at command line) is specified, OutputFormat is forced to tagged and results are printed in the format:
word lemma tag # lemma1 tag1 lemma2 tag2 ...
that is, one word per line, with the selected lemma and tag as fields 2 and 3, a separator (#) and a list of all possible lemma-tag pairs for the word (including the selected one).
This format is expected by the training scripts. Thus, this option can be used to annotate a corpus, correct the output manually, and use it to retrain the taggers with the script src/utilities/TRAIN provided in the FreeLing package. See the comments in the script file for details about how to use it.
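A line in this format could be read with a short Python function such as the one below. This is only a sketch: it assumes fields are separated by whitespace, the function name parse_train_line is hypothetical, and the sample line (including its tags) is made up for illustration.

```python
def parse_train_line(line):
    """Split 'word lemma tag # lemma1 tag1 lemma2 tag2 ...' into its parts."""
    fields = line.split()
    word, lemma, tag = fields[:3]
    if fields[3] != "#":
        raise ValueError("expected '#' separator as fourth field")
    rest = fields[4:]
    # Pair up the remaining fields as (lemma, tag) candidates.
    candidates = list(zip(rest[0::2], rest[1::2]))
    return word, lemma, tag, candidates


# Made-up sample line; the tags are illustrative only.
sample = "casas casa NCFP000 # casa NCFP000 casar VMIP2S0"
word, lemma, tag, candidates = parse_train_line(sample)
print(word, lemma, tag, candidates)
```

The selected lemma-tag pair appears both as fields 2 and 3 and inside the candidate list, as described above.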
Command line | Configuration file |
--utf |
UseUTF=(yes|y|on|no|n|off) |
FreeLing internally uses latin1 ISO encoding. When this option is active, the input is assumed to be in utf8, and converted to latin1 before processing. The results are converted back to utf8 before being output. Note that this will not work with utf8 encodings that cannot be converted to latin1.
When using the server version (see chapter 5), you can choose whether it will be the server or the client who will perform the conversion.
Specifying --utf at the server command line means that the server expects utf8 input, so the client must send utf8 (thus, no conversion must be carried out by the client). Specifying --utf at the client command line means that the client will convert utf8 input to latin1 before sending the request to the server, so the server has to be expecting latin1 input (i.e. no conversion on the server side).
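The conversion described for this option, and its limitation, can be reproduced with Python's standard codecs. This sketch only illustrates the encodings involved; the function names are hypothetical and it is not how FreeLing implements the option.

```python
def to_latin1(utf8_bytes):
    """Convert utf8-encoded bytes to latin1-encoded bytes."""
    return utf8_bytes.decode("utf-8").encode("latin-1")


def to_utf8(latin1_bytes):
    """Convert latin1-encoded bytes back to utf8-encoded bytes."""
    return latin1_bytes.decode("latin-1").encode("utf-8")


def convertible(utf8_bytes):
    """True if every character of the input exists in latin1."""
    try:
        to_latin1(utf8_bytes)
        return True
    except UnicodeEncodeError:
        return False


text = "años".encode("utf-8")
round_trip = to_utf8(to_latin1(text))
print(round_trip == text)                # prints True
print(convertible("€".encode("utf-8")))  # prints False: no euro sign in latin1
```

Input containing characters outside latin1, such as the euro sign, cannot survive the conversion, which is the case the note above warns about.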
Command line | Configuration file |
--abrev <filename> |
TokenizerFile=<filename> |
File of tokenization rules. See section 3.1 for details.
Command line | Configuration file |
--fsplit <filename> |
SplitterFile=<filename> |
File of splitter options. See section 3.2 for details.
Command line | Configuration file |
--afx , --noafx |
AffixAnalysis=(yes|y|on|no|n|off) |
Whether to perform affix analysis on unknown words. Affix analysis applies a set of affixation rules to the word to check whether it is a derived form of a known word.
Command line | Configuration file |
-S <filename> , --fafx <filename> |
AffixFile=<filename> |
Affix rules file. See section 3.7.2 for details.
Command line | Configuration file |
--loc , --noloc |
MultiwordsDetection=(yes|y|on|no|n|off) |
Whether to perform multiword detection. Multiwords may be detected if a multiword file is provided (see the Multiword File option, below).
Command line | Configuration file |
-L <filename> , --floc <filename> |
LocutionsFile=<filename> |
Multiword definition file. See section 3.8 for details.
Command line | Configuration file |
--numb , --nonumb |
NumbersDetection=(yes|y|on|no|n|off) |
Whether to perform numerical expression detection. Deactivating this feature will affect the behaviour of the date/time and ratio/currency detection modules.
Command line | Configuration file |
--dec <string> |
DecimalPoint=<string> |
Specify the decimal point character for the number detection module (for instance, in English it is a dot, but in Spanish it is a comma).
Command line | Configuration file |
--thou <string> |
ThousandPoint=<string> |
Specify the thousand point character for the number detection module (for instance, in English it is a comma, but in Spanish it is a dot).
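The effect of the decimal and thousand characters can be illustrated with a small Python sketch. It is not the FreeLing number detection module; parse_number is a hypothetical helper that simply shows how the same digits read differently under each convention.

```python
def parse_number(text, decimal_point, thousand_point):
    """Interpret a numeric string using the given separator characters."""
    # Drop thousand separators first, then normalize the decimal separator.
    plain = text.replace(thousand_point, "").replace(decimal_point, ".")
    return float(plain)


english = parse_number("1,234.5", decimal_point=".", thousand_point=",")
spanish = parse_number("1.234,5", decimal_point=",", thousand_point=".")
print(english, spanish)  # prints 1234.5 1234.5
```

Both strings denote the same value once the separators are set to match the language of the input.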
Command line | Configuration file |
--punt , --nopunt |
PunctuationDetection=(yes|y|on|no|n|off) |
Whether to assign a PoS tag to punctuation signs.
Command line | Configuration file |
-M <filename> , --fpunct <filename> |
PunctuationFile=<filename> |
Punctuation symbols file. See section 3.5 for details.
Command line | Configuration file |
--date , --nodate |
DatesDetection=(yes|y|on|no|n|off) |
Whether to perform date and time expression detection.
Command line | Configuration file |
--quant , --noquant |
QuantitiesDetection=(yes|y|on|no|n|off) |
Whether to perform currency amounts, physical magnitudes, and ratio detection.
Command line | Configuration file |
-Q <filename> , --fqty <filename> |
QuantitiesFile=<filename> |
Quantity recognition configuration file. See section 3.10 for details.
Command line | Configuration file |
--dict , --nodict |
DictionarySearch=(yes|y|on|no|n|off) |
Whether to search word forms in the dictionary. Deactivating this feature also deactivates the AffixAnalysis option.
Command line | Configuration file |
-D <filename> , --fdict <filename> |
DictionaryFile=<filename> |
Dictionary database. Must be a Berkeley DB indexed file. See section 3.7 and chapter 6 for details.
Command line | Configuration file |
--prob , --noprob |
ProbabilityAssignment=(yes|y|on|no|n|off) |
Whether to compute a lexical probability for each tag of each word. Deactivating this feature will affect the behaviour of the PoS tagger.
Command line | Configuration file |
-P <filename> , --fprob <filename> |
ProbabilityFile=<filename> |
Lexical probabilities file. The probabilities in this file are used to compute the most likely tag for a word, as well as to estimate the likely tags for unknown words. See section 3.11 for details.
Command line | Configuration file |
--thres <float> |
ProbabilityThreshold=<float> |
Threshold that must be reached by the probability of a tag given the suffix of an unknown word in order to be included in the list of possible tags for that word. Default is zero (all tags are included in the list). A non-zero value (e.g. 0.0001, 0.001) is recommended.
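The role of the threshold can be sketched in Python as a simple filter. This is an illustration under stated assumptions, not the FreeLing implementation: filter_tags is a hypothetical helper, and the tag names and probabilities are made up.

```python
def filter_tags(tag_probabilities, threshold=0.0):
    """Keep only the tags whose probability reaches the threshold."""
    return {
        tag: prob
        for tag, prob in tag_probabilities.items()
        if prob >= threshold
    }


# Made-up probabilities of each tag given the suffix of an unknown word.
suffix_probs = {"NC": 0.7, "VM": 0.29995, "AQ": 0.00005}

all_tags = filter_tags(suffix_probs)              # default: nothing removed
likely_tags = filter_tags(suffix_probs, 0.0001)   # drops very unlikely tags
print(sorted(all_tags), sorted(likely_tags))
```

With the default of zero every tag is kept, while a small non-zero threshold removes only the very unlikely ones.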
Command line | Configuration file |
--ner [bio|basic|none] |
NERecognition=(bio|basic|none) |
Whether to perform NE recognition and which recognizer to use: "bio" for the AdaBoost-based NER, "basic" for a simple heuristic NE recognizer, and "none" to perform no NE recognition. Deactivating this feature will cause the NE Classification module to have no effect.
Command line | Configuration file |
--fnp <filename> |
NPDataFile=<filename> |
Configuration data file for the active NE recognizer (either "bio" or "basic"). See section 3.9 for details.
Command line | Configuration file |
--nec , --nonec |
NEClassification=(yes|y|on|no|n|off) |
Whether to perform NE classification.
Command line | Configuration file |
--fnec <filename> |
NECFilePrefix=<filename> |
Prefix to find files for Named Entity Classifier configuration.
The searched files will be the given prefix with the following extensions:
See section 3.16 for details.
Command line | Configuration file |
--sense <string> |
SenseAnnotation=<string> |
Kind of sense annotation to perform. When sense annotation is active, the PoS tag selected by the tagger for each word is enriched with a list of all its possible WN synsets. The sense repository used depends on the contents of the "Sense Dictionary File" described below.
Command line | Configuration file |
--fsense <filename> |
SenseFile=<filename> |
Word sense data file. It is a Berkeley DB indexed file. See section 3.20.1 for details.
Command line | Configuration file |
--dup , --nodup |
DuplicateAnalysis=(yes|y|on|no|n|off) |
When this option is set, the senses annotator will duplicate the analysis once for each of its possible senses. See section 3.13 for details.
This may be useful if one wants to perform WSD, or to use the sense field in the analysis in the constraint grammar (see section 3.15.2).
Command line | Configuration file |
-T <string> , --tag <string> |
Tagger=<string> |
Algorithm to use for PoS tagging.
Command line | Configuration file |
-H <filename> , --hmm <filename> |
TaggerHMMFile=<filename> |
Parameters file for HMM tagger. See section 3.15.1 for details.
Command line | Configuration file |
-R <filename> |
TaggerRelaxFile=<filename> |
File containing the constraints to apply to solve the PoS tagging. See section 3.15.2 for details.
Command line | Configuration file |
--iter <int> |
TaggerRelaxMaxIter=<int> |
Maximum number of iterations to perform in case relaxation does not converge.
Command line | Configuration file |
--sf <float> |
TaggerRelaxScaleFactor=<float> |
Scale factor to normalize supports inside the RL algorithm. It is comparable to the step length in a hill-climbing algorithm: the larger the scale factor, the smaller the step.
Command line | Configuration file |
--eps <float> |
TaggerRelaxEpsilon=<float> |
Real value used to determine when a relaxation labelling iteration has produced no significant changes. The algorithm stops when no weight has changed above the specified epsilon.
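The interplay between the epsilon and the maximum number of iterations can be sketched in Python as a generic stopping loop. This is not the relaxation labelling algorithm itself: the update function below is a toy stand-in, and the names iterate_until_stable and toy_update are hypothetical.

```python
def iterate_until_stable(weights, update, epsilon, max_iter):
    """Repeat update() until no weight changes by more than epsilon,
    or until max_iter iterations have been performed."""
    for iteration in range(1, max_iter + 1):
        new_weights = update(weights)
        largest_change = max(abs(n - o) for n, o in zip(new_weights, weights))
        weights = new_weights
        if largest_change <= epsilon:
            return weights, iteration, True   # converged
    return weights, max_iter, False           # iteration limit reached


def toy_update(weights):
    """Toy stand-in: move each weight halfway towards a fixed target."""
    targets = [1.0, 0.0]
    return [w + (t - w) / 2 for w, t in zip(weights, targets)]


_, steps, converged = iterate_until_stable([0.5, 0.5], toy_update, 0.01, 100)
_, capped_steps, capped = iterate_until_stable([0.5, 0.5], toy_update, 0.01, 3)
print(steps, converged, capped_steps, capped)  # prints 6 True 3 False
```

A smaller epsilon demands more iterations before the loop is considered stable, and the iteration limit bounds the cost when stability is never reached.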
Command line | Configuration file |
--retk , --noretk |
TaggerRetokenize=(yes|y|on|no|n|off) |
Determine whether the tagger must perform retokenization after the appropriate analysis has been selected for each word. This is closely related to affix analysis and PoS taggers, see sections 3.7.2 and 3.15 for details.
Command line | Configuration file |
--force <string> |
TaggerForceSelect=(none|tagger|retok) |
Determine whether the tagger must be forced to (probably randomly) make a unique choice and when.
See section 3.15 for more information.
Command line | Configuration file |
-G <filename> , --grammar <filename> |
GrammarFile=<filename> |
This file contains a CFG grammar for the chart parser, and some directives to control which chart edges are selected to build the final tree. See section 3.17.1 for details.
Command line | Configuration file |
-J <filename> , --dep <filename> |
DepRulesFile=<filename> |
Rules to be used to perform dependency analysis. See section 3.18.1 for details.
Lluís Padró 2010-09-02