Zettair can build inverted indexes by parsing different types of source collections. Please read the format descriptions to understand fully how an index is constructed from the given data. Currently, the following index types are supported:
Usage: zet -i file1 ... fileN
Index construction options
Put Zettair into index construction mode (as opposed to searching mode).
The given files (file1 ... fileN) are the files to index for searching. If no files are given, a list of filenames, separated by whitespace, is read from stdin. This allows you to pipe in the list of filenames to index from a file or shell command. The command:
would find all files with c and h extensions and index them, placing the result into a set of files that start with source_index.
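As a rough illustration of producing such a filename list, the sketch below walks a directory tree and prints every path ending in .c or .h, one per line. It is not part of Zettair; the function names are hypothetical, and paths containing whitespace are skipped on the assumption that they could not be told apart in a whitespace-separated list.

```python
import os
import sys


def source_files(root, extensions=(".c", ".h")):
    """Yield paths under root whose names end in one of the extensions."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(extensions):
                yield os.path.join(dirpath, name)


def print_file_list(root, out=sys.stdout):
    """Write one path per line, skipping names that contain whitespace."""
    for path in source_files(root):
        if any(ch.isspace() for ch in path):
            # Assumption: a whitespace-separated list cannot carry such names.
            print("skipping (whitespace in name): " + path, file=sys.stderr)
            continue
        print(path, file=out)


if __name__ == "__main__":
    print_file_list(sys.argv[1] if len(sys.argv) > 1 else ".")
```

Saved under a hypothetical name such as list_sources.py, the script's output could be piped to zet -i together with whichever of the options described in this section are needed.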
Give the name of the index to use. If no name is given, 'index' is used as the default. The name serves as a prefix for the index files and can include directory path components.
Use this configuration file for the parser. The configuration file determines which tags the parser attempts to extract text from. The format is a simple text file where the name of a tag (minus the angle brackets) is followed by a number that indicates whether parsing should be turned on or off after this tag. See config/psettings.xml for an example.
Causes Zettair to use around 500MB of memory during indexing (by default, around 20MB is used).
Allow Zettair to add new postings to an existing index. By default, attempting to do so causes an error.
Use the given stemming algorithm during index construction. None is no stemming. eds removes the suffixes 'e', 'ed', and 's'. light is a custom stemmer that is fast, but slightly less effective than Porter's stemmer. Porter's stemmer is a well-known stemming algorithm that is slower and more complex.
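To make the eds description concrete, here is a minimal sketch of a suffix stripper of that kind. It is not Zettair's implementation: removing at most one suffix, trying 'ed' before 'e' and 's', and requiring a minimum remaining length are all assumptions made for illustration.

```python
def eds_stem(term, min_stem=2):
    """Strip one trailing 'ed', 'e', or 's' from term (illustrative only)."""
    for suffix in ("ed", "e", "s"):
        # Assumption: leave very short terms alone so they are not emptied.
        if term.endswith(suffix) and len(term) - len(suffix) >= min_stem:
            return term[: -len(suffix)]
    return term


if __name__ == "__main__":
    for word in ("indexed", "files", "parse", "be"):
        print(word, "->", eds_stem(word))
```

Under these assumptions "indexed" becomes "index" and "files" becomes "file", while a short term such as "be" is left unchanged.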
Generate impact-ordered inverted lists during construction. This is required to use impact-ordered evaluation during querying.
Select the type of the index, TREC or HTML (default: autodetect).
Sample Command Line:
This command will use the TREC parser to create an inverted index from the four listed files. You should then find the following index files:
The HTML parser treats each file as one document in HTML format. Text is extracted from HTML documents according to the parser settings file, documented above.
It is often advantageous to combine several (thousand) documents in one file, so that a single file can be indexed and searched rather than a few thousand. This can be done by writing the contents of several files into one file and formatting that file so that the parser can detect the original document boundaries. The parser extracts words from the given file in much the same way as in HTML mode. Additionally, the TREC parser looks for the tags <DOC> and </DOC>, which signal the beginning and end of a document, and identifies each document by its TREC document number, which is found between the <DOCNO> and </DOCNO> tags. The TREC format is so named because it is the format used by the Text Retrieval Conference (TREC) for experimental data.
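The layout described above can be sketched in a few lines of Python. This is only an illustration of the <DOC> and <DOCNO> structure, not Zettair's parser: the function names and document numbers are hypothetical, and the regular expressions assume upper-case tags with no nesting.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)


def to_trec(docs):
    """Join (docno, text) pairs into one TREC-style string."""
    parts = []
    for docno, text in docs:
        parts.append("<DOC>\n<DOCNO>{}</DOCNO>\n{}\n</DOC>\n".format(docno, text))
    return "".join(parts)


def split_trec(data):
    """Recover (docno, body) pairs from TREC-style text."""
    docs = []
    for match in DOC_RE.finditer(data):
        block = match.group(1)
        docno = DOCNO_RE.search(block)
        if docno is None:
            continue  # assumption: skip blocks without a document number
        body = DOCNO_RE.sub("", block, count=1).strip()
        docs.append((docno.group(1).strip(), body))
    return docs


if __name__ == "__main__":
    combined = to_trec([("DOC-1", "first document"), ("DOC-2", "second")])
    print(combined)
    print(split_trec(combined))
```

Writing the result of to_trec to a single file gives one input file in which each original document remains identifiable by its document number.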
For instance, the following excerpt from the Bible represents 8 documents, 4 of which contain only one word.
<DOC> And the sons of Noah, that went forth of the ark, were Shem, and Ham,