Author: Christian Wittern, Taipei -- Created: 99-07-20, Last change: 00-01-10
URL of this document: http://www.chibs.edu.tw/~chris/smart/smindex.htm

Technical note:

SMART: Format of the Index Files

Overview

As the name suggests, retrieval plays a very important part in SMART. Retrieval is based on a new type of index, that places a layer of abstraction between the actual locations in electronic documents and the positions in the texts referenced.

The SMART project tries to develop a suite of general tools for work on East Asian texts. Some assumptions are made about the texts:

They should be encoded in XML according to the TEI guidelines.
The milestone tags (<pb>, <lb>) down to the line level.

Retrieval in SMART is based on index files. These files serve two separate purposes:

Allow for fast access to any portion of the text and some basic context information independently of the format.
Reconstruct a flat version for each of the witnesses, that are recorded as variant readings in the XML files.

While the first point might be self-explanatory, I will give some explanation of the second point. Here is a short extract of a text in XML format. Variants are surrounded by the <app> element, with the text version of the base text (Taisho) marked with the <lem> element, other versions are marked with the <rdg> element, the attribute wit indicates the name of the witness.

As can be seen from this example, some characters (marked with the <rdg> element) fall out of the sequential structure of the text, they can be seen as stacked on top of the primary version of the text (marked with the <rdg> element). For the building of the index, additional lines will be inserted into the index, to account for these additional characters. An example is shown here:

In this example, an index with two additional characters has been buildt, the meaning of the numbers to the left of the characters is explained below.

The index consists of three or four different data files:

A dictionary file.
A binary index to the dictionary file.
A file that records mappings between the text identifiers and filenames.
One or more files that record mappings between the logical and physical location of lines. This is not always necessary.

Format of the dictionary file

The dictionary file simply contains the entries to be indexed one per line. The encoding is in UTF-16 (which is identical to UCS-2, the most popular Unicode encoding for most practical purposes). Mojikyo characters (that is, characters defined by the Mojikyo Font Institute) are using the Shift-Mojikyo encoding developed by Mr. Masahiko Maedera. The dictionary entries are separated by a tab character from the character addresses (locations), which are then following up to the end of line.

The file begins with the byte order mark FFFE followed by DIC and a digit indicating the format type (eg. 1) and a line-end (0D000A00) sequence.

Format 1:

The index length of the index term is variable, it is terminated by a tab character (0900). Following is a sequence of 6 byte location addresses, up to the end of the line.

The last adress is followed by a newline character. The adress is always given for the first Character; if preceding characters are given, for e.g. KWIC they should be appended at the end,separated from the other characters by a comma (2C00) character.

Example:
[U4e00][U4e00][U4e01] 00b4e8c6e63300b4c6e933e7 2 instances
...

Format of the character address:

The logical position of each character in the text, along with some other information, is expressed with the character address (I will call this location from now on). This location consists of a 15 digit decimal number, which encodes this information in its positions. Lets look at an example first:

The first character in the line T10n0279_p0001a01 is expressed as 00010027900000110101, which breaks down to:


T10n0279_p0001a01 =          010 0001 10101 000
                             010                  Volume    = 010 (maximum: 280 volumes)
                                 0001             Page      = p0001
                                      1           Cataster  = a
                                       01         Line      = 01
                                         01       Character = n/a 
                                            0     RDG       = 0=T, 1=S, 2=Y, 3=M  4=3, 5=G,9=others
                                             0    BaseTag   = 0=P, 1=L, 2=Z, 3=J, 4=Byline, 5=Comm, 6=Note
                                              0   ExtTag    = user defined... Name, Yin, etc.

As can be seen, the digit breaks down to different units, which have different semantics assigned to it. As can be seen, the first 12 digits are used to encode the character address in the text, expressed in the units of volume (3 digits), page (4 digits), cataster (1 digit), line (2 digits) and character number on the line (2 digits). These units are of course completely arbitrary and they can be redefined as needed.

The last three digits, are defined below the horizontal bar, they provided some information about the context of the character in question, derived from the XML markup. Provided is information about whether the character is

31.12.99: Alternatively, the cataster number can be used as a flag. I am using it now to indicate whether the character is the last in the line: For this purpose, the value of the cataster indicator is increasing in increments of 3. Adding 1 to the original value will indicate the last character of a line. This is important to be able to calculate the distance in the text of two characters for word-clustering and approximity searches.

The maximum is now 085146411005000 which can be expressed in 6 bytes. This will need an extra table for lookup of the text number.

     010000110101000
     085146411005000
     281474976710656   65536 by 3

Alternatively, there can also be

T10n0279_p0001a01 = 00010027900000110101
                        0                       Var.flag    = 0:regular, 1: variant
                         0                      reserved
                          0                     TaishoEd    = T  (others not yet defined)
                           10                   Volume      = 10
                             0279               Number      = n0279
                                  00            add. Letter = _
                                    0001        Page        = p0001
                                         1      Cataster    = a
                                          01    Line        = 01
                                             01 Character   = n/a

or T85n2920_p1464a10 = 00085292000146411005

The maximum is 085292000146411005, which can be expressed with 8 bytes. Every character in the Taisho can be addressed with this.

Format of the index file

This is a binary (non-text) file consisting of long (i.e. four-byte) integers. These are offsets to the starting bytes of the lines or words or tokens within the dictionary file which can be looked up. The offsets are sorted according to the lexical value (i.e. collating sequence) of the tokens to which they point.

This index format is modelled after a format used by Jim Breen for his EDICT dictionary.

List of texts

Gives first (and last?) lines of the texts, followed by text title and any other information. A hierarchy of lower level textual elements can be constructed by indenting the file with one space/tab for every hierarchical level.

Mappings to file-offsets

This gives the position of the beginning of each line and the filename, where this is found. The receiving application might have to strip some markup, the beginning is the beginning of the physical line in the file, not the logical line of the text.

Other considerations

Determining the optimal size of the index

The size of the index depends largely on the number of characters included besides the lookup character. To estimate the size, a set of dictionary files of 1 to 7 characters has been calculated for one volume. Additionaly, complete indices have been created for a dictionary of 3 and 7 characters length. A larger size means faster response time and an immediate KWIC-like view. If this is not done and a KWIC is needed, it would have to be constructed on the fly, which is more timeconsumin.

One volume has approx. 1 500 000 characters.

In strings of different length, they are approximately distributed as follows:

Length of entry	Lines in index	Size of dictionary (1 Vol)	Size of index (28 Vols)
1	3814	16 487
2	174208	1 049 942
3	575164	4 610 411	313 499378
4	838242	8 396 379
5	970633	11 666 525
6	1034526	14 507 096
7	1074590	17 221 803	916 231744
Total	1504517	--