Kyoto University, Institute for Research in Humanities

Recommendations for revision of the TEI Guidelines, Chapter 4

   Author: TEI Charset Issues Working Group (edited by Christian Wittern) Date:

Contents

1. 4 Characters and Character Sets

1.1. Introduction

Computer systems vary greatly in the sets of characters they make available for use in electronic documents; this variety enables users with widely different needs to find computer systems suitable to their work, but it also complicates the interchange of documents among systems; hence the need for a chapter on this topic in these Guidelines. By the 1980s, it had become clear that a single unified character set that could accommodate all languages and scripts used in the world would be desirable and, with the increased capacity of computers, also feasible. These efforts led to the formation of the Unicode Consortium and the development of the Unicode Standard, which was later synchronized with ISO/IEC 10646, the corresponding standard of the International Organization for Standardization (ISO), to ensure that there would in fact be only one universal encoding.

For XML-compliant TEI texts, the underlying character set is defined to be Unicode, although the texts can be encoded in any existing character set that is a proper subset of Unicode and has a normative mapping to Unicode. This chapter briefly describes Unicode, discusses how to use Unicode in text encoding projects, and gives some recommendations on whether or not to use other encodings.

The XML recommendation recommends an XML declaration at the beginning of XML documents. An XML declaration can optionally contain an encoding declaration, which can specify values like "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" for the various encodings of Unicode or ISO/IEC 10646. Character encodings registered with the Internet Assigned Numbers Authority (IANA) should be referred to by their registered names. These Guidelines do not recommend the use of other character sets, although the XML recommendation does allow them; if other character sets are used, the XML recommendation suggests encoding names starting with "x-". The same conventions also hold for auxiliary documents such as external entities. Together with an encoding declaration, the XML declaration takes the following form:

<?xml version="1.0" encoding="UTF-8"?>
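
An external parsed entity can likewise declare its encoding in a text declaration at its beginning; a minimal example (in a text declaration, unlike the XML declaration, the version information is optional and the encoding declaration is required):

<?xml encoding="ISO-8859-1"?>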

1.2. What is Unicode?

Unicode aims at covering all characters used in the world for written text, thus enabling the text encoder to choose from an extremely large repertoire of characters. While this situation is certainly an improvement over the character sets used previously, which were usually limited to a small number of languages, the encoder now faces the problem of determining which of the available characters is the appropriate one to use.

Texts are usually made up of distinct visible units commonly called characters. In information processing circles, however, it has become customary not to call them characters but `glyph images', while the term character is reserved for the abstract notion of `the smallest unit that carries semantic value'[1] and is rendered graphically with one of a range of different glyph shapes.

This distinction introduces an additional layer of abstraction between the letters seen on a printed document or manuscript and the way they are digitally represented in computer files. Over the years, many different ways to encode these units have been proposed by national standards bodies and vendors of information processing systems. As pointed out in 2.9.1 The SGML Declaration, the character set used in an SGML document can be arbitrarily defined in the SGML declaration. XML documents, however, all share the same SGML declaration, which specifies ISO/IEC 10646[2] as the document character set.[3] Since the principles of collection, selection, and alignment of characters in different character sets differ considerably, no general recommendations can be given. We will, however, discuss some of the implications of using Unicode, since this is relevant to all XML documents and to those SGML documents that choose to use Unicode as the document character set. For documents using other character sets, see section 1.8. Other character sets below.

Unicode has attempted to encode `abstract characters' independent of specific glyph forms. The glyph examples given in the Unicode code charts are not normative and are just one of possibly many examples; they are only given to indicate which character is intended. Any user of the Unicode Standard has to realize that there is no way to ensure that a given character encoded with a given Unicode codepoint will be rendered in a similar way on other computer systems. The information about which glyph from which glyph collection is to be used is not encoded and is therefore not available to any rendering process.

The code space for Unicode characters currently has room for far more than 1 million characters. These characters are organized in 17 planes, each of which holds up to 65536 codepoints. Only the first of these planes, the `Basic Multilingual Plane' (BMP), can be addressed using a single 16-bit integer; all other planes require greater values. The remaining 16 planes are, however, addressable using two 16-bit integers from ranges of 16-bit values, the `surrogates', set aside especially for this purpose. Unicode 3.1 has assigned codepoints beyond the BMP for the first time, but most characters can be addressed using just the BMP.
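
As a worked illustration of the surrogate mechanism (the arithmetic is that of the UTF-16 encoding form), a codepoint C beyond the BMP is represented by the pair

  high surrogate = D800 + ((C - 10000) >> 10)
  low surrogate  = DC00 + ((C - 10000) AND 3FF)

(all values hexadecimal). For U+20000, the first codepoint of CJK Unified Ideographs Extension B, assigned in Unicode 3.1, this yields D800 + 40 = D840 and DC00 + 0 = DC00, so U+20000 appears in UTF-16 as the pair D840 DC00.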

Some basic information about characters, character sets, and issues to watch out for is provided in this chapter. To avoid confusion and misunderstandings, definitions of some relevant terms will be given first, followed by a short introduction to some basic concepts of the Unicode Standard, most importantly the character/glyph model. We will then proceed to give some advice on how Unicode should be used in practice and what problems we think could arise.

A note on notation: wherever possible, a Unicode codepoint mentioned in the text is first shown with an example glyph, then with the numeric value of the codepoint in hexadecimal notation, preceded by "U+", and finally with the canonical name of the character according to the Unicode Standard, for example:

ä U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

1.2.1. Some definitions

  • Character
    1. A character is an atomic unit of text as specified by ISO/IEC 10646.
    2. Synonym of `abstract character', i.e. it refers to the abstract meaning and/or shape, rather than a specific shape (glyph). See definition D3 in Section 3.3, Characters and Coded Representations of The Unicode Standard Version 3.0.
    One character can be visually rendered by a number of different glyphs. It is also important to note that one character can be encoded using a sequence of Unicode codepoints.
  • Character set A collection of characters (basic elements) used to represent textual information.
  • Glyph The visible shape of a character. It is sometimes considered to be an abstract shape, with a corresponding `glyph image' as concrete instantiation of that glyph. For the present purpose, this distinction is not relevant.
  • Variant character The term variant character is frequently used in cases where variant glyph would be more appropriate. We will not use it here.
  • Precomposed character A character that is equivalent to a sequence of one or more other characters, according to the decomposition mappings found in the Unicode Character Database. Precomposed characters are also called `composite characters' or `decomposable characters'. See also `compatibility character' below. An example of this is the character ä (U+00E4) mentioned above, which can be decomposed to the character sequence U+0061 (LATIN SMALL LETTER A) + U+0308 (COMBINING DIAERESIS).
  • Variant glyph A different graphical representation for the same abstract character.
  • Presentation form In some scripts, different glyphs are used to represent one character depending on contextual circumstances; such contextually selected glyphs are called presentation forms.
  • Font A font is a collection of specific glyphs, frequently in a particular family and style. It should not be confused with a character set, a character encoding, or other similar notions.
  • Compatibility character In the Unicode Standard, a compatibility character is defined as follows: ‘
    1. A character encoded only for compatibility with preexisting character encoding standards to support transcoding.
    2. A character that has a `compatibility decomposition'. (See Definition D21 in Section 3.6, Decomposition.)
    ’ Many `precomposed characters', including accented characters, are also examples of `compatibility characters'. The existence of a `compatibility decomposition' is indicated in the Unicode Character Database.
  • Normalization Transformation of data to a standardized (normal) form. This frequently means selecting one specific code value out of several possible ones and doing so consistently.

These terms do not completely match some similar terms frequently used in linguistic theory (see for example R. R. K. Hartmann and F. C. Stork, Dictionary of Language and Linguistics, Applied Science Publishers Ltd., London, 1976):

  • Grapheme: A minimally distinctive unit of a particular writing system. The different variants of a letter, e.g. the cursive and printed shapes of M and m in an alphabetic writing system, are all allographs of the grapheme /m/.
  • Allograph: One of a group of variants of a grapheme or written sign in a particular writing system. It usually refers to different shapes of letters and punctuation marks, e.g., lower case, capital, cursive, printed, strokes.
As can be seen, `glyph' and `allograph', as well as `grapheme' and `abstract character', are clearly related, partly overlapping concepts. The difference is that the grapheme concept is defined in relation to a particular writing system, whereas the concept of abstract character is defined independently of any specific writing system.

1.2.2. The Unicode Standard

The Unicode Standard[4] defines the universal character set. Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its third major version, Unicode contains a large number of characters covering most of the scripts currently used in the world. It also contains additional characters for interoperability with older character encodings, as well as characters with control-like functions included primarily to allow unambiguous interpretation of plain text. Unicode provides specifications for the use of all of these characters.

Users of these Guidelines are strongly encouraged to familiarize themselves with the general principles of the Unicode Standard, as spelled out in The Unicode Standard Version 3.0, Chapter 2. Additionally, there is an excellent document co-published by the World Wide Web Consortium and the Unicode Consortium, Unicode in XML and other Markup Languages.[5] This document gives some general considerations, discusses the suitability of certain code values or code ranges for use in the context of markup languages, and finally gives specific recommendations on what to do with characters that have compatibility mappings. All of this is of great relevance to text encoding as described in these Guidelines and should be taken into account.

1.3. Characters and glyphs

1.3.1. The Character / Glyph model

As outlined above, the distinction between characters and glyphs is crucial. A coherent model of the relationship between characters and glyphs has been developed within the Unicode Consortium and the ISO working group SC2.[6] It is expected to form the basis for future standards work.

According to this model, what we see as a character on paper actually belongs to two different realms:

  • The content, i.e. its meaning and phonetic value
  • The graphical appearance

The requirements of these two realms in text processing are quite different. Searching for information, for example, operates on the content domain, usually with little attention paid to the appearance of characters. A layout process, on the other hand, has little to do with the content, but needs to be concerned with the exact appearance of characters. Of course, some operations, such as hyphenation, require knowledge of both domains. Text encoding usually operates in the first realm, but in some cases it is desirable to find ways to encode the second as well.

1.3.2. Implications of the Character/Glyph model for text encoding

Since text encoding is not primarily concerned with the appearance of the text on the presentation level, it is usually sufficient to select the correct codepoint for the abstract character. There are, however, cases where the text encoder might want to preserve the information about which concrete glyph was used.

In principle, there are two different levels at which such information might be encoded:

  • On the level of character encoding, e.g. with Unicode code points.
  • On the markup level, with appropriate elements and/or attributes.
Currently, neither the Unicode Standard nor these Guidelines offer specifications for encoding glyph variation. Work is under way in both areas to improve this situation; better solutions are expected in future versions of the relevant standards.

1.4. Character semantics

The Unicode Consortium maintains a database of additional character semantics. This includes the value of the codepoint, a character name, and normative properties. This database is an important reference in determining which Unicode codepoint to use to encode a certain character. The character properties database also contains information about, among other things, case, numeric value, and directionality, and, as mentioned above, about the status of a character as a `compatibility character'.[7] When looking up Unicode characters for use in text encoding, it is highly recommended to consult this database to make sure that the character used is the one with the desired properties.
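
As an illustration (a sketch of the database format, not a normative citation), the entry for the character ä in the UnicodeData.txt file of the Unicode Character Database is a line of semicolon-separated fields, giving among other things the codepoint, the character name, the general category (Ll, a lowercase letter), the canonical decomposition, and the uppercase mapping; it looks roughly as follows:

  00E4;LATIN SMALL LETTER A WITH DIAERESIS;Ll;0;L;0061 0308;;;;N;LATIN SMALL LETTER A DIAERESIS;;00C4;;00C4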

1.5. Normalization

As has been discussed, there are a number of characters that have more than one encoding in Unicode. In some cases this is due to the existence of compatibility characters; in other cases there are simply different ways to express the same character, as with characters composed of a base character and one or more diacritical marks. Unicode does contain quite a number of `precomposed characters', mostly because they were already encoded in earlier standards that served as sources for Unicode, but in principle no additional precomposed characters are to be added. A sequence of a base character and one or more diacritical marks is defined to be equivalent to the corresponding precomposed character.

In text encoding projects, it is important to decide, and to document (e.g. in the <teiHeader>), which of these different forms are to be used in a document. It is very important to standardize consistently on one form, in order to ensure the integrity of the data. The Unicode Consortium provides four standard normalization forms,[8] of which Normalization Form C (NFC) seems the most appropriate for text encoding. The World Wide Web Consortium has produced a document entitled Character Model for the World Wide Web 1.0,[9] which among other things outlines some principles of normalization. In general, normalization to the shortest possible Unicode encoding is recommended.
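
As a minimal illustration (the TEI element <hi> is used here only for the sake of the example), the character ä discussed above could appear in an XML document in either of two canonically equivalent forms, given here as numeric character references; under NFC, only the first, precomposed form would be used:

  <hi>&#x00E4;</hi>          (precomposed: U+00E4)
  <hi>&#x0061;&#x0308;</hi>  (decomposed: U+0061 followed by U+0308 COMBINING DIAERESIS)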

1.6. Characters from the Private Use Area

Although Unicode now contains far more than 90000 characters, there is always the possibility that characters needed for text encoding are not defined in Unicode. Some of these may be presentation forms or alternate writing styles of East Asian characters that do not qualify for inclusion in Unicode. For such characters, Unicode provides a `Private Use Area', which is reserved for the use of vendors, private groups, and individuals. There are 6400 codepoints in this area in the BMP and 131068 in other planes.

The use of codepoints from this range in TEI documents intended for interchange is strongly discouraged! It is recommended to use other mechanisms for such characters, for example entity references together with supplemental documentation in the WSD.

For local processing, on the other hand, the use of characters from this area might prove convenient: if the corresponding font resources are available, text encoders can see the characters more easily on their screens, and analytical software might not be able to process entity references in the same way as characters. In any case, before a TEI document is prepared for interchange, all occurrences of characters from the Private Use Area should be removed.
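
One possible way to reconcile the two needs is to use an entity reference throughout the text and change only its declaration; a sketch (the entity name and codepoint are invented for illustration, and the two declarations are alternatives, not to be used together):

  <!-- for local processing: map the entity to a Private Use Area codepoint -->
  <!ENTITY rareChar '&#xE000;'>

  <!-- for interchange: replace the codepoint by a documented substitute string -->
  <!ENTITY rareChar '[rare character 1, see WSD]'>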

1.7. xml:lang and the TEI lang attribute

Both TEI and XML offer ways to identify the natural language of documents. In TEI, the global `lang' attribute is defined with the content type IDREF; it points to a <language> element in the <langUsage> section of the <teiHeader>. The ID attribute of the <language> element specifies the identifier for the `Writing System Declaration' (WSD). This will usually be a two-letter code as defined by ISO 639:1988 or a three-letter code as defined in ISO 639-2:1998,[10] in which case it should match the value of the `iso639' attribute of the <language> element in the WSD.[11] If there is no applicable language code in the ISO 639 family of standards, other identifiers can be used, for example identifiers from the SIL Ethnologue database.[12] Usually the WSD will be in an external file, and the `wsd' attribute points to the entity which contains the WSD. The scope of the lang attribute in TEI is the text content of the element it is declared on and of all contained elements that do not themselves carry a lang attribute.
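
A minimal sketch of this mechanism (the identifier, entity name, and language are chosen only for illustration; the declaration of the entity containing the WSD is omitted):

  <langUsage>
   <language id="de" wsd="wsd.de">German</language>
  </langUsage>
  ...
  <p lang="de">Ein deutscher Satz.</p>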

It should be noted that in these Guidelines the `lang' attribute, through the WSD, identifies both the language and the writing system employed for that language. The WSD mechanism is scheduled for revision in a later version of these Guidelines, which might change this behaviour.

In XML, the attribute xml:lang takes as its value the identifier of a language as defined in ISO 639 or registered with IANA. According to the XML recommendation, its scope is the content and all attributes of the element on which it is specified, including all contained elements, unless it is overridden by another xml:lang attribute.

Since the TEI DTD defines a great number of CDATA attributes with predeclared content in English, xml:lang cannot be applied as intended in the XML recommendation. As of this version of these Guidelines, it is therefore not recommended to rely on the xml:lang attribute; instead, these Guidelines recommend continuing to use the existing TEI language identification mechanism.

1.8. Other character sets

There are cases where Unicode cannot be used for the encoding of texts, be it because the processing software does not support it or for other reasons. In SGML, the document character set can be freely chosen, and a corresponding SGML declaration must be supplied.

The following problems related to character sets need to be considered by the encoder of electronic texts:

  1. selecting a character set to use in creating, processing, or storing the electronic text
  2. encoding characters which are not provided in this document character set
  3. preparing documents for interchange so that the characters within them are not corrupted in transit

No single character set is prescribed for use in TEI-encoded documents. Users may use any character set available to them, subject to the character set restrictions imposed by the SGML declaration. It is recommended that the character set used be documented by one or more `writing system declarations' (WSD), on which see below. In most cases, a predefined writing system declaration should be suitable. For writing system declarations provided as part of the TEI Guidelines, see chapter here.

In general, it is most convenient to use a character set readily available on one's computer system, though for special purposes it may be preferable to customize the character set using software specialized for the purpose.[13] Whether to use the usual character set or create a custom one depends on the documents being encoded, the staff support and tools available for customizing the character set, the user's technical facility, and so on. A choice must be made between the perceived convenience of living with an existing character set and the effort of modifying it to suit one's documents more closely. Where a suitable character set is defined by a national or international standard, local customization should implement the standard rather than inventing yet another non-standard character set. The choice must be made by each encoder according to individual circumstances; no general recommendations are made here as to whether locally customized character sets should be used. In principle, however, for local processing, encoders should use whatever character set they find convenient.

1.8.1. Local Character Sets

When the characters in a text exist in the local character set, the appropriate character codes should be used to represent them. Virtually all computer systems provide at least the 52 letters of the standard Latin alphabet, the ten decimal digits, and some basic punctuation marks.

Other characters, such as Latin characters with diacritics (e.g. ä or é) or non-Latin characters (e.g. Greek, Hebrew, Arabic, Cyrillic), are less universally provided. East Asian scripts pose particular problems because of the size of their character repertoires. If the local character set provides an `ä', however, there is normally no reason not to use it where that character appears in the text, unless the electronic text is to be moved frequently among machines, in which case one may wish to restrict the electronic text to characters known to translate well among machines. (For more information on moving characters among machines, see section 1.8.3. Character Set Problems in Interchange below.)

As noted above, full use of a local character set may require that the SGML declaration be modified to define all the characters used as legal SGML characters.

Characters not available in the local character set should usually be encoded using SGML entity references. In SGML terms, an entity (described in more detail in section here) is any named string of characters, from a single character to an entire system file. Entities are included in SGML documents by entity references. Lists of suggested names for all the characters and printers' symbols used by most modern European languages have been published by ISO and others.[14]

For example, the standard entity name for the character ä is auml; a reference to the entity gives the name of the entity, preceded by an ampersand and followed by a semicolon.[15] Consider the following German sentence:

Trotz dieser langen Tradition sekundäranalytischer Ansätze wird man die Bilanz tatsächlich durchgeführter sekundäranalytischer Arbeiten aber in wesentlichen Punkten als unbefriedigend empfinden müssen.
Using the standard names for a-umlaut and u-umlaut, one could transcribe this sentence thus:
  Trotz dieser langen Tradition sekund&auml;ranalytischer
	Ans&auml;tze wird man die Bilanz tats&auml;chlich durchgef&uuml;hrter
  sekund&auml;ranalytischer Arbeiten aber in wesentlichen Punkten als
  unbefriedigend empfinden m&uuml;ssen. 

As noted above, standard entity names have been defined for most characters used by languages written in the Latin alphabet, and for some other scripts, including Greek, Cyrillic, Coptic, and the International Phonetic Alphabet. A useful subset of these may be found in chapter here.

Before an entity can be referred to, it must be declared. Standard public entity names can be declared en masse by including, within the DTD subset of the SGML document, a reference to the standard public entity which declares them. The German document quoted above, for example, would have the following lines, or their equivalent, in its DTD subset:

  <!ENTITY % ISOLat1
	PUBLIC "ISO 8879-1986//ENTITIES Added Latin 1//EN">
	%ISOLat1;  

Where no standard entity name exists, or where the standard name is felt unsuitable for some reason, the encoder may declare non-standard entities, using the normal SGML syntax. The replacement string for such entities may vary for different purposes, as in the following extended example.

In transcribing a manuscript, it might be desirable to distinguish among three distinct forms of the letter r. In the transcript, each of these forms will be encoded by an entity reference, for example: &r;, &r2;, and &r3;. Entity declarations must then be provided, within the DTD subset of the SGML document, to define these entities and specify a substitute string.

One possible set of declarations would be as follows:


 <!ENTITY r  'r[1]'><!-- most common form of 'r'  -->
 <!ENTITY r2 'r[2]'><!-- secondary form of 'r'    -->
 <!ENTITY r3 'r[3]'><!-- third form of 'r'        -->

The expansions shown above will simply flag each occurrence with a number in brackets to indicate which form of r appears in the manuscript. If the file is processed in a setting where graphics can be used, for example in print or in publications on the World Wide Web, graphical renderings of the different forms of r used in the manuscript can be shown by defining appropriate expansions for the entities, as follows (note that this particular technique is available only in SGML: in XML, references to unparsed entities may not appear in document content):


 <!ENTITY r  SYSTEM "r1.gif" NDATA GIF><!-- most common form of 'r'  -->
 <!ENTITY r2 SYSTEM "r2.gif" NDATA GIF><!-- secondary form of 'r'    -->
 <!ENTITY r3 SYSTEM "r3.gif" NDATA GIF><!-- third form of 'r'        -->

1.8.2. Shifting Among Character Sets

Many documents contain material from more than one language: loan words, quotations from foreign languages, etc. Since languages use a variety of writing systems, which in turn use a variety of character repertoires, shifts in language frequently go hand in hand with shifts in character repertoire and writing system. Since language shifts are frequently of importance in processing a document, even when no character set shift is needed, the encoding scheme defined here provides a global attribute `lang' to make it possible to mark them explicitly.

Depending on the definitions in the Writing System Declaration, such a shift might also imply a shift between character sets. In practice, editors that support character set shifts are extremely rare, which makes these shifts difficult to use. In many cases, the shift is not really to a different character set, but merely to a different font that puts different characters in the existing slots, thus overloading the same codepoints with different meanings. While this works well on the presentation level and allows proper rendering of the text, it poses problems for operations on the information content, for example string searches. Such searches might give wrong results, since the application software might not be aware that identical codepoints in different fonts are intended to represent different characters.

However such a shift might be achieved in practice, data files that contain text in different character sets are extremely difficult to manage. It is thus advisable to consider other possibilities carefully before starting to encode in this way. If the occurrences of a secondary character set are limited to a few short quotations, it might be better to use entity references throughout. If there are considerable passages, it might be worth looking for software that can accommodate a superset of both character sets, for example Unicode.
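
For example (a sketch: the element, the language value, and the letters are chosen only for illustration), a short stretch of Greek in an otherwise Latin-alphabet document could be encoded entirely with entity references from the ISO Greek entity sets, here the letters alpha, beta, and gamma:

  <q lang="EL">&agr;&bgr;&ggr;</q>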

Some languages use more than one writing system. For example, some Slavic languages may be written either in the Latin or in the Cyrillic alphabet, some Turkic languages in Cyrillic, Latin, or Arabic script. In such cases, each writing system must be treated separately, as if it were a separate `language'. Each distinct value of the `lang' attribute, therefore, represents the combination of a single natural language, a single writing system, and a single coded character set.

Each value used for the `lang' attribute must correspond to a writing system declaration suitable for the language, character set, and writing system being used. The values should, where possible, be taken from the standard two- and three-letter language codes defined by the international standard ISO 639.[16] When more than one writing system is used for the same language in the same document, suffixes should be added to the values from ISO 639; using the two-letter forms, for example, zh-big5 could be used for Chinese encoded in the Big5 character set and ja-euc for Japanese in the EUC encoding. With the three-letter forms of ISO 639-2, these would become zho-big5 (or chi-big5) and jpn-euc. A selection of standard language codes, as well as a number of standard writing system declarations, are provided in chapter here; other writing system declarations may be provided locally and should accompany the encoded texts in interchange.

Like any global attribute, the `lang' attribute may be used on any element in the SGML document. To mark a technical term, for example, as being in a particular language, one may simply specify the appropriate language on the <term> tag (for which see here):

  <p lang="EN"> ... But then only will there be good ground
of hope for the further advance of knowledge, when
there shall be received and gathered together into
natural history a variety of experiments, which
are of no use in themselves, but simply serve to
discover causes and axioms, which I call <term
lang="LA">Experimenta lucifera</term>, experiments
of <term>light</term>, to distinguish them from
those which I call <term
lang="LA">fructifera</term>, experiments of
<term>fruit</term>. <p>Now experiments of this
kind have one admirable property and condition:
they never miss or fail. ... 

The form in which materials in different writing systems are processed or displayed is not specified by these Guidelines; if appropriate characters are not available locally, application software may choose to display the material in an appropriate transliteration, as a series of entity references, or in other forms. If the local system requires explicit escape sequences or locking-shift control functions to signal shifts to alternate character sets, these may be supplied by the application software upon recognition of the language shift. Escape sequences embedded directly in the content of an element, on the other hand, are vulnerable to loss or misinterpretation by software that does not understand them; for this reason, the method of embedding them directly in the document, while allowed within TEI-conformant interchange, is not recommended for general use.[17]

1.8.3. Character Set Problems in Interchange

Electronic texts may be exchanged over electronic networks, through the exchange of magnetic or optical media (e.g. disk, tape, or CD-ROM), or by other means. In every case except the transmission of storage media from one machine to another machine of the same hardware type, running the same version of the same operating system with the same locale setting and using the same coded character set, the characters are subject to translation and interpretation, and hence to misinterpretation and corruption, by utility software working somewhere on the interchange path. Network gateways, tape-reading software, and disk utilities routinely translate from one character set to another before passing the data on. If a utility errs in identifying the character set, or if several utilities translate back and forth among character sets using non-reversible translations, the chances are good that characters will be garbled and information lost.

As of today, most interchange of documents over networks is done in `binary' mode, which leaves the content of the exchanged files intact. While this usually also means that the line-ending conventions, which differ between operating systems, are not adjusted, this is hardly a problem, especially with SGML or XML documents, where this information can be given with markup constructs.

More information about interchanging documents is given in 30 Rules for Interchange.

1.9. The Writing System Declaration

Each language and writing system used in a TEI-conformant document should be documented in a Writing System Declaration (WSD), which specifies the language, the writing system, and the coded character set used.

The characters available in a writing system may be specified in the WSD for that writing system in one or more ways; where necessary, individual characters are formally declared within the WSD.

The writing system declaration is one of a set of auxiliary documents which provide documentation relevant to the processing of TEI texts. Auxiliary documents are themselves SGML documents, for which document type declarations are provided. The DTD for the Writing System Declaration is discussed in detail in chapter here. Standard Writing System Declarations are provided in chapter here.

Notes

1. It should be noted that this definition makes sense only within a specific writing system, which provides the framework within which the existence or absence of differences can be established.

2. The character encoding standards defined by ISO as `ISO/IEC 10646-1:2000' and by the Unicode Consortium as the `Unicode Standard' are identical for most practical purposes. Wherever ISO 10646 or Unicode is mentioned below, the other standard is meant to be included as well.

3. See Extensible Markup Language (XML) 1.0, Section 2.2 Characters.

4. This section gives only a very short overview of those principles relevant to the current discussion. It is highly recommended to refer to the web site of the Unicode Consortium or to peruse the book The Unicode Standard Version 3.0. This was the current major edition available in print at the time of this writing and is used in references. Two minor revisions have since been made; they are documented at the web site of the Unicode Consortium at http://www.unicode.org. The version number of the online edition is 3.1.1.

5. This document was written by Martin Dürst and Asmus Freytag; at the time of this writing, Revision 5 (December 15, 2000) was available at http://www.unicode.org/unicode/reports/tr20/tr20-5.html and http://www.w3.org/TR/2000/NOTE-unicode-xml-20001215/.

6. See ISO/IEC 15285:1998 Information technology — An operational model for characters and glyphs.

7. It should be noted that not all characters introduced for compatibility with other standards are labeled as such in the database. See Unicode in XML and other Markup Languages, Section 4, for additional information.

8. See Unicode Technical Report #15 at http://www.unicode.org/unicode/reports/tr15/

9. Available at http://www.w3.org/TR/charmod, see section 4.2 Definitions for W3C Text Normalization.

10. Codes for the Representation of Names of Languages-Part 2: Alpha-3 Code, ([Geneva]: International Organization for Standardization, 1998).

11. The list of language codes is available from the registration authority for ISO 639-2, the Library of Congress in Washington, D.C. at http://lcweb.loc.gov/standards/iso639-2/langhome.html.

12. This database is accessible online at http://www.ethnologue.com/language_code_index.asp at the time of this writing.

13. Some operating systems allow the selection of character sets (sometimes together with the language used in the interaction with the user). On UNIX and GNU/Linux systems, this might be done by specifying an appropriate value for the `locale'; on recent versions of MS Windows, it can be specified through the `Regional Options' in the Control Panel, while MS DOS allows the setting of the code page with the mode command. Separately from the character set used by the operating system's processing environment, applications sometimes also offer mechanisms to specify a character set when reading and/or writing files.

14. The most widely used such entity set is to be found in Annex D of ISO 8879; it is also reproduced or summarized in most SGML textbooks, notably Charles F. Goldfarb, The SGML Handbook (Oxford: Clarendon Press, 1990). A list of some frequently used standard entity names may be found in chapter here. Extensive entity sets are being developed by the TEI; others are documented in the fascicles of ISO/TR 9573: Technical Report: Information processing --- SGML support facilities --- Techniques for using SGML ([Geneva]: ISO, 1988 et seq.).

15. Strictly speaking, the semicolon is not always required in SGML; for details see any full treatment of SGML. In XML, however, it is always required.

16. ISO 639: 1988, Code for the representation of names of languages ([Geneva]: International Organization for Standardization, 1988). The most recent version of this standard supplies three-letter codes as well as the earlier system of two-letter codes. Either may be used in TEI documents.

17. Standard methods for character-set shifting are defined in ISO 2022: 1986, Information processing --- ISO 7-bit and 8-bit coded character sets --- Code extension techniques, 3d ed. ([Geneva]: International Organization for Standardization, 1986). A standard set of control functions, including methods of specifying script direction (left-right, right-left, top-down, etc.), is defined by ISO 6429: 1992, Information processing --- Control functions for 7-bit and 8-bit coded character sets, ([Geneva]: ISO, 1992). These and related standards have seen some usage in East Asia, but they have rarely been implemented fully. As of today, they are largely superseded by the Unicode standard and ISO 10646.

