Document Analysis


What is Document Analysis?

Document analysis includes:

  • gathering information used in a formal description of the electronic text
  • studying the content and structure of the documents:
    • identifying and naming the components of some class of documents
    • specifying their interrelationships
    • naming their properties

No serious project to produce electronic texts should try to skip the document analysis phase.

Why is Document Analysis Essential?

Because you must know:

  • if you use the TEI encoding scheme, which parts of the TEI do you need?
  • if you extend the TEI scheme, what must you add?
  • if you design your own SGML markup, what needs to be in the document type definition?
  • if you don't use SGML, what aspects of your documents will you mark, and how?

Steps in Document Analysis

  1. Define the environment:
    • your requirements
    • external requirements
    • the document universe
    • the set of document types.
  2. Define the textual features you care about.
  3. Identify the relationships among the features.
  4. Enrich the collection of text features.

Step 1a: Define Your Goals

  • goals and objectives
  • scope
  • internal and external constraints
  • intended and foreseeable uses of information
  • functions of the SGML/XML application:
    • publishing (paper, CD-ROM, network)
    • database retrieval
    • hypertext navigation
    • electronic review and comment
    • document interchange
    • etc., etc.

Step 1b: Identify Relevant Standards

Step 1c: Document Universe

What documents are you talking about?

  • what is, or what should be?
  • many similar items (Taisho Tripitaka, News)?
  • or one unique item (the Oxford English Dictionary)?
  • what information in which documents?
  • how many different kinds of documents?

Step 1d: Gathering Samples

Construct a set of samples to analyze, including:

  • typical samples as well as special cases:
    • typical
    • unusual but within bounds
    • off the wall
  • short examples and long ones
  • current documents and old ones
  • not just printed samples
  • all the parts, all associated documents

Step 2: Define Features

What is a feature?

  • large structural units (table of contents, body, front matter, chapter, ...)
  • smaller structural units within the larger (headings, figures, lists, ...)
  • non-structural units conveying specialized information (italics, hypertext links, people's names, dates, topical keywords, technical terms, ...)

How Big is a Feature?

As large or small as it needs to be. A feature might be:

  • the entire dictionary
  • a section of the dictionary
  • an entry in the dictionary
  • the head-word in the dictionary
  • a syllable within the head-word

Not all features are visible in the output: for example, status information or internal editorial notes (one editor to another: ‘How can you SAY that?’).

Principles of Feature Definition

Something is a good candidate for definition as a feature if:

  • it looks different from the rest of the text
  • it requires different processing
  • you may want to find it easily later
  • you may want to point at it from elsewhere
  • it is a nameable part of the structure of the text (chapter, note, quotation, ...)
  • it fills a clear function in the hierarchy of the text
  • it is information of a specialized and interesting type

How Many Features?

Academia Sinica, 128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei

  • One feature: name-and-address.
  • Two features: organization-name and organization-address.
  • Three features: name-and-address, which contains organization-name and organization-address.

How Many Features?

Academia Sinica, 128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei

  • Eight features: name-and-address, organization-name, street-address, house-number, street-section, street-name, district, city.
  • Fifteen features: nine words and six punctuation marks.
  • Five features: address, name (type=organization), three address-line.
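As a rough illustration of how the choice of granularity changes the encoding, the sketch below tags the same address twice and counts the resulting features. The element names are simply the feature names from the lists above used as hypothetical XML tags; they are not TEI elements. Python's standard xml.etree.ElementTree module is only one convenient way to inspect the result.

```python
import xml.etree.ElementTree as ET

# Three-feature analysis (hypothetical tag names taken from the feature lists).
THREE = (
    "<name-and-address>"
    "<organization-name>Academia Sinica</organization-name>, "
    "<organization-address>128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei"
    "</organization-address>"
    "</name-and-address>"
)

# Eight-feature analysis of the same text (tag names equally hypothetical).
EIGHT = (
    "<name-and-address>"
    "<organization-name>Academia Sinica</organization-name>, "
    "<street-address>"
    "<house-number>128</house-number>, "
    "<street-section>Sec. 2</street-section>, "
    "<street-name>Yanjiuyuan lu</street-name>"
    "</street-address>, "
    "<district>Nangang</district>, "
    "<city>Taipei</city>"
    "</name-and-address>"
)


def feature_names(xml_text):
    """Return the element names used in an encoding, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


def plain_text(xml_text):
    """Return the text of an encoding with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())


print(len(feature_names(THREE)), len(feature_names(EIGHT)))
print(plain_text(THREE) == plain_text(EIGHT))
```

Both encodings carry exactly the same text; they differ only in how many parts of it have been given a name, and therefore in what later processing can find, sort by, or point at.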

Choosing a Feature Analysis

The analysis of a sample text should ideally be:

  • true
  • useful
  • simple enough to use

These qualities do not always coexist.

In case of doubt, choose truth over an apparently useful lie.

Why Identify Features at All?

So you can:

  • put all the technical terms into the draft glossary
  • print all personal names in blue, all names of places in green
  • find all occurrences of the word hypertext, but only in section headings, not in footnotes
  • ensure that all announcements of public events specify the date, time, location, and sponsor of the event, as well as giving a description
  • automatically maintain cross references and indices
  • create links in hypertext applications
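The search for the word hypertext in section headings only can be sketched as follows. This is a minimal example under stated assumptions: the sample text and its element names (text, section, head, p, note) are hypothetical, and a real project would use whatever names its own document analysis produced.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the element names are assumptions for illustration.
SAMPLE = (
    "<text>"
    "<section><head>Hypertext and markup</head>"
    "<p>Plain prose.<note>See also hypertext systems.</note></p></section>"
    "<section><head>Printing</head>"
    "<p>Paper output and hypertext output differ.</p></section>"
    "</text>"
)


def headings_containing(xml_text, word):
    """Return the text of every section heading that mentions the word."""
    root = ET.fromstring(xml_text)
    hits = []
    # Only <head> elements that are direct children of a <section> qualify,
    # so occurrences in paragraphs and footnotes are never examined.
    for head in root.findall(".//section/head"):
        heading = "".join(head.itertext())
        if word.lower() in heading.lower():
            hits.append(heading)
    return hits


print(headings_containing(SAMPLE, "hypertext"))
```

A plain string search over the same sample would also report the occurrences in the footnote and in the body paragraph; the restriction to headings is possible only because headings were identified as a feature.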

Step 3: Identify Relationships

  • hierarchy of containers (a part contains chapters, which contain sections)
  • sequence (front matter precedes body, which precedes back matter)
  • alternation: either A or B, but not both
  • occurrence (occurs once? many times? optional?)
  • semantic groups: collections of similar things
  • syntactic groups: items which can appear in similar places
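Containment, sequence, alternation and occurrence are what a document type definition records in its content models. The sketch below imitates that idea in a few lines of Python, purely as an illustration: each content model is written as a regular expression over the names of an element's children. The element names and the particular rules (for example, that front and back matter are optional) are assumptions, not part of any real DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models. Each is a regular expression matched against
# the child element names, with every name followed by a single space.
CONTENT_MODELS = {
    "book": r"(?:front )?body (?:back )?",       # sequence; front and back optional
    "body": r"(?:(?:part )+|(?:chapter )+)",     # alternation: parts or chapters, not both
    "part": r"(?:chapter )+",                    # occurrence: one or more chapters
    "chapter": r"(?:section )*",                 # occurrence: zero or more sections
}


def violations(xml_text):
    """Return the names of elements whose children break their content model."""
    problems = []
    for element in ET.fromstring(xml_text).iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # no rule recorded for this element
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            problems.append(element.tag)
    return problems


GOOD = ("<book><front/><body><part><chapter><section/></chapter></part></body>"
        "<back/></book>")
BAD = "<book><body><part><chapter/></part><chapter/></body><front/></book>"

print(violations(GOOD))
print(violations(BAD))
```

In the second sample the front matter follows the body, and the body mixes a part with a bare chapter, so both the book and the body are reported. A validating SGML or XML parser performs this kind of check against the real document type definition.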

Step 4: Enrich the Collection

  • non-printing information: bibliographic and cataloging data, subject keywords, identity of encoder, circumstances of production
  • control information: status tracking, routing and process control information, confidentiality
  • gaps in semantic groups: we have three reasons something might be italic; are there more?
  • gaps in syntactic groups: what else can occur at the beginning of a new chapter?

Knowing When to Stop

What is ‘enough’ document analysis? How can you tell you're done?

  • a place for everything and everything in its place
  • tag a sample: can you tag everything you see?
  • have you identified everything you'll want to point at, search for, search within, sort by, or process specially?
  • is the set of features good enough to go on with?
  • does this feature set provide a good foundation for later growth and change?
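One way to make the "tag a sample" test concrete is to compare the tags an encoder actually needed against the feature inventory. The following is only a sketch: the inventory, the sample entry and all element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature inventory for a dictionary project.
FEATURES = {"entry", "head-word", "syllable", "definition"}

# A hand-tagged sample in which the encoder needed a tag the inventory lacks.
SAMPLE = (
    "<entry>"
    "<head-word><syllable>mark</syllable><syllable>up</syllable></head-word>"
    "<etymology>mark + up</etymology>"
    "<definition>Annotation added to a text.</definition>"
    "</entry>"
)


def tags_outside_inventory(xml_text, features):
    """Return, in alphabetical order, the tags used but not yet in the inventory."""
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return sorted(used - features)


print(tags_outside_inventory(SAMPLE, FEATURES))
```

An empty result for a representative set of samples suggests that the inventory has a place for everything seen so far; a non-empty result lists the features still to be analyzed.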

What is Good Enough?

Don't over-stress the applications you foresee: in the future, other applications and problems will take their place. But it is not enough for markup to be useful eventually; it can and should be useful now:

  • can you search it acceptably?
  • can you process it acceptably?
  • can you display it on screen?
  • can you print it?
  • can you produce this level of markup with current procedures?

Document Analysis: Conclusions

Document analysis forces you to:

  • clarify your needs and interests
  • identify clearly the textual features of critical importance for your work

It thus prepares you to:

  • identify TEI tags you will need
  • identify desirable or necessary extensions to the TEI encoding scheme
  • define new SGML/XML elements if necessary

Introduction to XML, Markup and the TEI Guidelines