What is Document Analysis?
Document analysis includes:
- gathering information used in a formal description of the
- studying the content and structure of the documents:
- identifying and naming the components of some class of documents
- specifying their interrelationships
- naming their
No serious project to produce electronic texts should try to skip
the document analysis phase.
Why is Document Analysis Essential?
Because you must know:
- if you use the TEI encoding scheme, which parts of the TEI
do you need?
- if you extend the TEI scheme, what must you add?
- if you design your own SGML markup, what needs to be in
the document type definition?
- if you don't use SGML, what aspects of your documents will
you mark, and how?
Steps in Document Analysis
- Define the environment:
  - your requirements
  - external requirements
  - the document universe
  - the set of document types
- Define the textual features you care about.
- Identify the relationships among the features.
- Enrich the collection of text
Step 1a: Define Your Goals
- goals and objectives
- internal and external constraints
- intended and foreseeable uses of information
- functions of the SGML/XML application:
  - publishing (paper, CD-ROM, network)
  - database retrieval
  - hypertext navigation
  - electronic review and comment
  - document interchange
  - etc.
Step 1b: Identify Relevant
Step 1c: Document Universe
What documents are you talking about?
- what is, or what should
- many similar items (Taisho Tripitaka, News)?
- or one unique item (the Oxford English Dictionary)?
- what information in which documents?
- how many different kinds of documents?
Step 1d: Gathering Samples
Construct a set of samples to analyze, including:
- typical samples as well as special cases:
  - unusual but within bounds
  - off the wall
- short examples and long ones
- current documents and old ones
- not just printed samples
- all the parts, all associated
Step 2: Define Features
What is a feature?
- large structural units (table of contents, body, front
matter, chapter, ...)
- smaller structural units within the larger (headings,
figures, lists, ...)
- non-structural units conveying specialized information
(italics, hypertext links, people's names, dates, topical
keywords, technical terms, ...)
How Big is a Feature?
As large or small as it needs to be. A feature might be:
- the entire dictionary
- a section of the dictionary
- an entry in the dictionary
- the head-word in the dictionary
- a syllable within the head-word
Not all features are visible in the output: status
information, internal editorial notes (one editor to the other:
‘How can you SAY
Principles of Feature Definition
Something is a good candidate for definition as a feature if:
- it looks different from the rest of the text
- it requires different processing
- you may want to find it easily later
- you may want to point at it from elsewhere
- it is a nameable part of the structure of the text
(chapter, note, quotation, ...)
- it fills a clear function in the hierarchy of the
- it is information of a specialized and interesting
How Many Features?
Academia Sinica, 128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei
- One feature:
- Two features: organization-name and
- Three features: name-and-address,
which contains organization-name and
How Many Features?
Academia Sinica, 128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei
- Eight features:
- Fifteen features: nine word and
- Five features: address,
name (type=organization), three
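The same address can be tagged at different granularities without changing a single character of the text. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The names organization-name and name-and-address come from the slide; the name address for the remaining feature is an assumption, since the original wording is cut off.

```python
import xml.etree.ElementTree as ET

ORG = "Academia Sinica"
REST = "128, Sec. 2, Yanjiuyuan lu, Nangang, Taipei"


def one_feature() -> ET.Element:
    """Treat the whole string as a single feature."""
    elem = ET.Element("address")  # assumed name for the single feature
    elem.text = f"{ORG}, {REST}"
    return elem


def three_features() -> ET.Element:
    """A container holding the organization name and the rest of the address."""
    outer = ET.Element("name-and-address")
    org = ET.SubElement(outer, "organization-name")
    org.text = ORG
    org.tail = ", "  # the separating comma stays outside both inner features
    addr = ET.SubElement(outer, "address")  # assumed name
    addr.text = REST
    return outer


for build in (one_feature, three_features):
    root = build()
    n_features = sum(1 for _ in root.iter())  # counts the root element too
    print(n_features, ET.tostring(root, encoding="unicode"))
```

Both versions carry identical text; only the number of things that can later be found, pointed at, or processed separately differs.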
Choosing a Feature Analysis
The analysis of a sample text should ideally be:
- simple enough to use
These features do
not always co-exist.
In case of doubt, choose truth over an apparently useful
Why Identify Features at All?
So you can:
- put all the technical terms into the draft glossary
- print all personal names in blue, all names of places in
- find all occurrences of the word ‘hypertext’, but only in section
headings, not in footnotes
- ensure that all announcements of public events specify the
date, time, location, and sponsor of the event, as well as
giving a description
- automatically maintain cross references and indices
- create links in hypertext applications
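Two of the tasks listed above, a search restricted to headings and a completeness check on announcements, can be sketched once the features are tagged. This is an illustrative sketch only: the element names (head, note, announcement, and the required parts) and the sample content are invented for the example and are not taken from any particular encoding scheme.

```python
import xml.etree.ElementTree as ET

# Invented sample document; all element names are hypothetical.
SAMPLE = """
<text>
  <div>
    <head>Hypertext and markup</head>
    <p>Plain prose about hypertext.<note>A footnote on hypertext.</note></p>
  </div>
  <div>
    <head>Printing</head>
    <p>No match here.</p>
  </div>
  <announcement>
    <date>12 May</date><time>14:00</time><location>Room 3</location>
    <description>Public lecture.</description>
  </announcement>
</text>
"""

REQUIRED = ("date", "time", "location", "sponsor", "description")


def headings_containing(root, word):
    """Return the text of every <head> whose content mentions `word`."""
    hits = []
    for head in root.iter("head"):
        text = "".join(head.itertext())
        if word.lower() in text.lower():
            hits.append(text)
    return hits


def missing_parts(announcement):
    """List the required child elements an announcement lacks."""
    present = {child.tag for child in announcement}
    return [name for name in REQUIRED if name not in present]


root = ET.fromstring(SAMPLE)
print(headings_containing(root, "hypertext"))
for ann in root.iter("announcement"):
    print(missing_parts(ann))
```

The paragraph and the footnote also mention the word, but they are ignored because only head elements are searched. The announcement is reported as lacking a sponsor.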
Step 3: Identify Relationships
- hierarchy of containers (a part contains chapters, which
- sequence (front matter precedes body, which precedes back matter)
- alternation: either A or B, but not both
- occurrence (occurs once? many times? optional?)
- semantic groups: collections of similar things
- syntactic groups: items which can appear in similar
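Hierarchy, sequence, alternation, and occurrence are exactly what a content model records. As a rough sketch, and not a substitute for a real DTD or schema validator, each relationship can be written as a regular expression over the names of an element's children. The element names and the rules below are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, written as regular expressions over the
# space-terminated names of an element's children.
CONTENT_MODELS = {
    "text": r"(front )?body (back )?",  # sequence, with optional parts
    "body": r"(part )+|(chapter )+",    # alternation: parts or chapters, not both
    "part": r"(chapter )+",             # hierarchy and occurrence: one or more
}


def violations(root):
    """Return the names of elements whose children break their content model."""
    bad = []
    for elem in root.iter():
        model = CONTENT_MODELS.get(elem.tag)
        if model is None:
            continue  # no rule recorded for this element
        children = "".join(child.tag + " " for child in elem)
        if re.fullmatch(model, children) is None:
            bad.append(elem.tag)
    return bad


good = ET.fromstring(
    "<text><front/><body><part><chapter/><chapter/></part></body></text>"
)
mixed = ET.fromstring(
    "<text><body><part><chapter/></part><chapter/></body></text>"
)
print(violations(good))
print(violations(mixed))
```

The first sample satisfies every rule. The second mixes a part and a chapter directly inside body, so body is reported as violating the alternation rule.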
Step 4: Enrich the Collection
- non-printing information: bibliographic and cataloging
data, subject keywords, identity of encoder, circumstances of
- control information: status tracking, routing and process
control information, confidentiality
- gaps in semantic groups: we have three reasons something
might be italic; are there more?
- gaps in syntactic groups: what else can
occur at the beginning of a new chapter?
Knowing When to Stop
What is ‘enough’ document analysis? How
can you tell you're done?
- a place for everything and everything in its place
- tag a sample: can you tag everything you see?
- have you identified everything you'll want to point at,
search for, search within, sort by, or process
- is the set of features good enough to go on with?
- does this feature set provide a good foundation for later
growth and change?
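One way to act on the "tag a sample" test is to compare the tags a trial encoding actually needed against the feature list produced so far. The sketch below assumes a hypothetical inventory and an invented sample; it reports tags that were needed but never defined, and defined features the sample did not exercise.

```python
import xml.etree.ElementTree as ET

# Hypothetical feature inventory produced by the analysis so far.
INVENTORY = {"text", "chapter", "head", "p", "name", "date"}

# Invented trial encoding of a sample passage.
SAMPLE = (
    "<text><chapter><head>Arrival</head>"
    "<p><name>Li</name> reached <place>Taipei</place>.</p>"
    "</chapter></text>"
)


def coverage(sample_xml, inventory):
    """Compare the tags used in a sample against the feature inventory."""
    used = {elem.tag for elem in ET.fromstring(sample_xml).iter()}
    return {
        "not_in_inventory": sorted(used - inventory),
        "never_used": sorted(inventory - used),
    }


print(coverage(SAMPLE, INVENTORY))
```

A non-empty not_in_inventory list suggests the analysis is not finished. A long never_used list suggests either that more samples are needed or that some features are not worth keeping.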
What is Good Enough?
Don't over-stress the applications you foresee: in the
future, other applications and problems will take their place.
But it's not enough for markup to be useful
eventually: it can and should be made useful now:
- can you search it acceptably?
- can you process it acceptably?
- can you display it on screen?
- can you print it?
- can you produce this level of markup with current
Document Analysis: Conclusions
Document analysis forces you to:
- clarify your needs and interests
- identify clearly the textual features of critical
importance for your work
It thus prepares you to:
- identify TEI tags you will need
- identify desirable or necessary extensions to the TEI
- define new SGML/XML elements if