Problems in SGML for Research


C. M. Sperberg-McQueen,
University of Illinois at Chicago

15 November 1988

1. What Is the TEI?

1.1. Origins

The TEI originated in the literary and linguistic computing community. It drew immediate interest from other areas: computational linguistics, for example, is in a stage of massive database development and is concerned with the reusability of data across theoretical boundaries.
Those affected by the project are
  • researchers (esp. humanists but also computational linguists)
  • publishers and industry
  • software developers
  • data archivists
  • funding agencies

1.2. Goals

  1. It should specify a common interchange format for machine readable texts.
  2. It should provide a set of recommendations for encoding new textual materials.
  3. It should document the major existing encoding schemes, and investigate the feasibility of developing a metalanguage in which to describe them.
  4. It must be a set of guidelines, not a set of rigid requirements.
  5. It must be extensible.
  6. It should be device- and software-independent.
  7. It should be language-independent.
  8. It should be application-independent.
We are conscious of a number of tradeoffs:
  • in standardizing notation one risks standardizing the thinking
  • there's a long way from the classicist in the garret to the multi-million-dollar machine-translation project. We have to keep things simple for the poor scholar, expressive for the team with programmers to spare.
  • we want rigorously defined standards, but they should be clear and expressive. (Enough rigor will render anything unreadable.)

1.3. Organization

Sponsorship by ACH, ALLC, ACL.
Funding is from NEH, EEC, Mellon.
Participation by 15 other organizations.
Steering Committee, Advisory Board, Editors, Working Committees.

2. Markup Problems

2.1. General problems

  1. We need a single interchange vocabulary
  2. Esoteric applications require late binding, revised DTDs
  3. Can you write an algorithm-independent data structure?
  4. SGML poses thorny design issues

2.2. SGML problems

  1. SGML wants a single hierarchy. But we have more than one. We don't even know how many we have.
  2. SGML wants a hierarchy. But we need to be able to mark “the passage that seems to echo Vergil” and “the passage the beer spilled on” at the same time.
  3. SGML wants descriptive markup. But we don't always know why the presentation is what it is.
  4. SGML wants a DTD. But it isn't clear that a finite number of DTDs will handle documents of the range we are talking about. Anyone today will put the tag dedication in the frontmatter section of a DTD, and only there. They'd be wrong, for 17th-century verse. The collected works of a 17th-century poet may have a dedication for the volume, dedications of the individual books or sections, dedications of individual cycles or groups in those books, and dedications of individual poems.
  5. SGML is aimed at document production. In some places, you can say “If the document conflicts with the DTD, then there is an error in the document.” With historical texts, the situation is more likely to be: “If the document conflicts with the DTD, then there is an error in the DTD.”
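
Items 1 and 2 above can be made concrete with a small sketch. It is not part of SGML or of any TEI proposal; it only illustrates why two independently motivated passages may refuse to fit in one tree. The Span type, the labels, and the character offsets are all invented for the illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    """A labeled half-open character range [start, end). Hypothetical type."""
    label: str
    start: int
    end: int


def nests(a: Span, b: Span) -> bool:
    """True if the two spans are disjoint or one contains the other."""
    disjoint = a.end <= b.start or b.end <= a.start
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return disjoint or a_in_b or b_in_a


def fits_single_hierarchy(spans: Sequence[Span]) -> bool:
    """Spans can share one tree only if every pair of them nests."""
    return all(nests(a, b) for i, a in enumerate(spans) for b in spans[i + 1:])


# Invented offsets, purely for illustration.
line = Span("line", 0, 80)
echo = Span("echo-of-vergil", 10, 40)
stain = Span("beer-stain", 30, 60)

print(fits_single_hierarchy([line, echo]))         # echo sits inside line
print(fits_single_hierarchy([line, echo, stain]))  # echo and stain overlap
```

The first call reports that the spans fit one hierarchy; the second reports that they do not, because the echo and the stain partly overlap and neither contains the other. Any markup scheme that insists on a single tree has to break one of the two passages into pieces or record it by some other means.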

3. Why Should Industry Care about the TEI?

Why should you care about this? Well, in the SGML revolution, the research community are the Jacobins or the Bolsheviks. SGML attempts the liberation of electronic texts from paper output, but it takes a while to shake your thoughts free. The research community, by contrast, has never been fixated on ink on paper: texts have always appeared to researchers as complex, multi-leveled cultural and linguistic objects that exhibit a lot of regularity but also a tremendous variety of form.
Also, whether it's obvious or not: our problems are your problems, and your problems are our problems. Most industrial firms do not much care about the textual criticism of the First Folio, but they do face serious problems of version control, which take the same form in text applications. You may not care about literary allusion, but subject indexing has many of the same problems. You may not care about the problems of theoretical diversity, but the same problems arise in trying to mediate among conflicting models in page description languages.