XML vocabulary design and specification

Using W3C XML Schema 1.0

C. M. Sperberg-McQueen

5 December 2007


Welcome and overview


* Asterisks in the slides mean that annotation or interpretation is needed; if I don't provide it, ask.


I'm indebted to many others:
  • Many individuals have discussed general problems with me, and many have provided specific material included here: Elaine Brennan, Robin Cover, David Fallside, Michael Hahn, Dave Hollander, Deborah A. Lapeyre, Eve L. Maler, Murray Maloney, Jeni Tennison, Henry S. Thompson, Tommie Usdin, Ann Wrightson
  • My colleagues at W3C and in the W3C XML Schema Working Group have educated me long and painfully.
  • The slides describing document analysis are adapted from slides developed by Mulberry Technologies; used by kind permission of Mulberry Technologies.

Workshop overview

(I) Markup and modeling

The essence of markup

  • Information has structure
  • Markup is used to make that structure explicit
  • XML makes structure explicit by using explicit delimiters on everything
What makes a marked-up document valuable is the added value that the markup brings. Without markup, the computer can't tell that This is Emphasized and This is a Book Title. A human can, and needs to let the computer know.
-Yuri Rubinsky, SGML on the Web (1997)

Markup and modeling

  • Telling the computer what's what
  • But wait — what is what?
  • What things exist? What kinds of things?
  • What relations exist among things?
To mark up our data, we must understand our world.
The best markup simply tells the truth* about the data.

Modelling is hard

  • You can't model everything (but where to stop?)
  • The map is not the territory — and shouldn't be!
  • Markup is for communication
  • Different people may disagree

Modelling the world as we see it

Models will differ
  • Different names for ‘same’ thing
  • Different specificity
    Is this licensing information a Contract? an Agreement? a License? a License Agreement? a set of Legal arrangements? a set of Conditions? a Policy?
  • Different views of who's involved
    “Bob buttered the bread.” “The bread was buttered.” “Bob buttered the bread with a knife.” “Bob buttered the bread with a knife using Land o' Lakes brand.” “At 12:23, Bob buttered the bread with a knife using Land o' Lakes brand.” “Event E247 is an event. The agent was Bob. The patient (object) was the bread. The instrument[1] was some butter (B231). B231 is Land o' Lakes brand butter. The instrument[2] was a knife.”

The essence of vocabulary design

Should you design a language?

Tim Bray suggests some reasons why not:
  • It's not always easy or fun
  • Pass/fail* ratio is low
  • Software pain
  • Network effects
  • Opportunity cost (time lost)
  • You can do it with XHTML (and microformats), or DocBook, or ODF, or UBL, or Atom

Should you design a language? (2)

And yet:
  • Can be very rewarding
  • Competitive advantage
  • Software leverage
  • Network effects
  • Cost of bad fit
  • Microformats are also vocabularies
    (which interact with their host vocabularies in intricate ways)
  • Often you can't do it with an off-the-shelf language

(II) What is information analysis?

Information analysis (or: document analysis) is the process of deciding systematically:
  • What’s relevant in your data
  • What’s useful in your data
  • What to identify in your data
  • What’s the framework/scaffolding supporting what you need
  • What constraints should be obeyed by the data

Why is analysis essential?

  • If you design your own vocabulary, what needs to be in it?
  • If you use an off-the-shelf language, what parts of it do you need?
  • If you adapt (cut down, extend, modify) an off-the-shelf language, what must you omit / add / change?

What’s relevant in your data?

What’s useful in your data?

Design a framework/scaffolding

Establish data constraints to

Constraints to ensure useful data

Constraints to ensure clean data

What you do during analysis

Who does it? Expert-based analysis

What’s wrong with this picture?

Only users know

Experts/consultants know

User-based analysis

Users construct solution, not just provide input

Facilitated analysis workshop

Who participates in analysis?

(V) Soundness, validity, formal rules, grammars, schema languages

Why define a language formally?

Consider a straightforward XML document:
<title>The duck</title>
<author>Ogden Nash</author>
<line>Behold the duck.</line>
<line>It does not cluck.</line>
<line>A cluck it lacks.</line>
<line>It quacks.</line>
<line>It is especially fond</line>
<line>Of a puddle or pond.</line>
<line>When it dines or sups</line>
<line>It bottoms-ups.</line>

When errors leap to the eye

Even if the data are meaningless, some errors are obvious:
<author>Btqra Anfu</author>
<line>Orubyq gur qhpx.</line>
<line>Vg qbrf abg pyhpx.</line>
<line>N pyhpx vg ynpxf.</line>
<line>Vg dhnpxf.</line>
<title>Gur qhpx</title>
<line>Vg vf rfcrpvnyyl sbaq</line>
<line>Bs n chqqyr be cbaq.</line>
<line>Jura vg qvarf be fhcf</line>
<line>Vg obggbzf-hcf.</line>

What the computer sees

What the computer sees, however, is less clear. This document is well-formed, but has several typos.
<gvgyr>Gur qhpx</gvgyr>
<nhgube>ol Btqra Anfu</nhgube>
<yvar>Orubyq gur qhpx.</yvar>
<yvar>Vg qbrf abg pyhpx.</yvar>
<yvar>N pyhpx vg ynpxf.</yvar>
<yvar>Vg dhnpxf.</yvar>
<yyar>Vg vf rfcrpvnyyl sbaq</yyar>
<yyar>Bs n chqqyr be cbaq.</yyar>
<yyar>Jura vg qvarf be fhcf</yyar>
<yvar>Vg obggbzf-hcf.</yvar>
Can you see them?

So, why define formally?

  • Precision
  • Explicitness
  • Mechanical validation
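
As a rough illustration of mechanical validation, the sketch below checks the garbled document from the "What the computer sees" slide against a list of declared element names. The declared set, the wrapper element `doc`, and the function name are assumptions made for this example; they are not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: the three element names the document is meant to use.
DECLARED = {"gvgyr", "nhgube", "yvar"}

FRAGMENT = """
<gvgyr>Gur qhpx</gvgyr>
<nhgube>ol Btqra Anfu</nhgube>
<yvar>Orubyq gur qhpx.</yvar>
<yvar>Vg qbrf abg pyhpx.</yvar>
<yvar>N pyhpx vg ynpxf.</yvar>
<yvar>Vg dhnpxf.</yvar>
<yyar>Vg vf rfcrpvnyyl sbaq</yyar>
<yyar>Bs n chqqyr be cbaq.</yyar>
<yyar>Jura vg qvarf be fhcf</yyar>
<yvar>Vg obggbzf-hcf.</yvar>
"""

def undeclared_elements(fragment, declared):
    """Return (name, text) for each child element whose name is not declared."""
    # The fragment has no single root element, so wrap it in one for parsing.
    root = ET.fromstring("<doc>" + fragment + "</doc>")
    return [(child.tag, child.text) for child in root if child.tag not in declared]

errors = undeclared_elements(FRAGMENT, DECLARED)
for name, text in errors:
    print(f"undeclared element <{name}>: {text}")
```

A check like this finds misspelled element names without understanding the content; errors inside the text itself still need a human reader.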

The Iron Law

Garbage* in, garbage out.
Three questions:
  • Can errors exist? or is every string of bits a possible message?
  • Can errors be found ...
    • automatically?
    • by clerical inspection?
    • through inspections by highly trained experts?
  • Is the cost of undetected errors ...
    • trivial?
    • small?
    • large?
    • catastrophic?

Can I define a vocabulary without a formal definition?

Of course you can! Examples:
  • Cobol, Fortran I (before BNF)
  • LaTeX, GML
  • HTML*
Can your application live without automatic detection of dirty data?

Validation as classification

Validation is a filter:
  • Distinguish
    • Valid
    • Invalid
In other terms
  • Test:
    • Member of set of valid messages?
    • No?

Validation as filter

What is a grammar?

  • Language as set of sentences
  • Grammar as definition of that set
    • Numerous ad hoc rules
    • Generative grammars (since 1956)

What is a generative grammar?

Informally, a recipe:
  1. Take a set of rewrite rules of the form left-hand-side = right-hand-side.
  2. Put a start symbol (e.g. expression) into a buffer.
  3. Choose some symbol* in your buffer.
  4. Find it on the left hand side of a rule.
  5. In your buffer, erase the symbol and insert the right hand side of the rule.
  6. Any symbols left?
    • If yes, then go to step 3.
    • If not, you're done. The contents of your buffer are a sentence in the language defined by the grammar given by the rules of step 1, the start symbol, and the sets of terminal and non-terminal symbols.

Example grammar

expression = number operator number
operator = '+'
operator = '-'
number = DIGIT 
number = number DIGIT
number = '(' expression ')'
DIGIT = '0' 
DIGIT = '1' 
DIGIT = '2'  
DIGIT = '9'
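
The recipe can be tried out on this grammar with a few lines of Python. This is a minimal sketch, not a reference implementation: it always rewrites the leftmost rewritable symbol, it includes only the DIGIT alternatives listed above, and the `budget` parameter (a device to force the derivation to end) is an assumption of the example.

```python
import random

# Rules of the example grammar: left-hand side -> list of right-hand sides.
# Only the DIGIT alternatives listed on the slide are included here.
RULES = {
    "expression": [["number", "operator", "number"]],
    "operator": [["+"], ["-"]],
    "number": [["DIGIT"], ["number", "DIGIT"], ["(", "expression", ")"]],
    "DIGIT": [["0"], ["1"], ["2"], ["9"]],
}

def derive(start="expression", seed=0, budget=50):
    """Follow the recipe and return one sentence of the language."""
    rng = random.Random(seed)
    buffer = [start]                                   # step 2
    steps = 0
    while True:
        # Step 3: pick a symbol that has a rule (here, the leftmost one).
        positions = [i for i, s in enumerate(buffer) if s in RULES]
        if not positions:                              # step 6: nothing left
            return "".join(buffer)
        i = positions[0]
        options = RULES[buffer[i]]                     # step 4
        # Assumption: after `budget` rewrites, always take the first rule,
        # so that the derivation is guaranteed to terminate.
        rhs = options[0] if steps >= budget else rng.choice(options)
        buffer[i:i + 1] = rhs                          # step 5
        steps += 1

print(derive(seed=1))
```

Different seeds give different sentences; every one of them is built only from the terminal symbols of the grammar.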

A derivation

Or, more conventionally, “(75453+3225)-(2600+5497)”

What good is a generative grammar?

  • Simple mechanism
  • Easy to reason about
  • Reversible* (given sequence of characters, is there a derivation for it?)
    • Recognition
    • Parsing
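
Recognition for the example grammar can be sketched as a small recursive-descent recognizer. The left-recursive rule for number is rewritten here as (DIGIT or a parenthesized expression) followed by any number of DIGITs, which accepts the same strings. The `digits` parameter is an assumption of this sketch: its default holds only the DIGIT alternatives listed on the grammar slide, and the last call passes the full range 0-9 explicitly.

```python
def recognize(text, digits="0129"):
    """Return True if `text` can be derived from `expression`."""
    pos = 0

    def number():
        nonlocal pos
        # number = (DIGIT | '(' expression ')') DIGIT*
        if pos < len(text) and text[pos] in digits:
            pos += 1
        elif pos < len(text) and text[pos] == "(":
            pos += 1
            if not expression():
                return False
            if pos >= len(text) or text[pos] != ")":
                return False
            pos += 1
        else:
            return False
        while pos < len(text) and text[pos] in digits:
            pos += 1
        return True

    def expression():
        nonlocal pos
        # expression = number operator number
        if not number():
            return False
        if pos >= len(text) or text[pos] not in "+-":
            return False
        pos += 1
        return number()

    # The whole input must be consumed for the text to be a sentence.
    return expression() and pos == len(text)

print(recognize("12+9"))
print(recognize("12+"))
# Assumption: DIGIT is taken to cover 0 through 9 for this call.
print(recognize("(75453+3225)-(2600+5497)", digits="0123456789"))
```

A parser would do the same walk but also record which rule was used at each point.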

Grammar as filter / set recognizer (1)

Grammar as filter / set recognizer (2)

Grammar as filter / set recognizer (3)

Document grammars

  • Documents have grammars, too
    • document = front body back
    • purchase-order = billing-info shipping-info details
  • In XML,
    • Structure is clear even without schema
    • Schemas help check correctness
    • Contrast schemas for databases (or SGML)
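
A document grammar rule such as the purchase-order rule above can be checked mechanically. The sketch below is one simple way to do it, assuming the rule is written as a regular expression over child element names; the function name and the sample documents are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide:
#   purchase-order = billing-info shipping-info details
PURCHASE_ORDER = r"billing-info shipping-info details"

def children_match(xml_text, model):
    """True if the root's child element names, in order, match the model."""
    root = ET.fromstring(xml_text)
    names = " ".join(child.tag for child in root)
    return re.fullmatch(model, names) is not None

good = "<purchase-order><billing-info/><shipping-info/><details/></purchase-order>"
bad = "<purchase-order><shipping-info/><billing-info/><details/></purchase-order>"

print(children_match(good, PURCHASE_ORDER))  # True
print(children_match(bad, PURCHASE_ORDER))   # False
```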

What is validation?

Separating the valid from invalid.

Judgment day

Sheep on the right, goats on the left.

Drawing the line

And a great gulf shall be established between them.

When documents don’t cluster

Dr. Crustes, Dr. Pro Crustes, please call your office

Some ways XSD 1.0 is different

Some important grammar styles

  • Waterloo grammar: anything goes anywhere, as long as it's declared
  • Single-element vocabulary
    • Every element named “e”
    • Distinguish by type attribute
    • Often proposed as a ‘reductio ad absurdum’...
    • ... but consider Microsoft's Office format

(VI) Basics of XSD

XML Schema 1.0 in ten slides

1 Schemas and validity

A schema defines a set of documents:
  • the ones with all the necessary information
  • the ones which convey meaning to the application
  • the ones the software knows how to process correctly
  • the ones the stylesheet knows how to handle gracefully
A schema defines a contract that goes beyond XML well-formedness:
  • I agree to accept any valid document.
  • You agree to send only valid documents.
  • The software promises to process valid documents correctly.
  • We promise not to invoke the software on invalid documents.
  • The stylesheet can be checked to make sure it handles all valid documents.

1′ Schemas and validation

Validation takes two inputs:
  • a schema, and
  • an input XML document
and produces one output:
  • an output document annotated with type information and validation results, the post-schema-validation infoset, or PSVI

2 Schemas, schema documents, components

A schema is an abstract object: a set of schema components.
We can define schema components in XML schema documents ... or through other means.
To validate, we:
  1. identify the element to validate (often the root element of a document);
  2. find or construct a schema;
  3. check the validity of that element and its descendants;
  4. annotate the input infoset with validity and type information.

3 Validation outcomes

It's not a black-and-white picture: there are shades of gray.
  • valid, invalid, not known
  • fully validated, not validated at all, partially validated
  • validation information on each element and attribute

4 Simple types

XML Schema 1.0 predefines a large set of basic datatypes:
  • numbers: decimal, float, double, integer, int, positiveInteger, negativeInteger, ...
  • dates and times: dateTime, date, time, gYearMonth, gYear, gMonthDay, gMonth, gDay
  • durations
  • binary data: hexBinary, base64Binary
  • boolean
  • Unicode strings
  • miscellaneous: anyURI, QName, NOTATION
For each, XML Schema defines a lexical space, which maps to a value space.
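
The distinction between lexical space and value space can be illustrated in Python. In this sketch, several different character strings (lexical forms) map to the same value; the helper name `boolean_value` and the lookup table are assumptions of the example, and Python's `Decimal` stands in for the value space of the decimal type.

```python
from decimal import Decimal

# Lexical forms of the boolean type and the values they denote.
XSD_BOOLEAN = {"true": True, "1": True, "false": False, "0": False}

def boolean_value(lexical):
    """Map a boolean lexical form to its value; raise KeyError otherwise."""
    return XSD_BOOLEAN[lexical.strip()]

# Two lexical forms, one value.
print(boolean_value("true") == boolean_value("1"))

# Three lexical forms of the same decimal value.
print(Decimal("1.0") == Decimal("1.00") == Decimal("+1"))
```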

5 Facets

Users can define their own simple types by constraining one or more facets of existing types:
  • min- and maxInclusive, -Exclusive
  • enumeration of values
  • length (absolute, minimum, maximum)
  • totalDigits, fractionDigits
  • pattern (constrains the lexical space)
 <xsd:simpleType name="grade">
  <xsd:restriction base="xsd:integer">
   <xsd:minInclusive value="0"/>
   <xsd:maxInclusive value="100"/>
  </xsd:restriction>
 </xsd:simpleType>
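
What the grade type above asks of a value can be spelled out in Python. This is only a sketch of the two checks involved (the lexical form of an integer, then the minInclusive and maxInclusive bounds); the function name `is_grade` is invented for the example, and a real schema processor does considerably more.

```python
import re

# Lexical form of an integer: an optional sign followed by decimal digits.
INTEGER_LEXICAL = re.compile(r"[+-]?[0-9]+")

def is_grade(lexical, min_inclusive=0, max_inclusive=100):
    """True if `lexical` is an integer within the inclusive bounds."""
    if INTEGER_LEXICAL.fullmatch(lexical) is None:
        return False          # not in the lexical space of integer
    return min_inclusive <= int(lexical) <= max_inclusive

for candidate in ["100", "101", "-1", "85.5", "+7"]:
    print(candidate, is_grade(candidate))
```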

6 Examples of simple types

Simple types can be used in declaring elements:
<xsd:element name="orderDate" type="xsd:date"/>
<xsd:element name="partnumber" type="my:partnum"/>
or in the document instance:
<my:shipDate xsi:type="xsd:date">2006-12-04</my:shipDate>

7 Complex types

Information has complex structure; complex types model that structure:
  • content model: What child elements can elements of this type have, in what order, with what types?
  • attributes: What attributes can elements of this type have, with what types?
  • global and local elements
  • global and local types
  • wildcards

8 Example of a complex type

<xsd:complexType name="address">
  <xsd:sequence>
    <xsd:element name="name"   type="xsd:string"/>
    <xsd:element name="street" type="xsd:string"/>
    <xsd:element name="city"   type="xsd:string"/>
    <xsd:element name="state"  type="my:state-or-province"/>
    <xsd:element name="zip"    type="my:zipcode"/>
  </xsd:sequence>
</xsd:complexType>

9 Schemas and namespaces

10 Writing and deploying schemas

(IX) Review and conclusion

Make it yours

The more you capture the truth of the data, as you see it,
the less you will be locked into a single application,
and the more your data will be useful to others.
The more carefully you document your analysis and markup for others,
the more useful it will be to you.