Constrain early and often:

XML Schema and the definition of document grammars for XML vocabularies

Fraunhofer

Institut für integrierte

Publikations- und Informationssysteme

C. M. Sperberg-McQueen

11 October 2001

Abstract

In the historical development of markup languages, few innovations have been more important than the introduction of the notion of document grammars for constraining documents and defining document types. Both SGML (ISO 8870) and XML 1.0 define a specialized notation (the DTD) for defining document grammars; more recently a number of alternative languages have been proposed. The W3C XML Schema language replicates the essential functionality of DTDs, and adds a number of features: the use of XML instance syntax rather than an ad hoc notation, clear relationships between schemas and namespaces, a systematic distinction between element types and data types, and a single-inheritance form of type derivation. In this talk, Michael Sperberg-McQueen, who serves as co-chair of the W3C XML Schema Working Group, will discuss the fundamental design issues of schema languages, and the choices made by the W3C group in defining XML Schema.

I.1. Overview
II.1. Some models of text
II.2. Tree-shaped models
II.3. Models of marked-up text
II.4. COCOA tags
II.5. Document grammars (DGs)
II.6. The uses of DGs
III.1. DTDs as DGs
III.2. A document grammar
III.3. A DTD
III.4. A limerick
III.5. A canzone
III.6. Note on the canzone DTD
III.7. Removing non-terminals
III.8. The canzone minus explicit Aufgesang
III.9. The canzone minus NTs
III.10. Removing all non-terminals
IV.1. XML Schema
IV.2. The canzone schema v.1
IV.3. Declaring elements
IV.4. Declaring elements
IV.5. Positive closure
IV.6. Character data
V.1. Supporting programming-language and dbms paradigms
V.2. The tag/type distinction
V.3. Top-level named complex types
V.4. Top-level complex types
V.5. Anonymous types
V.6. Simple datatypes
V.7. Built-in primitive datatypes
V.8. Built-in derived datatypes
V.9. What is an atomic?
V.10. Non-atomic simple types
V.11. Inheritance Type derivation
V.12. Inheritance in document systems
VI.1. Schemas and namespaces
VI.2. Schema layers
VI.3. Modularization
VI.4. Linking document and schema
VI.5. Post-schema-validation infoset (PSVI)
VI.6. Non-local effects
VI.7. Non-local effects in XML Schema
VI.8. Determinism

I. Overview

I.1. Overview

overview
the idea of document grammars
DTDs as document grammars
XML Schema
design issues and research questions

II. The idea of document grammars

II.1. Some models of text

Rectangular:

text   ::= record
record ::= CHAR*    // or CHAR{80}

Linear
```
text ::= CHAR*
```

II.2. Tree-shaped models

Unlabeled tree:

text   ::= node*
node   ::= node* | CHAR*

Fixed-depth, fixed-label tree:

text   ::= page*
page   ::= line* 
line   ::= CHAR*

II.3. Models of marked-up text

ODTAO:
```
text ::= (CHAR | COMMAND)*
```
This only looks regular.

Pseudo-grammars:

command ::= '\begin{' NAME '}'
          | '\end{' NAME '}'

command ::= ':' NAME '.'?
          | ':e'' NAME '.'?

command ::= '<' NAME attributes '>'
          | '</'' NAME '>'

II.4. COCOA tags

Virtual variables:
```
command ::= '<' CHAR value '>'
```

II.5. Document grammars (DGs)

Origin pragmatic, not theoretical.

Later aligned (partially) with language theory.

Formal specification of validity rules → automated validation. (Cf. Algol vs. Fortran.)

Distinction between document type definition (DTD) and “the set of effective formal declarations”.

II.6. The uses of DGs

Document grammars may have several uses:

in the struggle against dirty data
as documentation of a contract between data provider and data consumer
as documentation of the content of data flows
as specification of client/server protocols

III. DTDs as document grammars

III.1. DTDs as DGs

SGML/XML DTDs resemble Backus-Naur Form grammars, but:

They describe bracketed languages* ...
... so ‘non-terminals’ are visible*.
SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
They are not purely grammatical (notations, entities).
Determinism rule.

III.2. A document grammar

Limericks and canzone:

poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen   ::= line+
abgesang  ::= line+

III.3. A DTD

Limericks and canzone:

<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >

III.4. A limerick

<limerick>
  <trimeter>
    There was a young lady named Bright
  </trimeter>
  <trimeter>
    whose speed was much faster than light.
  </trimeter>
  <dimeter>She set out one day,</dimeter>
  <dimeter>in a relative way,</dimeter>
  <trimeter>
    and returned on the previous night.
  </trimeter>
</limerick>

III.5. A canzone

<canzone>
  <aufgesang>
    <stollen>
      <l>unter den linden an der heide</l>
      <l>da unser zweier bette was</l>
    </stollen>
    <stollen>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

III.6. Note on the canzone DTD

All the non-terminals show up as tags.
The two Stollen must have same number of lines; this rule is not expressed.
The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.

III.7. Removing non-terminals

<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >

III.8. The canzone minus explicit Aufgesang

<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

III.9. The canzone minus NTs

<canzone>
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
  <l>da mugt ir vinden schone beide</l>
  <l>gebrochen bluomen unde gras</l>
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</canzone>

III.10. Removing all non-terminals

<!ENTITY % stollen   "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang  "l+" >
<!ELEMENT canzone   (%aufgesang;, %abgesang;) >
<!ELEMENT l         (#PCDATA) >

ERROR: this DTD is illegal; why?

IV. XML Schema and DTD functionality

IV.1. XML Schema

DTD++, DTD--
instance syntax
supporting programming-language and database-oriented types
design problems

IV.2. The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.

At the outer level is a schema element:

<xsd:schema>
 <!--* element declarations go here *-->
</xsd:schema>

N.B. the schema does not identify a document-root element / start symbol.

IV.3. Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

IV.4. Declaring elements

Note difference between element declaration (outer) and element reference (inner).
Implicit occurrence information: min = max = 1.

IV.5. Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

IV.6. Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="l" type="xsd:string"/>

V. Supporting PL/DBMS type notions

V.1. Supporting programming-language and dbms paradigms

tag/type distinction
named and anonymous datatypes
simple datatypes

V.2. The tag/type distinction

Let's generalize from

 <xsd:element name="l" type="xsd:string"/>

Can we do that for every element type?

N.B. four kinds of type

element type (vs. element, element instance)
data type
- simple type (lexical form has no markup)
- complex type (has element children)

In the example, l may be thought of as an accessor.

V.3. Top-level named complex types

Named types can be used to capture commonalities:

 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines">
 <xsd:element name="stollen" type="lines">

V.4. Top-level complex types

... or just to provide a name for a type:

 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang">

V.5. Anonymous types

We can hide things using anonymous local types:

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Note nested declarations and definitions.

V.6. Simple datatypes

built-in
- primitive
- derived
user-defined

V.7. Built-in primitive datatypes

string
boolean
number, float, double
duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth
hexBinary, base64Binary
anyURI
QName
NOTATION

V.8. Built-in derived datatypes

normalizedString, token, language
IDREFS, ENTITIES, NMTOKEN, NMTOKENS, Name, NCName, ID, IDREF, ENTITY
integer, nonPositiveInteger, negativeInteger, long, int, short, byte, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, positiveInteger

V.9. What is an atomic?

Extensional:

a set of values V
a set of lexical forms L
a mapping from L to V

Intensional:

a base mapping L → V
a set of fundamental facets:
- equality (identity)
- order (partial, total, none)
- boundedness
- cardinality
- numeric
a set of constraining facets:
- length, minLength, maxLength
- pattern (constrains lexical space)
- enumeration
- whiteSpace
- maxInclusive, maxExclusive, minInclusive, minExclusive
- totalDigits, fractionDigits

V.10. Non-atomic simple types

list (white-space delimited)
unions (ordered)

V.11. Inheritance Type derivation

It turns out to be hard to model stepwise refinement of types:

restriction (preserves subset semantics)
extension (preserves prefix semantics)

V.12. Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.

inheritance of attributes
inheritance of locations

VI. Design problems and research questions

VI.1. Schemas and namespaces

Some (unpleasant) facts of life:

Namespaces allow us to distinguish mine from not-mine.
Namespaces do not provide universal names.
The namespace : language relation is 1:n.
The language : grammar relation is 1:n.
Therefore, the namespace : schema relation is 1:n.

Live with it.

VI.2. Schema layers

We distinguish:

schema documents (with single target namespace)
schemas (sets of abstract components)

Schema composition operations:

import
include
include with override / redefine

VI.3. Modularization

XML Schema makes it possible to write modular document type definitions:

late collection of schema components
namespace-aware name matching, validation
white-box wildcards (lax / opportunistic)
black-box wildcards (skip)

VI.4. Linking document and schema

namespace name
schemaLocation hint

VI.5. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.

additions, no changes
type assignment information
validation-attempted information (strict, lax, skip)
validation-outcome information

VI.6. Non-local effects

Consider the HTML input element:

legal only in p and similar elements
legal only within form elements

SGML DTDs have partial solutions:

inclusion exceptions
content models

VI.7. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:

 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).

VI.8. Determinism

The determinism rule remains controversial:

LL(1) guarantees may help implementors
All regular languages have a deterministic FSA;
... but not necessarily a deterministic regular expression!
Implications for closure under union, intersection.
Implications for subsumption tests.