Notes on schema-validation results

World Wide Web Consortium (Cambridge, Sophia-Antipolis, Tokyo), 2001

C. M. Sperberg-McQueen, 7 December 2001

This document describes the results of schema validation as defined in the W3C Recommendation XML Schema 1.0, and in particular the ways those results differ from the results of (DTD-based) validation as described in ISO 8879 and the XML 1.0 specification.

It is based on email sent by the author to the XML Query Working Group in June 2001.

Introduction

At 2001-05-02 10:37, C. M. Sperberg-McQueen wrote: (full text at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001May/0028.html)

Section 3.3 [of the 27 April data model] has a paragraph which reads

A "schema-invalid document" is an XML document that has a corresponding schema but whose schema-validity assessment has resulted in one or more element/attribute information items being assigned values other than 'valid' for the [validity] property in the PSVI.

I think the concept outlined here is going to be important, but I am uncomfortable with the term used for it (without being able to propose a better one at the moment). ...

XML Schema distinguishes valid nodes (on which strict assessment of validity was attempted, and found the node valid), invalid nodes (on which strict assessment was attempted, and found an error), and nodes for which the validity status is 'notKnown'. Our definition covers documents within which any element or attribute is 'invalid' or 'notKnown', which means it also covers documents within which any node was processed as a 'black box' (i.e. skipped during validation), in addition to documents within which there is some detected error. The term 'schema-invalid' will tend to suggest to the non-paranoid reader a meaning narrower than that given by the definition.
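The breadth of the quoted definition can be made concrete with a minimal Python sketch. The function name and the representation of a document as a flat list of [validity] values are assumptions made purely for illustration; neither comes from the data model draft or from XML Schema.

```python
from typing import Iterable

# The three values XML Schema defines for the [validity] property.
VALIDITY_VALUES = {"valid", "invalid", "notKnown"}


def is_schema_invalid_document(validities: Iterable[str]) -> bool:
    """Apply the quoted definition: a document is 'schema-invalid' if any
    element or attribute item has a [validity] other than 'valid'."""
    values = list(validities)
    unexpected = set(values) - VALIDITY_VALUES
    if unexpected:
        raise ValueError(f"unexpected [validity] values: {sorted(unexpected)}")
    return any(v != "valid" for v in values)


# A document with a detected error is covered, as one would expect.
with_error = ["valid", "invalid", "valid"]
# So is a document with no detected error but one skipped ('black box') item.
with_skipped = ["valid", "notKnown", "valid"]
# Only a document whose items are all 'valid' falls outside the definition.
fully_valid = ["valid", "valid"]
```

Under this reading, `with_skipped` is classified exactly like `with_error`, which is the source of the discomfort with the term 'schema-invalid'.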

We discussed this at the XML Query face to face in May and Mary suggested substituting the term "incompletely validated". Although I had suggested this term myself in my note of 2 May, I find that I am as troubled by it as by the term "invalid".

Upon consideration, I think I now know why. The fact of the matter is that XML Schema (a) provides more information about schema-validity than a single bit (valid/invalid), and (b) provides schema-validity information not just about the document as a whole but about each element and attribute. If we want the data model to cover more than only the set of schema-valid documents, I think we need (a) to make more than a binary distinction ourselves, and (b) to consider validity as a property of (and validation an operation on) elements / subtrees, not solely documents.

At the very least, if we want to go beyond fully-validated schema-valid documents, we need to come to grips with which set of documents, other than the schema-valid documents, we actually want to cover, and how.

The purpose of the rest of this note is to provide a list of the cases I think can usefully be distinguished, and note where we have decisions to make. If people agree that we need something more than a binary switch, I will be willing to attempt formulating specific language for the document-model document.

A conforming XML Schema processor provides information on (inter alia): the ancestor element at which schema-validation ('assessment') started; whether this particular element and its descendants were schema-validated or not; the result of the assessment; and the type associated with the element.

The various combinations of values of the [validation attempted], [validity], and [type definition] properties can usefully distinguish several cases: eight by my count. A diagram showing the various combinations of [validation attempted] and [validity] is at http://www.w3.org/XML/Group/2001/06/validity-outcomes if that helps.

(Pedantic note: conforming XML Schema processors are allowed not to provide the [type definition] property, if instead they provide a bundle of properties including the [type definition name], [type definition namespace], and [type definition anonymous] properties. I ignore such light-weight processors here because I assume that XML Query will require access to the type definition components themselves. Anyone not sharing that assumption may implicitly insert the phrase "(or [type definition name] and related properties)" wherever I mention the [type definition] property, as long as they adjust quantifiers and negation properly.)

When the entire subtree has been schema-validated

First, we can distinguish three cases in which the entire subtree has been schema-validated:

Full validation, valid

1 This element, and all of its descendants, have been checked and are schema-valid. This is the rough equivalent of DTD-based validation: everything has a declaration, and everything conforms to the declaration.

[validation attempted] = "full"
[validity] = "valid"
[type definition] property is present

The Query/XPath data model has to cover these elements.

Full validation, invalid

2 This element, and all of its descendants, have been checked and there is a problem right here at this element (and possibly also with some descendant).

[validation attempted] = "full"
[validity] = "invalid"
[type definition] property is not present

We need to decide whether the Query/XPath data model should cover these elements and/or their descendants. It seems plausible to want to cover at least all fully-assessed schema-valid descendants (i.e. descendants in class 1). We can also cover the element with the problem by treating it as if it had the urType.

Full validation, locally valid

3 This element, and all of its descendants, have been checked and while this element is 'locally valid', some descendant is invalid.

[validation attempted] = "full"
[validity] = "invalid"
[type definition] property is present

This will be the description of the top-level element in a database, if one attribute in one record is out of bounds. It seems plausible to want to cover at least these elements, and probably at least some of their descendants (i.e. at least those descendants which are also in this class).

Partial schema-validation

Second, there are four cases in which part of the subtree has been schema-validated and part not.

Schema-validity will not be assessed on elements or attributes if they or some ancestor matches an ANY wildcard which prescribes "skip" processing. Skip processing forbids schema-validity assessment and creates a 'black-box' location in a document in which any well-formed XML is legal. Schema-validity will also not be assessed for elements and attributes if (a) they or some ancestor matches a wildcard which prescribes "lax" processing and (b) no declaration was available for some descendant. Lax processing calls for schema-validity to be assessed for elements and attributes if matching declarations are available, and skipped if declarations are not available; it creates a 'white box' in which undeclared elements and attributes are allowed, but in which all elements and attributes are schema-validated if declarations are available for them.
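The difference between the 'black box' and the 'white box' can be sketched as a small decision function. This is only an illustration of the paragraph above: the function name and its two parameters are hypothetical, the mode is taken to be the one already in effect for the item (whether set on the item itself or inherited from an ancestor), and "strict" processing is deliberately left out because the paragraph does not discuss it.

```python
def is_assessed(process_contents: str, has_declaration: bool) -> bool:
    """Say whether an item governed by a wildcard is schema-validated.

    process_contents -- the wildcard's processing mode in effect for the item
    has_declaration  -- whether a matching declaration is available
    """
    if process_contents == "skip":
        # Black box: assessment is forbidden, declaration or not.
        return False
    if process_contents == "lax":
        # White box: assess when a declaration is available, skip otherwise.
        return has_declaration
    raise ValueError(f"mode not covered by this sketch: {process_contents!r}")


# A declared element is still left alone inside a "skip" region ...
declared_in_skip = is_assessed("skip", has_declaration=True)
# ... but is validated inside a "lax" region.
declared_in_lax = is_assessed("lax", has_declaration=True)
# An undeclared element in a "lax" region is allowed and simply not assessed.
undeclared_in_lax = is_assessed("lax", has_declaration=False)
```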

Partial validation, valid item

4 This element has been schema-validated, and is schema-valid (which means also that none of its attributes or children is invalid or missing a required declaration), but some descendant is not marked "valid".

[validation attempted] = "partial"
[validity] = "valid"
[type definition] property is present

I believe we want to cover these elements in our data model.

Partial validation, (locally) invalid

5 This element has been schema-validated, and is invalid because there is a problem right here at this element.

[validation attempted] = "partial"
[validity] = "invalid"
[type definition] property is not present

I believe we need to decide whether we want to cover these elements in our data model. If we do wish to cover them, we can do so (I think) by assigning them the urType.

Partial validation, locally valid

6 This element has been schema-validated, and is invalid because although it's OK 'locally', it has some invalid descendant.

[validation attempted] = "partial"
[validity] = "invalid"
[type definition] property is present

I believe we do want to cover these elements in our data model.

Partial validation, locally unvalidated

7 This element has not been schema-validated, but at least one of its descendants has been.

[validation attempted] = "partial"
[validity] = "notKnown"
[type definition] property is not present

I believe we need to decide whether we want to cover these elements in our data model. I believe we do want to cover them, and that we can do so by assigning them the urType.

When the subtree was skipped

Finally, there is one case in which no part of the subtree has been schema-validated.

Unvalidated

8 Neither this element nor any of its descendants has been schema-validated.

[validation attempted] = "none"
[validity] = "notKnown"
[type definition] and [type definition name] properties are not present

The Query/XPath data model can easily cover these elements by assigning the urSimpleType to all attributes and the urType to all elements.
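The eight cases above amount to a lookup on three values, which the following Python sketch spells out. The names `Outcome`, `CASES` and `classify` are hypothetical, and the presence or absence of the [type definition] property is reduced to a boolean for simplicity. Combinations that do not appear among the eight cases are rejected rather than guessed at.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    validation_attempted: str   # "full", "partial" or "none"
    validity: str               # "valid", "invalid" or "notKnown"
    has_type_definition: bool   # is the [type definition] property present?


# Case numbers as used in this note, keyed by the three property values.
CASES = {
    ("full", "valid", True): 1,        # full validation, valid
    ("full", "invalid", False): 2,     # full validation, invalid
    ("full", "invalid", True): 3,      # full validation, locally valid
    ("partial", "valid", True): 4,     # partial validation, valid item
    ("partial", "invalid", False): 5,  # partial validation, (locally) invalid
    ("partial", "invalid", True): 6,   # partial validation, locally valid
    ("partial", "notKnown", False): 7, # partial validation, locally unvalidated
    ("none", "notKnown", False): 8,    # unvalidated
}


def classify(outcome: Outcome) -> int:
    """Return the case number (1 to 8) for an element's assessment outcome."""
    case = CASES.get(tuple(outcome))
    if case is None:
        raise ValueError(f"combination not among the eight cases: {outcome}")
    return case


# The top-level element of a database with one out-of-bounds attribute
# somewhere below it: fully assessed, invalid, but with a type definition.
database_root = classify(Outcome("full", "invalid", True))
```

Note that cases 2 and 3 (and likewise 5 and 6) share the same [validation attempted] and [validity] values; only the [type definition] property tells them apart.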

I think we can cover all these cases, if we simply assign the urType and urSimpleType to items which have no [type definition] property. The question is, do we wish to do so? (If we do, we need to be careful to distinguish elements associated with the urType by the schema validator from those for which the association with the urType came from the query system, not the schema validator.)
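One way to keep that distinction is to record where each type association came from. The sketch below is an assumption about how this might be modelled, not a proposal from the data model draft: `TypeAnnotation`, `annotate`, and the two source labels are hypothetical names, and type definitions are represented by plain strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TypeAnnotation:
    type_name: str  # a schema type name, "urType" or "urSimpleType"
    source: str     # "validator" or "query-system"


def annotate(kind: str, type_definition: Optional[str]) -> TypeAnnotation:
    """Give every item a type, recording who supplied it.

    kind            -- "element" or "attribute"
    type_definition -- the [type definition] reported by the schema
                       validator, or None if the property is absent
    """
    if type_definition is not None:
        return TypeAnnotation(type_definition, "validator")
    if kind == "element":
        return TypeAnnotation("urType", "query-system")
    if kind == "attribute":
        return TypeAnnotation("urSimpleType", "query-system")
    raise ValueError(f"unknown item kind: {kind!r}")


# Both elements end up with the urType, but the provenance differs.
declared_as_ur_type = annotate("element", "urType")
unvalidated_element = annotate("element", None)
unvalidated_attribute = annotate("attribute", None)
```

With the source recorded, a query can still ask whether the urType was a finding of the schema validator or merely a default filled in by the query system.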