<!DOCTYPE TEI.2 PUBLIC '-//TEI//DTD TEI Lite 1.0//EN'
    "../lib/swebxml.dtd">
<?xml-stylesheet href="../lib/tltohtml.xsl" type="text/xsl"?>
<TEI.2>
<teiHeader>
<fileDesc>
<titleStmt>
<title>Notes on schema-validation results</title>
</titleStmt>
<publicationStmt>
<pubPlace>Cambridge</pubPlace>
<pubPlace>Sophia-Antipolis</pubPlace>
<pubPlace>Tokyo</pubPlace>
<publisher>World Wide Web Consortium</publisher>
<date>2001</date>
</publicationStmt>
<sourceDesc>
<p>Created in electronic form.</p>
</sourceDesc>
</fileDesc>
</teiHeader>
<text>
<front>
<titlePage>
<docTitle>
<titlePart>Notes on schema-validation results</titlePart>
</docTitle>
<docAuthor>C. M. Sperberg-McQueen</docAuthor>
<docDate>7 December 2001</docDate>
</titlePage>
</front>
<body>
<p>This document describes the results of 
schema validation as described in the W3C Recommendation
<title>XML Schema 1.0</title>, in particular the ways those
results differ from the results of (DTD-based) validation
as described in ISO 8879 and the XML 1.0 specification.</p>
<p>It is based on email sent by the author to the XML Query
Working Group in June 2001.</p>
<div1>
<head>Introduction</head>
<p>
At 2001-05-02 10:37, C. M. Sperberg-McQueen wrote: (full text at 
<xref>http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001May/0028.html</xref>)
<q rend="block"><p>
Section 3.3 [of the 27 April data model] has a paragraph which reads
<q rend="block"><p>
A "schema-invalid document" is an XML document that has
a corresponding schema but whose schema-validity assessment
has resulted in one or more element/attribute information
items being assigned values other than 'valid' for the
[validity] property in the PSVI.
</p></q></p>
<p>
I think the concept outlined here is going to be important, but I am
uncomfortable with the term used for it (without being able to
propose a better one at the moment).
...
</p>
<p>
XML Schema distinguishes valid nodes (on which strict assessment of
validity was attempted, and found the node valid), invalid nodes (on
which strict assessment was attempted, and found an error), and nodes
for which the validity status is 'notKnown'.  Our definition covers
documents within which any element or attribute is 'invalid' or
'notKnown', which means it also covers documents within which any
node was processed as a 'black box' (i.e. skipped during validation),
in addition to documents within which there is some detected error.
The term 'schema-invalid' will tend to suggest to the non-paranoid
reader a meaning narrower than that given by the definition.
</p></q>
</p>
<p>
We discussed this at the XML Query face-to-face meeting in May, and
Mary suggested substituting the term "incompletely validated".  Although 
I had suggested this term myself in my note of 2 May, I find that I 
am as troubled by it as by the term "invalid".
</p>
<p>
Upon consideration, I think I now know why.  The fact of the matter is
that XML Schema (a) provides more information about schema-validity
than a single bit (valid/invalid), and (b) provides schema-validity
information not just about the document as a whole but about each
element and attribute.  If we want the data model to cover more than
only the set of schema-valid documents, I think we need (a) to make
more than a binary distinction ourselves, and (b) to consider validity
as a property of (and validation an operation on) elements / subtrees,
not solely documents.
</p>
<p>
At the very least, if we want to go beyond fully-validated
schema-valid documents, we need to come to grips with which set of
documents, other than the schema-valid documents, we actually want to
cover, and how.
</p>
<p>
The purpose of the rest of this note is to provide a list of the cases
I think can usefully be distinguished, and note where we have
decisions to make.  If people agree that we need something more than
a binary switch, I will be willing to attempt formulating specific
language for the document-model document.
</p>
<p>
A conforming XML Schema processor provides information on (inter alia):
<list>
<item>the ancestor element at which schema-validation ('assessment')
started</item>
<item>whether this particular element and its descendants were 
schema-validated or not</item>
<item>the result of the assessment</item>
<item>the type associated with the element</item>
</list>
</p>
<p>
The various combinations of values of the [validation attempted],
[validity], and [type definition] properties can usefully distinguish
several cases: eight by my count.  A diagram showing the various
combinations of [validation attempted] and [validity] is at
<xref>http://www.w3.org/XML/Group/2001/06/validity-outcomes</xref> if
that helps.
</p>
<p>
(Pedantic note: conforming XML Schema processors are allowed not to
provide the [type definition] property, if instead they provide a
bundle of properties including the [type definition name], [type
definition namespace], and [type definition anonymous] properties.  I
ignore such light-weight processors here because I assume that XML
Query will require access to the type definition components
themselves.  Anyone not sharing that assumption may implicitly insert
the phrase "(or [type definition name] and related properties)"
wherever I mention the [type definition] property, as long as they
adjust quantifiers and negation properly.)
</p>
</div1>

<div1>
<head>When the entire subtree has been schema-validated</head>
<p>
First, we can distinguish three cases in which the entire subtree has
been schema-validated:
</p>
<div2>
<head>Full validation, valid</head>
<p>
1 This element, and all of its descendants, have been checked and are
schema-valid.  This is the rough equivalent of DTD-based validation:
everything has a declaration, and everything conforms to the
declaration.
<eg>
  [validation attempted] = "full"
  [validity] = "valid"
  [type definition] property is present
</eg>
</p>
<p>
The Query/XPath data model has to cover these elements.
</p>
</div2>
<div2><head>Full validation, invalid</head>
<p>
2 This element, and all of its descendants, have been checked and
there is a problem right here at this element (and possibly also with
some descendant).
<eg>
  [validation attempted] = "full"
  [validity] = "invalid"
  [type definition] property is not present
</eg></p>
<p>
We need to decide whether the Query/XPath data model should cover
these elements and/or their descendants.  It seems plausible to want
to cover at least all fully-assessed schema-valid descendants
(i.e. descendants in class 1).  We can also cover the element with the
problem by treating it as if it had the urType.
</p>
</div2>
<div2>
<head>Full validation, locally valid</head>
<p>
3 This element, and all of its descendants, have been checked and
while this element is 'locally valid', some descendant is invalid.
<eg>
  [validation attempted] = "full"
  [validity] = "invalid"
  [type definition] property is present
</eg></p>
<p>
This will be the description of the top-level element in a database,
if one attribute in one record is out of bounds.  It seems plausible
to want to cover at least these elements, and probably at least some
of their descendants (i.e. at least those descendants which are also
in this class).
</p>
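<p>
A minimal sketch of such a document may make the case concrete.  The
element and attribute names and the bounds are hypothetical, and the
xs prefix is assumed to be bound to the XML Schema namespace; nothing
here is drawn from an actual schema.  Given the declarations shown
first, the second record in the instance carries an out-of-bounds
attribute, so the top-level element would be expected to fall into
this class (locally valid, but with an invalid descendant), while the
first record would fall into class 1.
<eg><![CDATA[
<!-- hypothetical schema: each record has a bounded 'age' attribute -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="db">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="record" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="age">
              <xs:simpleType>
                <xs:restriction base="xs:integer">
                  <xs:minInclusive value="0"/>
                  <xs:maxInclusive value="150"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

<!-- hypothetical instance: only the second record is out of bounds -->
<db>
  <record age="42"/>
  <record age="-3"/>
</db>
]]></eg>
</p>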
</div2>
</div1>

<div1>
<head>Partial schema-validation</head>
<p>
Second, there are four cases in which part of the subtree has been
schema-validated and part not.  
</p>
<p>
Schema-validity will not be assessed on elements or attributes if they
or some ancestor matches an <kw>ANY</kw> wildcard which prescribes "skip"
processing.  Skip processing forbids schema-validity assessment and
creates a 'black-box' location in a document in which any well-formed
XML is legal.  Schema-validity will also not be assessed for elements
and attributes if (a) they or some ancestor matches a wildcard
which prescribes "lax" processing and (b) no declaration was available
for some descendant.  Lax processing calls for schema-validity to be
assessed for elements and attributes if matching declarations are
available, and skipped if declarations are not available; it creates a
'white box' in which undeclared elements and attributes are allowed,
but in which all elements and attributes are schema-validated if
declarations are available for them.
</p>
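<p>
As a rough illustration of the two kinds of wildcard just described,
a schema might declare them along the following lines.  This is a
sketch only: the element names are hypothetical, and the xs prefix is
assumed to be bound to the XML Schema namespace.
<eg><![CDATA[
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- 'black box': whatever matches this wildcard is not assessed -->
  <xs:element name="opaque">
    <xs:complexType>
      <xs:sequence>
        <xs:any processContents="skip"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- 'white box': whatever matches is assessed if a declaration
       can be found for it, and left unassessed otherwise -->
  <xs:element name="extensible">
    <xs:complexType>
      <xs:sequence>
        <xs:any processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
]]></eg>
</p>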
<div2>
<head>Partial validation, valid item</head>
<p>
4 This element has been schema-validated, and is schema-valid (which
means also that none of its attributes or children is invalid or
missing a required declaration), but some descendant is not marked
"valid".
<eg>
  [validation attempted] = "partial"
  [validity] = "valid"
  [type definition] property is present
</eg></p>
<p>
I believe we want to cover these elements in our data model.
</p>
</div2>
<div2>
<head>Partial validation, (locally) invalid</head>
<p>
5 This element has been schema-validated, and is invalid because there
is a problem right here at this element.
<eg>
  [validation attempted] = "partial"  
  [validity] = "invalid"
  [type definition] property is not present
</eg></p>
<p>
I believe we need to decide whether we want to cover these elements in
our data model.  If we do wish to cover them, we can do so (I think) by
assigning them the urType.
</p>
</div2>
<div2>
<head>Partial validation, locally valid</head>
<p>
6 This element has been schema-validated, and is invalid because
although it's OK 'locally', it has some invalid descendant.
<eg>
  [validation attempted] = "partial"
  [validity] = "invalid"
  [type definition] property is present
</eg></p>
<p>
I believe we do want to cover these elements in our data model.
</p>
</div2>
<div2>
<head>Partial validation, locally unvalidated</head>
<p>
7 This element has not been schema-validated, but at least one of 
its descendants has been.
<eg>
  [validation attempted] = "partial"
  [validity] = "notKnown"
  [type definition] property is not present
</eg></p>
<p>
I believe we need to decide whether we want to cover these elements in
our data model.  I believe we do, and that we can do so by assigning
them the urType.
</p>
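<p>
One way such an element might arise is sketched below, using
hypothetical names and assuming the xs prefix is bound to the XML
Schema namespace.  The wrapper element matches a lax wildcard but has
no declaration, so it is not itself schema-validated; its child does
have a global declaration, and so (assuming the processor continues
lax assessment below the undeclared element) the child would be
assessed, leaving the wrapper in this class.
<eg><![CDATA[
<!-- hypothetical schema: a lax wildcard, plus one global declaration -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="extensible">
    <xs:complexType>
      <xs:sequence>
        <xs:any processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="note" type="xs:string"/>
</xs:schema>

<!-- hypothetical instance: 'wrapper' is undeclared, 'note' is declared -->
<extensible>
  <wrapper>
    <note>declared, so assessed</note>
  </wrapper>
</extensible>
]]></eg>
</p>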
</div2>
</div1>

<div1>
<head>When the subtree was skipped</head>
<p>
Finally, there is one case in which no part of the subtree has been
schema-validated.
</p>
<div2><head>Unvalidated</head>
<p>
8 Neither this element nor any of its descendants has been
schema-validated.  
<eg>
  [validation attempted] = "none"
  [validity] = "notKnown"
  [type definition] and [type definition name] properties are not present
</eg></p>
<p>
The Query/XPath data model can easily cover these elements by
assigning the urSimpleType to all attributes and the urType to
all elements.  
</p>
<p>
I think we can cover <emph>all</emph> these cases, if we simply assign
the urType and urSimpleType to items which have no [type definition]
property.  The question is, do we wish to do so?  (If we do, we need
to be careful to distinguish between elements associated with the
urType by the schema validator and those for which the association
with the urType came from the query system, not the schema validator.)
</p>
</div2>
</div1>
</body>
</text>
</TEI.2>
<!--* "http://www.hcu.ox.ac.uk/TEI/Lite/DTD/teixlite.dtd"> *-->
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:(concat sgmlvol "/SGML/Public/Emacs/teilite.ced")
sgml-omittag:t
sgml-shorttag:t
End:
-->

