| The Lyons Family Website | |||||||||||||||||||||||||
The Undecidability of the Schematron Conformance Problemby Bob Lyons
|
|||||||||||||||||||||||||
|
aabb |
|
ccab |
Can one or more of these rectangles be arranged in a row (left to right), such that the following requirements are met?
So, we can reformulate our first example as the following set of rectangles:
|
a |
|
aa |
|
ab |
|
b |
The strings in the top halves of the rectangles are the strings in list L, and the strings in the bottom halves of the rectangles are the strings in list M. The first string in list L (i.e., "a") is the string in the top half of the first rectangle, and the second string in list L (i.e., "ab") is the string in the top half of the second rectangle. Likewise, the first string in list M (i.e., "aa") is the string in the bottom half of the first rectangle, and the second string in list L (i.e., "b") is the string in the bottom half of the second rectangle.
This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row (i.e., "aab") is equal to the concatenation of the strings in the bottom half of the row:
|
a |
ab |
|
aa |
b |
We can reformulate our second example as the following set of rectangles:
|
a |
|
aaa |
|
abaaa |
|
ab |
|
ab |
|
b |
This instance of the PCP can be solved with the following row, since the concatenation of the strings in the top half of the row is "abaaaaaab" and the concatenation of the strings in the bottom half of the row is also "abaaaaaab":
|
abaaa |
a |
a |
ab |
|
ab |
aaa |
aaa |
b |
Now that we understand the PCP, we can get back to our proof that the SCP is undecidable.
Recall that we assumed that the SCP is solvable (i.e., we assumed that there is an algorithm for solving the SCP). Now we extend this algorithm into a new algorithm that solves the PCP.
The input to the new algorithm is an instance of the PCP (i.e., two lists of nonempty strings over an alphabet containing at least two letters, such that the two lists have the same number of strings).
The following is the new algorithm that solves the PCP:
So, we now have an algorithm that solves the PCP. However, the PCP is unsolvable. So, our initial assumption (that there is an algorithm for solving the SCP) must be false. Therefore, the SCP is undecidable.
Now we'll show how a PCP instance can be transformed into a Schematron schema, such that an XML document conforms to the schema only if the XML document contains a solution to the PCP instance.
First, we describe the format of the XML documents that contain solutions to PCP instances. The following XML document contains a solution for the PCP instance of example #2:
<?xml version="1.0"?>
<pcp-instance-solution>
<list-1>
<string id="2">abaaa</string>
<string id="1">a</string>
<string id="1">a</string>
<string id="3">ab</string>
</list-1>
<list-2>
<string id="2">ab</string>
<string id="1">aaa</string>
<string id="1">aaa</string>
<string id="3">b</string>
</list-2>
</pcp-instance-solution>
Recall that a solution to this PCP instance
is the following sequence of integers:
( 2, 1, 1, 3 ). This sequence of integers is
specified by the id attributes of the
/pcp-instance-solution/list-1/string elements;
the solution is specified again by the
id attributes of the
/pcp-instance-solution/list-2/string
elements.
The /pcp-instance-solution/list-1 element
contains the 2nd string of list L, followed by
the 1st string of list L, followed by the 1st
string of list L, followed by the 3rd string
of list L. The /pcp-instance-solution/list-2 element
contains the 2nd string of list M, followed by
the 1st string of list M, followed by the 1st
string of list M, followed by the 3rd string
of list M.
Notice that the concatenation of the values
of the /pcp-instance-solution/list-1/string elements
is equal to the concatenation of the values
of the /pcp-instance-solution/list-2/string elements.
(In other words,
"abaaaaaab"
is equal to
"abaaaaaab".)
You may have noticed that this XML document contains
much more information than the sequence of integers
that solves the problem. The extra information makes
it easier for the reader to verify the solution.
Also, if the document merely specified a list of integers that
solved the PCP instance, then I would not have been
able to write a Schematron schema that verifies
that the solution is correct (unless the schema
used the document function to invoke
a CGI script, which would be cheating!).
The following is a Schematron schema for example #2. An XML document will match this schema only if it provides a solution for the PCP instance of example #2.
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
<!-- This Schematron schema is used to verify that
an XML document contains a solution to the
following instance of the Post Correspondence
Problem (PCP):
Alphabet A = { a, b }
List L = ( a, abaaa, ab )
List M = ( aaa, ab, b )
An XML document conforms to this Schematron
schema if it contains a solution to this
PCP instance.
If an XML document conforms to this Schematron
schema, then the schema will not issue any
messages when validating the document. If an
XML document does not conform to this Schematron
schema, then the schema will issue one or more
error messages when validating the document.
-->
<pattern name="check-root">
<rule context="/">
<assert test="pcp-instance-solution">
The /pcp-instance-solution element was not
found.
</assert>
</rule>
</pattern>
<pattern name="check-pcp-instance-solution">
<rule context="/pcp-instance-solution">
<assert test="list-1">
The /pcp-instance-solution/list-1 element
is missing.
</assert>
<assert test="list-2">
The /pcp-instance-solution/list-2 element
is missing.
</assert>
<assert test="count(*) = 2">
The /pcp-instance-solution element has an
incorrect number of subelements.
</assert>
<assert test="translate(
normalize-space(list-1),
' ',
'' ) =
translate(
normalize-space(list-2),
' ',
'' )">
The concatenation of the
/pcp-instance-solution/list-1/string/text()
nodes is not equal to the
concatenation of the
/pcp-instance-solution/list-2/string/text()
nodes.
</assert>
</rule>
</pattern>
<pattern name="check-list-1">
<rule context=
"/pcp-instance-solution/list-1/string[@id='1']">
<assert test=". = 'a'">
The value of the
/pcp-instance-solution/list-1/string[@id='1']
element is not equal to 'a'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-1/string[@id='2']">
<assert test=". = 'abaaa'">
The value of the
/pcp-instance-solution/list-1/string[@id='2']
element is not equal to 'abaaa'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-1/string[@id='3']">
<assert test=". = 'ab'">
The value of the
/pcp-instance-solution/list-1/string[@id='3']
element is not equal to 'ab'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-1/string">
<assert test="@id = '1' or
@id = '2' or
@id = '3'">
The value of a
/pcp-instance-solution/list-1/string/@id
attribute is not equal to '1', '2' or '3'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-1/text()">
<assert test="normalize-space( . ) = ''">
Found a non-whitespace text node as
child of the
/pcp-instance-solution/list-1 element.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-1/*">
<assert test="name( . ) = 'string'">
The /pcp-instance-solution/list-1 element
has a subelement other than a string
element.
</assert>
<assert test="@id">
A subelement of the
/pcp-instance-solution/list-1 element
has no id attribute.
</assert>
</rule>
<rule context="/pcp-instance-solution/list-1">
<assert test="string">
The /pcp-instance-solution/list-1 element
does not contain any string subelements.
</assert>
</rule>
<rule context="/pcp-instance-solution/list-1">
<assert test="count( string ) =
count( ../list-2/string )">
The /pcp-instance-solution/list-1 element
does not have the same number of string
elements as the
/pcp-instance-solution/list-2 element.
</assert>
</rule>
</pattern>
<pattern name="check-list-2">
<rule context=
"/pcp-instance-solution/list-2/string[@id='1']">
<assert test=". = 'aaa'">
The value of the
/pcp-instance-solution/list-2/string[@id='1']
element is not equal to 'aaa'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2/string[@id='2']">
<assert test=". = 'ab'">
The value of the
/pcp-instance-solution/list-2/string[@id='2']
element is not equal to 'ab'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2/string[@id='3']">
<assert test=". = 'b'">
The value of the
/pcp-instance-solution/list-2/string[@id='3']
element is not equal to 'b'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2/string">
<assert test="@id = '1' or
@id = '2' or
@id = '3'">
The value of the
/pcp-instance-solution/list-2/string/@id
attribute is not equal to '1', '2' or '3'.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2/text()">
<assert test="normalize-space( . ) = ''">
Found a non-whitespace text node as the
child of the
/pcp-instance-solution/list-2 element.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2/*">
<assert test="name( . ) = 'string'">
The /pcp-instance-solution/list-2 element
has a subelement other than a string
element.
</assert>
<assert test="@id">
A subelement of the
/pcp-instance-solution/list-2 element
has no id attribute.
</assert>
</rule>
<rule context=
"/pcp-instance-solution/list-2">
<assert test="string">
The /pcp-instance-solution/list-2 element
does not contain any string subelements.
</assert>
</rule>
</pattern>
<pattern name="check-correspondence">
<rule context=
"/pcp-instance-solution/list-1/string">
<assert test=
"@id = ../../list-2/string[
count( current()/preceding-sibling::* )
+ 1 ]/@id">
The value of an id attribute in a
/pcp-instance-solution/list-1/string
element is not equal to the value of the
id attribute of the corresponding
/pcp-instance-solution/list-2/string
element.
</assert>
</rule>
</pattern>
</schema>
The validations performed by this schema include the following, which verify that the solution solves the PCP instance:
string elements in the
list-1 element must be equal to the
concatenation of the values of the
string elements in the
list-2 element.
list-1/string element
must be equal to the ith string in list L,
where i is the value of the id
attribute of the list-1/string element.
list-2/string element
must be equal to the ith string in list M,
where i is the value of the id
attribute of the list-2/string element.
@id attribute in
the nth string
element of the list-1 element must
be equal to the
value of the @id attribute in
the nth string
element of the list-2 element.
list-1 element must contain at least one
string subelement.
list-2 element must contain at least one
string subelement.
list-1 element must have the same number
of string subelements as the list-2 element.
The schema also performs the following validations of the structure of the document:
pcp-instance-solution.
pcp-instance-solution must be the list-1 and
list-2 elements.
list-1 and list-2 elements must be
string elements.
string element must contain an
id attribute.
string element must not have any
subelements and must not contain any attributes other
than the id attribute.
A corollary of the undecidability of the Schematron Conformance Problem (SCP) is that the Schematron Intersection Problem (SIP) is undecidable. The SIP is defined as follows:
Given two Schematron schemas, is there an XML document that conforms to both schemas?
To prove the corollary, we show, by way of contradiction, that if the SIP is decidable, then the SCP is decidable. If the SIP is decidable, then we can devise an algorithm that accepts any two Schematron schemas as input and responds with one of the following:
Now we extend this algorithm into a new algorithm that solves the SCP.
The input to the new algorithm is a Schematron schema, which we'll call S. We pass S and another Schematron schema, which we'll call T, as inputs to the algorithm that solves the SIP. T is the following Schematron schema, which matches all XML documents:
<?xml version="1.0"?>
<schema xmlns="http://www.ascc.net/xml/schematron">
<!-- This schematron schema matches any XML document.
-->
<pattern name="check-root">
<rule context="/">
<assert test="count(*) = 1">
Error: The number of document elements is
not equal to 1.
</assert>
</rule>
</pattern>
</schema>
Our new algorithm responds with one of the following:
Thus, the new algorithm solves the SCP (i.e., the SCP is decidable). However, since we showed that the SCP is undecidable, our initial assumption (that the SIP is decidable) must be false. Therefore, the SIP is undecidable.
Is the schema conformance problem undecidable for other XML schema languages, such as XML Schema, DTD and RELAX NG? In their paper entitled What's Hard about XML Schema Constraints?, Marcelo Arenas, Wenfei Fan and Leonid Libkin show that the XML Schema Conformance Problem is undecidable.
The DTD Conformance Problem looks solvable. It seems easy to construct an XML document that conforms to a DTD, as long as the DTD contains at least one element type declaration and contains an element type declaration for each child element type that is mentioned in the element type declarations.
It is possible to write a RELAX NG schema, such that there are no XML documents that conform to the schema. The trick is to use a data type that includes two contradictory constraints (e.g., a string must consist of an even number of 'a' characters, but must be 3 characters in length).
For example, there are no XML documents that conform to the following RELAX NG schema:
<?xml version="1.0" encoding="UTF-8"?>
<element name="conflicted"
xmlns="http://relaxng.org/ns/structure/1.0"
datatypeLibrary=
"http://www.w3.org/2001/XMLSchema-datatypes">
<data type="string">
<param name="pattern">(aa)+</param>
<param name="length">3</param>
</data>
</element>
So, the RELAX NG Conformance Problem is an interesting problem, but it doesn't seem to be unsolvable.
In this paper, we showed that the
Schematron Conformance Problem
(SCP) is undecidable.
This is true even if we restrict ourselves to
Schematron schemas that do not use the
document and key functions of XPath.
To prove the undecidability of the SCP, we showed that
if the SCP is decidable, then the Post Correspondence
Problem is decidable; however, the Post Correspondence
Problem is undecidable. Therefore, the SCP must be undecidable.
The fact that the SCP is undecidable means that it's impossible to build a Schematron schema editor that evaluates the user's schema and always lets him/her know whether or not there are any XML documents that conform to the schema.
Thanks to John Cowan for his comments on DTD conformance.
The following resources provide information about XML schema languages:
The following resources provide information about the Post Correspondence Problem, the Halting Problem and undecidability: