Friday 17 October 2008

DSDL: initial thoughts

Prompted by some questions of Rick Jelliffe on the schematron list, I've been looking at DSDL lately. This is a multipart ISO/IEC standard defining various aspects of XML validation. The parts are at various stages, from fully ratified international standard to (I think) no existing draft at all. Below is a summary of collected thoughts on the various parts.

1: Overview

No great surprises, just an index into the following parts...

2: Regular-grammar-based validation - RELAX NG

For anyone working with document-oriented XML, Relax NG is the gold standard of schema languages. It is more expressive than either XSD or DTD, more naturally namespace aware and (especially in its compact syntax form) particularly easy to write.

3: Rule-based validation - Schematron

I think this is probably the oldest of the languages that were pulled together to form DSDL. It's stood the test of time. I've never really had occasion to use it much, but I have from time to time made experiments with schematron, including the first implementation of the HTML-based "schematron-report" version, and somewhat more recently some experiments on this blog.

4: Namespace-based validation dispatching language - NVDL

NVDL, or something like it, should perhaps form the basis of solving the perennial problem of forming "Compound Documents", such as including MathML in a host document type such as XHTML, or DocBook, or TEI, etc. I have nagging doubts, though. It sometimes seems rather heavyweight just for validation: for any particular case (such as MathML in XHTML) it's usually possible to construct a compound schema directly. And, as we found while looking at W3C CDF, in many contexts defining a compound schema is the easy part of defining a compound document format. The real problems lurk elsewhere, in defining the behaviour and inheritance of properties across the interface between the languages: questions such as whether the current font, or font size, should be inherited from the host document type by the embedded fragment.

However, problems of property inheritance and event propagation are rightly out of scope for a validation language, so this really is just musing on a perennial problem that we have with MathML in (anything), which NVDL doesn't really address.

5: Data Type Library Language - DTLL

I don't know, this is more or less the right thing, although I sent some minor comments regarding the use of XPath 2 (which the current draft avoids in favour of XPath 1).

But perhaps it's just too late. XSD datatypes are rather horrible and rather inflexible but they more or less do the job, most of the time, and even Relax NG users are by now in the habit of using xs:boolean and friends, as of course are XPath 2 users. Perhaps it is too late for this to ever gain traction, but perhaps not...

6: Path-based integrity constraints

This part appears to be on hold, with no public draft.

7: Character Repertoire Description Language - CRDL

CRDL, or CREPDL as it appears to be known now, is what sparked my current interest in DSDL. Specifically, Rick Jelliffe asked on the schematron list for code to convert a CREPDL specification into schematron (which is effectively the same as converting it into one or more XPath expressions).

I sketched out a rough implementation in that thread, but actually I think that this is perhaps harder than it need be, as CREPDL is using too powerful a technology to express character ranges. Regular expressions are a highly efficient mechanism for specifying substrings, but CREPDL, as currently specified, really just specifies single characters. A character repertoire is just a partition of the Unicode code range into three (characters that are definitely in, definitely out, or maybe in the repertoire). However, if regexps were not used, a different syntax would have to be invented for character ranges, and I don't have any good suggestions here, so perhaps using regex is OK, perhaps...
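To make the "partition into three" view concrete, here is a minimal Python sketch. It is not CREPDL syntax and not the implementation from the schematron list thread: the two range tables and the helper names classify and char_class are invented purely for illustration. It classifies a single character against inclusive code point ranges, and also builds the equivalent regex character class, which is roughly the shape of expression a schematron rule would end up testing.

```python
import re

# Hypothetical repertoire, invented for illustration only.
# Each entry is an inclusive (low, high) pair of code points.
DEFINITELY_IN = [(0x0020, 0x007E)]   # printable ASCII
MAYBE_IN = [(0x00A0, 0x00FF)]        # Latin-1 supplement


def in_ranges(cp, ranges):
    """True if code point cp falls inside any of the inclusive ranges."""
    return any(lo <= cp <= hi for lo, hi in ranges)


def classify(ch):
    """Place one character in one of the three parts of the partition."""
    cp = ord(ch)
    if in_ranges(cp, DEFINITELY_IN):
        return "in"
    if in_ranges(cp, MAYBE_IN):
        return "maybe"
    return "out"


def char_class(ranges):
    """Build a regex character class from the ranges, using \\U escapes."""
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return f"[{body}]"


def all_definitely_in(text):
    """True if every character of text is definitely in the repertoire."""
    return re.fullmatch(char_class(DEFINITELY_IN) + "*", text) is not None
```

Note that the range tables carry all the information; the regular expression is only one way of writing them down, which is the sense in which regex is more machinery than the problem needs.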

8: Document Schema Renaming Language - DSRL

It's difficult to know what to say about this section. WG1 recently published a Defect Report detailing some of the comments I'd raised on the public comment list. But really that list just scratches the surface. The specification as it stands is completely contradictory and unimplementable.

9: Datatype- and Namespace-aware DTDs

This seems to be a technically sound, but doomed, attempt to give a veneer of namespace respectability to DTD. Perhaps in 1998 this might have had a chance of taking off but now, post Relax NG, I can't see the point. DTDs are not going to go away any time soon despite predictions in some quarters (at NAG, for example, we use DTDs extensively), but if I want a namespace-aware grammar language I'd use Relax NG every time rather than a DTD syntax with a collection of processing instructions giving namespace bindings.

10: Validation Management

This part appears to be on hold, with no public draft.

Wednesday 15 October 2008

XML 1.0 Fifth Edition

The W3C is about to publish XML 1.0 5th edition, that is assuming that my (and others') objections are overruled.

This sets a really terrible precedent, and sadly puts XML into a similar state to HTML, where the specification will be widely ignored (as it will be inconsistent) and people will have no choice other than just to test against a collection of major implementations and do whatever they do. The position for HTML is so bad that the editor for HTML 5 is on record as saying that the HTML 4 specification is essentially irrelevant to HTML5, and HTML5 is instead based on a formalisation of existing implementation behaviour. XML was intended to move markup languages away from such "tag soup" and base everything on a well specified foundation.

XML 1.1 changed the rules for XML names (in a good way), allowing a very much more open set of characters to be used in XML names. However, XML 1.1 has not had wide take-up, and so the XML Core WG has decided to use the subterfuge of changing the XML Recommendation in place by introducing a fake erratum that changes the Name production.

There is an attempt to trivialise my and similar objections as process objections. Clearly it is a gross abuse of process but that is not the main point of the objection. It is a technical issue. 5th edition places every specification that refers to XML into a completely unspecified status. Do the features of the language use the original XML 1.0 production or the incompatible one in the 5th edition? I asked for a simple yes/no answer to the question of whether it would be conformant to use the new characters in XPath. It is clear from the reply that even members of the XML Core (and W3C TAG) groups cannot say definitively whether a single XPath step using such a character is conformant or not. If Henry Thompson can't answer this, how can anyone expect a normal developer to know the answer? The issue is not restricted to XPath; the same lack of clarity surrounds simple questions as to whether IDs using such a character are valid in SVG, or DocBook, or any other language you care to name.
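For readers who want to see what "the new characters" means in practice, the following Python sketch encodes the NameStartChar ranges of the 5th edition Name production. The function names are mine, and the second function relies on a stated assumption: the name characters permitted by the earlier editions all lie in the Basic Multilingual Plane, so a name start character above U+FFFF is a sufficient (not necessary) sign that a name depends on the 5th edition rules. It says nothing about whether using such a name in XPath, SVG or DocBook is conformant, which is exactly the open question.

```python
# NameStartChar ranges from the XML 1.0 Fifth Edition Name production,
# written as inclusive (low, high) code point pairs.
NAME_START_5E = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def is_name_start_5e(ch):
    """True if ch may start a Name under the Fifth Edition production."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_5E)


def starts_outside_bmp(name):
    """True if name starts with a 5th edition name character above U+FFFF.

    Assumption: earlier editions only allowed BMP characters in names,
    so this is a sufficient but not necessary test for a name that
    relies on the Fifth Edition rules.
    """
    return bool(name) and is_name_start_5e(name[0]) and ord(name[0]) > 0xFFFF
```

For example, U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) passes is_name_start_5e and is flagged by starts_outside_bmp, so an element named with that single character is a legal Name only under the 5th edition production.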

No doubt the development community will recover and make things work, but as I said above, users will have to go by what the implementors do; they will no longer be able to go by the specifications. That is a shame, and one that might yet have bad consequences.

At the very least the TAG ought to update its finding on versioning strategies to explain how, if a user community shows some resistance to using a new version, a useful approach is to remove choice by making incompatible changes in place, but without changing either the major or minor version number.