XML schema languages

XML schema languages How's a User to choose? B. Tommie Usdin Mulberry Technologies, Inc.

Rockville Maryland United States of America

B. Tommie Usdin is President of Mulberry Technologies, Inc., a consultancy specializing in XML and SGML. Ms. Usdin has been working with SGML since 1985 and has been a supporter of XML since 1996. She chairs IDEAlliance's Extreme Markup Language conferences and was co-editor of "Markup Languages: Theory & Practice" published by the MIT Press. Ms. Usdin has developed DTDs, Schemas, and XML/SGML application frameworks for applications in government and industry. Projects include reference materials in medicine, science, engineering, and law; semiconductor documentation; historical and archival materials. Distribution formats have included print books, magazines, and journals, and both web- and media-based electronic publications. DTD XSD RNG W3C XML Schema RELAX NG XML schemas specify what tagging is allowed in a set of XML documents. Originally XML had only one way to express these rules; now there are many. What are they? What are the differences among them? When is one more appropriate than another? How should a user (or a project) choose which to use?

Introduction Schemas are a major component of virtually all XML applications. A schema can be considered: the model for one type or class of information — reference books, back transfers, journal articles, credit transactions, drug monographs a set of rules describing how documents of that type can be marked up an agreement on a common vocabulary (tag set) for an application. All XML schemas are written in a formal syntax, sometimes called a "constraint language" because they define constraints on what a set of XML documents may contain. Examples of XML schema languages include: DTDs, W3C XML Schema, and RELAX NG. XML schemas (XML document models) describe rules about elements, attributes, and (in the case of DTDs) entities. The schema may include rules about what may be inside each of these structures, how they relate to each other, and may specify datatypes (valid units of data). Schemas, and choice of schema language, seems to generate more heat (emotion) than any other aspect of XML application design. There are "experts" who sneer at anyone who still uses DTDs (that's old-fashioned). There are people who know that the only "serious" XML schema language is W3C's XML Schema; you know that because it's called "Schema" and the others are all called something else. There are people who sing the praises of RELAX NG, and suggest that all projects should be based on it because it is based on a clean underlying model. There are people who think it doesn't matter which of those models you use, as long as you use Schematron to make up for the shortcomings they all have. There are people who are making up new constraint languages for particular environments. There are even people who suggest that while DTDs may be unfashionable, they meet the needs of some applications, and thus should be used when possible because they are so my less expensive (in terms of labor and learning) than the newer options. The only absolute about XML schema selection I want to leave you with, and thus I start with it, is: Disregard any advice on schema language selection that is not based on a good understanding of your needs, your project, and your application environment. Just as one-size-fits-all socks don't fit all, one schema language does NOT meet all needs.

Functions of schemas Schemas serve several purposes in XML applications. We use schemas to: Provide "handles" for data Enhance data Drive tools Protect applications Identify unlikely content

Provide 'handles for data' By "Provide 'handles' for data" I mean provide the basic infrastructure of XML. XML Schemas allow us to: Name the things you can identify (in other words, name the elements and attributes) Provides names for these things (the tags) In XML it is far easier to manipulate data that has "handles" than data that doesn't. So, for example, it may be possible to find the year in an element than contains day, month, and year all mixed together, but it is far easier and more reliable to find the year if it is either identified as an element or an attribute; that is, if there is a handle on it.

Enhance data Schemas are sometimes used to add information to XML documents. This happens in two ways: Default values. DTDs can provide default values for attributes The "Post Schema Validation Infoset", produced by XSD validation software, includes information about data types and validation status. XSD validators can also provide default values for elements and attributes.

Drive Tools Many schemas are used to drive tools: Context sensitive editors Documentation tools Databases Content management Many of these tools work best (or work at all) with only one of the XML schema languages. And, in fact, many of then have documentation that is so biased toward one schema language that they often don't even acknowledge that other languages exist. I find that it is not unusual for organizations, or projects, to made product decisions. especially selecting editing tools, very early in the development process. Sometimes these product work only, or by preference, with one schema language or another. It is not uncommon to find entire project designs, including but not limited to selection of schema language, falling out of one (often poorly informed) tool selection.

Protect applications The most important function of schemas in many applications is to protect the system from data that might do harm. There are many ways in which data can harm systems; I have seen typesetting systems crash when presented with a footnote that contained a table which had a footnote. There are other environments where inappropriate content could cause destructive database updates. A schema can ensure that needed content is present and protect from situations that cause harm or crashes. If schema validation is part of your security system, if one of the functions of the schema is to protect your systems or your database, it is important the you have no surprises in your validation. (This is a warning: some schema validation tools can provide "partial" validation. This is a circumstance in which you don't want partial validation.)

Identify unlikely content Schemas are a key tool in identifying and correcting data errors. It is important to distinguish between data that is always an error and data that is highly unlikely but could be acceptable under some circumstances. In my opinion, most of the users who chafe at XML systems, and most failures of XML systems that include end users creating XML content, are caused by schemas that prohibit the unlikely, but occasionally correct content. This is because many system administrators confuse the "protect the system" function and the "alert QA to unlikely content" function. You want your system to reject harmful content, so data that is invalid according to a protective schema should be rejected. But, for example, your house style may say that lists should have two or more items. If your schema enforces this rule the technical writer who is trying to develop several parallel chapters, most of which have several "features" in the initial list of features will be forced to commit tag abuse if one chapter is about something that really has only one feature. Similarly, it is reasonable to suggest that if the unit of currency of a price is Yen that the value should probably be greater than 1000. But if you enforce that you will have a problem if someone in your system is selling individual carpet tacks!

XML schema languages - 2005 I categorize the XML schema languages in use today into two types: Document modeling languages, and Business rules checkers. Document modeling languages are for the creation of a complete representation of the document type. They allow one to specify, for one document type: all of the tags allowed (note: the schema specifies the "tag"s; elements are inferred from tagged objects, not specified in the schema) what relationships are allowed among the tagged elements (which are inside which, which are required, how many times the may occur, etc.) what attributes are allowed on which elements what content is allowed in the elements what content is allowed in the attributes In other words, document modeling languages are for modeling a complete document. Business rules checkers do not, generally, provide an overview of the document type. Instead, they specify particular rules that the documents must obey. These rules can be quite complex, and may apply to as much or as little of a document as needed. business rules checkers, unlike document models, can check interactions among multiple documents and can have access to things outside the XML document such as databases and authority lists. Examples of things that are often specified in business rules checkers are: The content of an element must appear in an external table, list, or file If the value of an attribute is one of a specified list, a specific element must appear the value of this attribute (or content of this element) must be greater than the value of that attribute (or content of that element) There may be no more that some number of instances of an element, in any context, throughout the document If this optional element is not present, that implied attribute must be present If one document contains something, another document must contain something parallel. Of the XML schema languages available today, I categorize them as follows: Document Modeling Languages XML 1.0 DTD (Document Type Definition) XSD (XML Schema Definition Language, i.e. W3C XML Schema) RELAX NG Business rules checkers Schematron special purpose languages such as BICS (Business Information Conformance Statement)

XML Document Type Definitions (DTDs) DTDs were the first schema language for XML; they were defined in the XML specification. They are primarily about the element structure of the documents, and only incidentally about attributes. DTD validation is all or nothing; either the document is valid or it is not. Validation with a DTD changes the content of the information set. For example, default attribute values are provided. Referential integrity between elements within a document is enforceable through attributes (ID / IDREF). But that's about all you can specify, and check with a DTD.

Objections to DTDs There are many objections to using DTDs as your XML schema language: The syntax is not XML document syntax (so you have to learn to read it) it requires a single-purpose processor (XML parser) Restricted modeling functions No context-dependent models No AND functionality No element data typing (therefore no type validation) No inheritance (except by convention xml:lang) Documentation only as comments Lacks other things RDBMS schemas provide: referential integrity, co-occurrence constraints, classing and derivation That Said, Many Organizations Use DTDs (2005), for good business reasons: Easy migration from earlier SGML Widely understood Easy to learn, read, write Tools are/were ubiquitous Breaks up the problem of validation use DTDs for what they're good for complete by supplementing with other methods Many of the advanced abilities (data typing and data types) are of limited use in processing narrative text

XSD (W3C XML Schema) XSD seems to be the schema-language-de-jure. Partly, I think this is because it was defined by the W3C, and many people/organizations prefer to use W3C specifications than other specifications. (I think this preference for W3C specifications will fade as there are more and more of them and they are proven to be of uneven quality, and as ISO standardizes more and more XML-related specifications - some of which did not start in the W3C. XSD provides nearly all of the functionality of DTDs, a great deal of additional functionality, especially datatyping, and uses XML document syntax for the document model. XSD operates over the XML document infoset, not over the XML document.

How XSD works W3C XML Schema does not operate over XML documents (XML documents are strings of characters and entity references). It assumes document has been parsed into a "tree" with all entities resolved. This "infoset" is composed of structures (nodes) of named types (strings, numbers, booleans, what-have-you). Validating with a schema produces a PSVI ( "post-schema-validation infoset" ); which consists of trees and their "data content". It can be provided with labels (type annotations), defaults (element and attribute) can be added, and validity or invalidity outcomes noted in the PSVI.

RELAX NG RELAX NG, pronounced as both “relax-N-G” and “relaxing”, was originally defined at OASIS and is now integrated into DSDL [ISO 19575: Document Schema Definition Language] (as Part 2: Grammar-based validation). In RELAX NG modeling is achieved with structural patterns with are not as complex as XSD modeling, go beyond XML DTDs, and can handle some constraints (especially useful in textual documents) that XSD cannot. RELAX NG oes not change the information set (there are no defaults or types added).It conceives of XML as text, also taking advantage of its tree structure. RELAX NG has two syntaxes (programmatically interchangeable); an XML one like XSD (for processing) a compact syntax (human readable).

Business Rules checkers Business rules checkers generally don't model complete documents; they model "checkable" features of documents. Sort of like spot checking; they look for a specific set of conditions and report on their presence or absence. One of the things users tend to like best about business rules checkers is that the person setting up the rules usually specifies the messages that report on the presence or absence of the specified conditions, which means that the messages can be in the vocabulary of the users.

Schematron Schematron does not define a document model, it defines rules based on path expressions. Originally a "meta-application" of XSLT, Schematron assertions are composed in XPath about what is to be expected or warned against. A generic stylesheet processes the assertions and returns a stylesheet which is run on the instance and generates a validation report. Schematron is currently being abstracted away from XSLT/XPath as part of DSDL (ISO/IEC 19757) (as Part 3: Rule-based validation - Schematron).

BCIS (Business Information Conformance Statement) IBM's BICS (http://www-128.ibm.com/developerworks/xml/library/x-bics20/) an example of a special-purpose validation tool. IBM says: The Business Information Conformance Statement specification (sometimes referred to as “BICS”) provides an XML vocabulary framework for declaring a constraint processing model across abstract constraint mechanisms. BICS enables various schema, constraint templates, type systems, etc, to be defined as a concrete constraint mechanism. A BICS XML document instance then contains instances of concrete constraint mechanisms within a constraint processing model, resulting in a comprehensive statement of information constraints.

Selecting Appropriate Constraint/Validation Languages There are a number of factors that should be considered in selecting a schema language: Nature of application the great data versus narrative text divide just validation or also authoring, query, etc. Suitability of particular modeling features (types, typing) Tradeoff between expressiveness and overhead / learnability Namespaces and inclusion of foreign XML vocabularies Ability to use XSLT to extract material from schemas Need particular feature in pipeline (PSVI, character entities, etc.) Many people have argued (loudly) that schema languages should be selected for readability. First of all, I wonder why they think technology decisions should be made based on the convenience of the developers; it seems to me that there are many more important selection criteria. Also, oddly, most of the people arguing for readable schema formats are arguing for XSD - certainly the bulkiest of the XML schema formats and in my opinion the most difficult to read. Yes, tiny demonstrations of XSD are easy to read without learning anything, but when scaled up to a model big enough for real applications most readers would far rather learn a compact syntax than slog through a multi-thousand line XSD. RELAX NG has a short form (RELAX NG Compact) DTDs are math-like and concise (and I can teach anyone to read DTDs in a half-hour) Nobody cares if Schematron is readable (but it is) XSD can be viewed through XML document tools (hierarchy diagrams, etc.)

So, How's a User to Choose? Don't! Do not choose one schema language. Especially early in a project, don't choose. Selecting a schema language for an organization or a project early in the planning stages (when I tend to see such decisions) is sort of like deciding, early in the planning stages for a house that you will use nails for this building; no screws, no adhesives, just NAILS! There is simply no need to do that; use the appropriate language for each function in your application. Don't allow the selection of a particular tool which may only work with one schema language dictate to all of your other tools what language to use. Remember the many functions of schemas in XML? Why would you think that the same specification language would be the most appropriate way to protect your systems and to identify content that it unusual and may be in error? Or that the same language is appropriate to drive all of your tools, much less drive your tools and validate your content so that other users of the content will know what to expect? Among the most important decisions in the architecture of an XML application is selection of schema languages. Use the most appropriate language for each function. You don't dance and climb mountains in the same shoes; you shouldn't drive an editor, identify odd content, and agree on data interchange structures with the same language.