Lost in the Semantics

Track: Core Technologies, Metadata and Semantics, Deploying XML

Audience Level: High Level/Technical view

Time: Tuesday, November 15 14:45

Author: Lucian Holland, DecisionSoft

Keywords: Data Representation, Internet, Markup, Metadata, Web Services, XBRL, XML Publishing


Recently I have been working on eXtensible Business Reporting Language (XBRL). XBRL turns on its head a domain that has traditionally been obsessed with presentation, by having accountants apply precise semantic markup to financials; what would once have been a carefully crafted document is now reduced to a set of tagged values. This provokes unease in certain quarters: financial professionals are being asked to let a technology abstraction come between them and the canonical representation of data of enormous commercial sensitivity. Where before they could read the official form of a company's accounts themselves, they now face the prospect of relying on a piece of software to present it to them in readable form.

This set me thinking: is this not a problem for the whole XML community? Much work is currently being done on moving the web from a variety of presentational markup languages to more semantic markup. In this paper, I will explore whether this process risks leaving the human users of the content behind, and what we might do to address the problem.

In the first part of this paper I want to look at some examples of people publishing precisely, semantically marked-up XML as a primary format on the web. There are still surprisingly few, but some of them are pretty influential (Amazon with Amazon Web Services, some of the big newswires using NewsML, for example). How do these content providers try to manage the relationship between a bare-bones "semantic feed" in XML and something that communicates with a human audience? There are a few technology issues here, but they are, by and large, well understood; in a distributed, web services environment, the real issue is one of trust: if you are just publishing the raw data, what should you do to ensure that whoever or whatever shows it to a human user will do so in a way that is not misleading?

The answer to this sort of question depends in part on the nature of the content and the consequences of misrepresentation, which brings me back, ultimately, to XBRL and the issues of confidence from which I started. In the last part of the paper I look at the specific problems faced by XBRL in this area, as an example of a markup language that is entirely semantic, designed for machine-readability, and handling data whose misrepresentation can have serious consequences. I try to sketch what I see as the shape of a solution, and to draw some conclusions for publishing semantically marked-up XML more generally.