Metadata on the Web

Why use Metadata

Web technologies are rapidly becoming a key method of building business applications. HTML and HTTP are being used to transfer critical data from business to business. However, HTML is unstructured data, and requires considerable processing to transfer into databases. Tags only describe how data is to be displayed, not what the data is.

Whilst a new generation of database query tools allows common database connection protocols to interrogate HTML pages, dealing with the information is a complex problem - similar to working with raw text. HTML is a descendant of the SGML international standard mark-up language (more specifically, it is defined by an SGML DTD, or Document Type Definition). SGML allows users to create structured documents that contain metadata: information about information. Using metadata it is possible to take a structured text document, parse it, and deliver the content to a database or to an application. SGML isn’t a new technology, and it has formed the basis of many bibliographic systems.
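
As a rough illustration of taking a structured document, parsing it and delivering the content to a database, the following Python sketch uses only the standard library. The element names, the ISBN values and the table layout are invented for the example and are not taken from any standard.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: tag and attribute names are illustrative only.
DOCUMENT = """
<catalogue>
  <book isbn="0-00-000000-0">
    <title>An Example Title</title>
    <price currency="GBP">12.50</price>
  </book>
  <book isbn="0-00-000000-1">
    <title>Another Example</title>
    <price currency="GBP">8.99</price>
  </book>
</catalogue>
"""


def load_books(xml_text, connection):
    """Parse the document and insert one row per <book> element."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS books "
        "(isbn TEXT, title TEXT, price REAL, currency TEXT)"
    )
    root = ET.fromstring(xml_text.strip())
    for book in root.findall("book"):
        price = book.find("price")
        connection.execute(
            "INSERT INTO books VALUES (?, ?, ?, ?)",
            (
                book.get("isbn"),
                book.findtext("title"),
                float(price.text),
                price.get("currency"),
            ),
        )
    connection.commit()


# An in-memory database stands in for a real business system.
connection = sqlite3.connect(":memory:")
load_books(DOCUMENT, connection)
rows = connection.execute(
    "SELECT isbn, title, price FROM books ORDER BY isbn"
).fetchall()
```

Because the tags say what each value is, the loader needs no knowledge of how the page would be displayed.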

If the web is to be used to link businesses and business systems, then metadata will become increasingly important. Companies will have different ways of displaying information, and metadata allows the development of translation systems. The World Wide Web Consortium (W3C) has been working to produce a specification for allowing structured information to be distributed over the Internet, and has recently released the specification for XML 1.0 - the eXtensible Mark-up Language.

XML on the Web

XML, like HTML, is based on SGML, but has some significant differences that make it closer to the original SGML specifications. SGML allows documents to define their own grammar, providing embedded rules for describing document layout, so that SGML documents can be designed for specific purposes by defining the tags used in the documents.

An HTML document differs from a standard SGML document in several major respects, as it is based on a frozen SGML document description. This means that the document definition isn’t included with every HTML document, keeping the document size down and making HTML documents more suitable for use over a distributed network. However, it does mean that the author of an HTML page is unable to take advantage of three of the key features of SGML: extensibility, structure and validation. Web application developers can’t add their own tags to HTML pages and expect them to be parsed by web browsers. There’s also no deep structure in HTML that allows pages to be used as database schema, and it is difficult for external applications to validate the data as the HTML tags are loosely defined.

The development of the World Wide Web as a business-to-business communications medium has led to a requirement for HTML-based documents to be used to transfer structured data. The W3C has set up an SGML Working Group to develop a system of delivering SGML-style self-describing documents using the HTTP protocol and standard web servers. One main requirement is that the documents produced can be as complex as required by the applications that will use them, rather than setting limits in the document definition.

The result of the SGML working group’s discussions is XML. Like SGML, XML is extensible, structured and can be validated, but it’s also easier to use and implement than SGML. Using XML, it’s possible to define new tags and attributes as and when they’re needed, and to nest them in complex document structures. It’s also possible to add a description of the page grammar, so that external applications can validate a document’s structure before using the data.
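
The idea of checking a document against a description of its grammar can be sketched in a few lines of Python. This is a deliberate simplification and not real DTD validation: the GRAMMAR table, the element names and the validate function are all hypothetical, and only the nesting of elements is checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: each element name maps to the children it may contain.
GRAMMAR = {
    "invoice": {"customer", "item"},
    "customer": set(),
    "item": {"description", "quantity"},
    "description": set(),
    "quantity": set(),
}


def validate(element, grammar):
    """Return a list of structural problems in element and its descendants."""
    if element.tag not in grammar:
        return [f"unknown element <{element.tag}>"]
    problems = []
    for child in element:
        if child.tag not in grammar[element.tag]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(validate(child, grammar))
    return problems


VALID = (
    "<invoice><customer>Example Ltd</customer>"
    "<item><description>Widget</description><quantity>3</quantity></item>"
    "</invoice>"
)
INVALID = "<invoice><item><price>5</price></item></invoice>"

valid_problems = validate(ET.fromstring(VALID), GRAMMAR)
invalid_problems = validate(ET.fromstring(INVALID), GRAMMAR)
```

An application receiving the second document can reject it before any of its data reaches a business system.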

XML parsers are now available for most languages, with the components provided in 5th generation browsers ready for use. Microsoft’s Visual Studio 6.0 is designed to take advantage of the Internet Explorer XML parser, and will be able to use the new XML engine bundled as part of IE 5.0. Java is an ideal environment for working with XML, and Beans and class libraries for handling XML are already available.

The Document Object Model

Building web applications using XML and SGML needs something more than plain HTML. The release of the 4.0 browsers from Microsoft and Netscape introduced a document object model that allowed complex scripting of the content of web pages.

As well as HTML, the document object model (DOM) is designed to work with XML tags. The W3C has defined a high-level specification for the DOM, which is designed to be language neutral and platform independent - though in practice there are differences between Microsoft’s and Netscape’s implementations, and ECMA-262 compliant scripting languages like JavaScript 1.3 and JScript are the most likely to be used. The DOM is also designed to be secure, and to prevent interactions on one page affecting another. Initially the security model used by the DOM is a “sandbox”, similar to that used by Java applets, though the W3C expects future releases to have a more complex security model.

The W3C’s specification for the DOM includes a series of high level requirements. These allow external applications to work with pages, as well as using browser-based scripting languages. The intention of the DOM specification is that it should be the same on all browsers, though currently this isn’t the case. The recent release of the developer preview of Netscape Communicator 4.5 has added some support, but there won’t be a reliable cross-platform DOM until both Netscape and Microsoft release their 5.0 browsers.

Using the DOM, an HTML/XML document becomes an object-oriented application, able to manipulate its own content. Microsoft have used this as the basis for their DHTML dynamic web page system. As well as HTML, the DOM also applies to CSS definitions, and forms the basis for the DHTML behaviours introduced in Internet Explorer 5.0. It’s also possible to use the DOM in external applications to enable them to parse HTML documents. Tools to handle this are included in Visual Studio 6.0, and Java and Perl libraries and classes are also available.

The DOM leads to more flexible HTML documents and, specifically, to DHTML, as it requires all the document content - including HTML tag elements and attributes - to be accessible to, and open to manipulation by, programs. This allows developers to change HTML elements and attributes in a page, without requiring the page to be reloaded. It’s also possible to use the DOM to modify CSS stylesheets on the fly, as well as inline styles applied in a document.
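
The same style of access is available outside the browser. The sketch below uses Python’s standard xml.dom.minidom module, which follows the W3C DOM interfaces, to change an attribute, replace some text and add an element without rebuilding the document. The page content is invented for the example, and a browser script would use the equivalent calls from JavaScript or JScript rather than Python.

```python
from xml.dom.minidom import parseString

# A tiny, well-formed page invented for this example.
PAGE = '<html><body><p id="status" class="normal">Waiting</p></body></html>'

document = parseString(PAGE)
paragraph = document.getElementsByTagName("p")[0]

# Change an attribute and the text content through the DOM tree.
paragraph.setAttribute("class", "alert")
paragraph.firstChild.data = "Order received"

# Add a new element after the existing paragraph.
note = document.createElement("p")
note.appendChild(document.createTextNode("Updated in place"))
paragraph.parentNode.appendChild(note)

# Serialise the modified tree back to mark-up.
result = document.documentElement.toxml()
```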

In order to allow the creation of interactive documents, the DOM also includes an event model to allow interactions between elements - usually triggered by a user action. Whilst the event model is very powerful, and allows every element on an HTML page to both generate and respond to events, it does give developers some problems. These are due to its synchronous nature, and the fact that events are also dependent on the structure of an HTML document, and can even be generated before a page is fully rendered by a browser. The DOM used in IE 5.0 provides mechanisms for handling these problems, but DHTML developers still need to take care.

RDF: metadata about metadata…

Metadata in documents isn’t enough to build a generic document interchange facility out of HTML and XML documents. It’s important that documents themselves are described, so that an invoice is treated differently from a personal web page, even though they may use the same XML and HTML tags.

The solution to this is the Resource Description Framework. A descendant of the PICS content rating technology, RDF is designed to be a general “knowledge representation mechanism” for the web. HTML documents containing RDF information are able to be processed by external applications, allowing search engines and catalogues to perform more effectively, as well as allowing XML-enabled applications to determine whether a document is designed to be used by the application. RDF can also be used to allow applications to exchange digital signatures - enabling secure, trusted electronic commerce. RDF descriptions are produced using XML, and an early example is the CDF channel definitions used by Microsoft and PointCast to describe how a browser should handle “push” data.

An RDF document description is composed of nodes, containing attribute/value pairs. Each node can be any web resource, or even further RDF information. RDF isn’t intended to have a set grammar and vocabulary, and the W3C expects that standard vocabularies will evolve. Already ratings and library vocabularies are being developed. A full RDF standard is currently still some way off, but significant work has been done by the W3C metadata working parties.
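
The node model can be pictured with a small Python data structure. This is only a sketch of the idea of nodes carrying attribute/value pairs, where a value may itself be another node. It is not an RDF implementation, and the Node class, the example.org addresses and the property names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A described resource: an address plus attribute/value pairs."""
    resource: str
    properties: dict = field(default_factory=dict)


def describe(node, indent=0):
    """Return an indented outline of a node and any nodes nested in it."""
    lines = [" " * indent + node.resource]
    for name, value in node.properties.items():
        if isinstance(value, Node):
            # The value is itself a described resource, so recurse into it.
            lines.append(" " * (indent + 2) + name + ":")
            lines.extend(describe(value, indent + 4))
        else:
            lines.append(" " * (indent + 2) + f"{name}: {value}")
    return lines


author = Node("http://example.org/staff/1", {"name": "A. N. Author"})
invoice = Node(
    "http://example.org/invoices/42",
    {"type": "invoice", "author": author},
)

outline = describe(invoice)
is_invoice = invoice.properties.get("type") == "invoice"
```

An application could use a property such as the hypothetical "type" above to decide whether a document is meant for it before processing the contents.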

Using XML in Web Applications

XML may seem to be a universal panacea, but it’s often difficult to see how it can be used in practice. The following scenarios show how XML can be used to link web applications and complex business systems, using metadata to reduce the complexity of the required interfaces.

Scenario 1: Orders and Invoice System

The traditional role of EDI is in enabling businesses to exchange information - usually orders and billing information. XML-enabled web applications can also be used in a similar manner to provide ready formatted electronic documents.

By using a web application to generate XML pages, a company’s orders can be held on a secure web server - either as pre-generated static pages, or generated automatically when a supplier or customer visits the site. By using an XML-aware browser or application, these pages can be downloaded directly from the server into business systems, and responses uploaded using the HTML binary-file upload facilities provided by 4.0 browsers, and supported by the current generation of scriptable web servers.

Using this form of XML-driven application, it’s possible for businesses to take advantage of the facilities provided by EDI, without having to invest in the infrastructure a standard EDI system requires, by using the Internet.
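
A minimal sketch of both ends of such an exchange is shown below, assuming an invented order format. The element and attribute names (order, line, sku) and the order number are hypothetical, and the transfer over HTTP is left out.

```python
import xml.etree.ElementTree as ET


def build_order(order_id, lines):
    """Server side: serialise an order as XML (hypothetical element names)."""
    order = ET.Element("order", {"id": order_id})
    for sku, quantity in lines:
        line = ET.SubElement(order, "line", {"sku": sku})
        line.text = str(quantity)
    return ET.tostring(order, encoding="unicode")


def read_order(xml_text):
    """Receiving side: turn the XML back into values a business system can use."""
    order = ET.fromstring(xml_text)
    lines = [(line.get("sku"), int(line.text)) for line in order.findall("line")]
    return order.get("id"), lines


message = build_order("PO-1001", [("WIDGET", 3), ("GADGET", 12)])
order_id, order_lines = read_order(message)
```

The two businesses only need to agree on the document description, not on each other’s internal systems.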

Scenario 2: Airport Flight Information Systems

One of the most information-rich environments is the airport, and it’s also one of the most demanding - as information needs to be current, and is required by many different systems. XML can be used to simplify these systems, in conjunction with Java applications and a true “push” environment.

By using XML to define flight information, it will be possible for information to be distributed over an open network via an IP multicast system. XML encoded information can then be parsed at viewer terminals using a publish and subscribe model, allowing display systems to tailor information to specific audiences. Using this mechanism, central display boards can carry full flight details, whilst local boards could display airline or gate specific information, at different levels of detail.
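
The filtering done at each display can be sketched as follows. The feed format, flight numbers and airline names are invented, and the multicast delivery itself is outside the scope of the example: the sketch only shows how one XML feed can serve boards with different needs.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, as it might arrive at every display terminal.
FEED = """<flights>
  <flight number="EX101" airline="Example Air" gate="12" status="Boarding"/>
  <flight number="EX202" airline="Example Air" gate="7" status="Delayed"/>
  <flight number="ZZ303" airline="Sample Jet" gate="12" status="On time"/>
</flights>"""


def select(feed_xml, **criteria):
    """Return the flight numbers whose attributes match every criterion."""
    flights = ET.fromstring(feed_xml).findall("flight")
    return [
        flight.get("number")
        for flight in flights
        if all(flight.get(name) == value for name, value in criteria.items())
    ]


central_board = select(FEED)                         # no filter: every flight
gate_12_board = select(FEED, gate="12")              # a board at one gate
airline_board = select(FEED, airline="Example Air")  # an airline's own desk
```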

Scenario 3: Bid and Offer brokerage systems

One of the more interesting electronic commerce models is the concept of bid and offer systems. Instead of the traditional brokerage system, where a broker finds the best deal available for a client, from a range of different suppliers, the client states their requirements, which are then provided to a selection of suppliers, who make their responses. This model is especially attractive in the travel industry, where a client can specify their flight requirements (including cost!), and receive a selection of different responses from which they can choose the flight closest to their requirements.

XML allows a broker to build a web application that automates the process of passing requirements to airlines, by delivering an XML-formatted message to the airline enquiry system. This can then be translated, and a response created, and returned as an XML document. These can then be processed and displayed to the client. Once a decision has been made, a further XML message can be sent containing confirmation details and payment information.
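
The round trip can be sketched in Python as below. Everything here is hypothetical: the request and offer formats, the airline names and the prices are invented, and the airline_reply function merely stands in for a real airline enquiry system.

```python
import xml.etree.ElementTree as ET

# The client's requirements, including the most they are willing to pay.
REQUEST = '<request origin="LHR" destination="JFK" max_price="400"/>'


def airline_reply(request_xml, airline, price):
    """Hypothetical supplier: answer a request with a single priced offer."""
    request = ET.fromstring(request_xml)
    offer = ET.Element("offer", {
        "airline": airline,
        "origin": request.get("origin"),
        "destination": request.get("destination"),
        "price": str(price),
    })
    return ET.tostring(offer, encoding="unicode")


def acceptable_offers(request_xml, replies):
    """Broker: keep offers within the client's limit, cheapest first."""
    limit = float(ET.fromstring(request_xml).get("max_price"))
    offers = [ET.fromstring(reply) for reply in replies]
    kept = [
        (offer.get("airline"), float(offer.get("price")))
        for offer in offers
        if float(offer.get("price")) <= limit
    ]
    return sorted(kept, key=lambda pair: pair[1])


replies = [
    airline_reply(REQUEST, "Example Air", 380),
    airline_reply(REQUEST, "Sample Jet", 450),
    airline_reply(REQUEST, "Demo Wings", 350),
]
shortlist = acceptable_offers(REQUEST, replies)
```

The confirmation and payment message that follows a decision would be one more XML document in the same style.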

IE 5.0 and XML, the heart of Office 2000

Microsoft’s 5.0 browser is designed to take advantage of XML and CSS, and the recent developer preview introduced XML related core features - including the powerful DHTML behaviours scripting feature. By linking client-side scriptlets to CSS stylesheets, web developers are able to add new object-oriented behaviours to existing HTML elements and to new XML elements.

As the document object model can be accessed by any scriptlets associated with a document, it’s possible to pass events and values between different behaviours. Using XML tags to create metadata, and CSS to define how it is presented, behaviours allow an HTML/XML document to process data, so that dynamically generated HTML pages can act as a user interface to web applications, rather than using Java applets. By allowing XML tags and behaviours to be used in HTML documents, IE 5.0 finally splits the web application development team from the web design team. Instead of each page (especially if it contains scripts) in an application requiring the attentions of both teams before being able to be used, reusable external components and CSS allow designers to lay out a page, without affecting scripts that are called by extra attributes on an HTML or XML tag. This is a two way street, as designers can add their own attributes and request that scriptlets be produced to implement their requirements.

The next release of Microsoft’s office applications suite, Office 2000, is designed to take advantage of XML and DHTML behaviours. As well as using Microsoft’s proprietary file formats, Office 2000 will save files as HTML, with embedded XML. An Excel 2000 spreadsheet will appear identically in Internet Explorer 5.0 - including active content such as tooltips - using XML and behaviours to hold metadata information, and to implement active content. It’s possible to edit the resulting HTML/XML document in a text editor or HTML editor (including changing or adding new XML metadata), and the changes will appear when the file is loaded into the original application.

The use of XML in Office 2000 is designed to take advantage of the use of Internet technologies in the modern enterprise. The Intranet is rapidly becoming the standard method of delivering one-to-many communication in large and small organisations, and a document format that doesn’t require every desktop PC to have a copy of the original application, or a specialised viewing tool, is ideal in this environment - especially when it avoids the cost of the tools currently required to deliver digital paper documents.

The way ahead

The rapid growth of the World Wide Web has seen some innovative and adventurous solutions to the problems that result from using an unstructured data-format for complex business transactions. The development of XML has removed a lot of the complexity required, by introducing an SGML-based extensible metadata format that allows application developers and web page designers to work together to produce rich application-aware documents, which are both machine-readable and can be displayed attractively in web browsers.

By developing XML as an extensible document format, based on open standards, the W3C has produced a truly common file format, which promises to replace the proprietary systems that have required application developers to produce complex translation tools. Even the notoriously proprietary Microsoft have seen the advantage of this, and Office 2000’s XML-based file format will allow documents to be shared over corporate Intranets without requiring IS teams to roll out Office to every desktop.

With XML parsing engines appearing in web browsers as reusable components suitable for use in external applications, and with the next generation of development tools beginning to appear, XML will become a crucial business tool. Instead of requiring complex interfaces to be written between organisations’ business systems, the XML document description can be published, allowing information to be exchanged without requiring massive investment.

 
