XML redbook

 
 



1.2 The Extensible Markup Language (XML)

This section introduces the Extensible Markup Language (XML) and related standards, the organizations behind this technology, and the current trends. This introduction does not cover the technical details of these standards.

1.2.1 World Wide Web document standards

The World Wide Web (WWW) became very popular and widespread in the past 5 years. Trends in communication networks, personal computers, and operating systems, as well as the underlying technology of the WWW, made it easy to access and use. The Web relies on two main standards, the HyperText Markup Language (HTML) and the HyperText Transfer Protocol (HTTP). Implementations of these standards can be found on all major platforms of today’s computing systems, from personal computers to mainframes, regardless of the operating system used on them.

Web browsers implement HTTP to receive content from the Web and display the received HTML documents to the user. They became available on every platform and provide universal access and a universal user interface to the WWW. Web servers, on the other hand, were designed to deliver HTML files to browsers using HTTP. Due to the licensing policy of the first server and browser implementations (they were practically free to use, and the source code was also available), they soon became widespread.

HTML was designed to describe the presentational format of a document. Because it is text-based, a simple text editor is enough to create and edit HTML documents. Internet Service Providers (ISPs) introduced Web services to their customers to store and publish HTML files, and the WWW became a fast-growing supply of information for the Internet community. As more and more users published their Web documents on the Internet, and software vendors introduced new features to their browsers, the HTML standard was extended by new, de facto components and features. However, it remained a limited standard, because it concentrates on the presentation and not on the content. The growing amount of data stored in HTML format, and the increasing problems of handling this data, led to a proposal for a new document standard for the World Wide Web, the Extensible Markup Language (XML).

1.2.2 A brief history of XML

In 1969, an IBM team developed a document description language (the Generalized Markup Language, GML) to solve the problem of different document formats of various systems. GML formed the basis of many IBM documentation systems, including Script and Bookmaster. In the following years, the language evolved into the Standard Generalized Markup Language (SGML), which became an international standard (ISO 8879) in 1986 for the format of text and documents.

SGML is the document standard for big industries such as airplane construction, automobile manufacturing, and the military. Its strengths are that it is an implementation-independent, generalized, structured, and extensible language. These features made it popular with companies that create, handle, and distribute large amounts of text data.

In 1989, researchers at the CERN European Nuclear Research Facility developed a hypertext version of the SGML standard, called Hypertext Markup Language (HTML), to solve information sharing tasks within the organization. HTML inherited important features from SGML (such as being structured, implementation-independent, and descriptive), but it was also limited in many areas (it used a fixed set of element types, and it concentrated on the presentation). These limitations were necessary to make the language simpler, so that software for it was easy to implement and documents were easy to edit. However, the growing amount of data stored in Web systems put these limitations into focus.

The World Wide Web Consortium (W3C, the organization behind the Web standards) introduced several extensions to the HTML standard to solve its interoperability and scalability problems, but finally it decided to develop a new subset of the SGML standard, XML, for Web use. The Extensible Markup Language (XML) was developed to overcome the limitations of the HTML standard. It retains most of the features of the SGML standard, but is easier to implement and use in the World Wide Web environment. It became a W3C Recommendation in 1998. The initial draft of this new standard included ten key design goals, which are worth listing here:

• XML shall be straightforwardly usable over the Internet.

• XML shall support a wide variety of applications.

• XML shall be compatible with SGML.

• It shall be easy to write programs which process XML documents.

• The number of optional features in XML is to be kept to the absolute minimum, ideally zero.

• XML documents should be human-legible and reasonably clear.

• The XML design should be prepared quickly.

• The design of XML shall be formal and concise.

• XML documents shall be easy to create.

• Terseness is of minimal importance.

1.2.3 XML — a universal data format

In practice, XML is almost indistinguishable from SGML. It has almost all of the capabilities of SGML that are widely supported by implementations, but it lacks some important capabilities of SGML that primarily affect document creation, not document delivery. That is because XML was not designed to replace SGML in every respect, but only on the Web.

While HTML is a single markup language, designed for a particular application, XML is really a family of markup languages: in fact, you can define any number of markup languages in XML. This means that almost any type of data can be easily defined in XML. So, in addition to a universal communications medium (the Internet), a universal user interface (the browser), we now have a universal data format — XML.

XML is universal, not only by its range of applications, but also by its ease of use: its text-based nature makes it easy to create tools, and it is also an open, license-free, cross-platform standard, which means anyone can create, develop, and use tools for XML. What also makes XML universal is its data description power. Data is transmitted and stored in computers in many different ways: originally it was stored in flat-files, with fixed-length or delimited formats, and then it moved into databases, and often into complex binary formats. XML is a structured data format, which allows it to store complex data, whether it is originally textual, binary, or object-oriented. To this day, very few data-driven technologies have managed to address all these different aspects in one package — except XML.

In markup languages like XML, SGML, and HTML, the pieces of information in a document are marked with beginning and ending tags. These markings identify the pieces according to the creator’s intentions (within the frame of the language). For example, in HTML, these tags instruct the browser how it should present the data between the marks to the viewer. In SGML and XML, these tags can do more: they can describe the information content held by the data between the marks. To define the meaning of the tags used in a document, these languages use the document type definition (DTD). Document creators can select or create the DTD they want to use to specify the meaning of tags. The DTD defines all elements used in a set of documents, and it also specifies the relation of these elements to each other in a tree format. Each element has an element type name (tag name) and a set of attributes. Each attribute consists of a name and a value. In XML 1.0, element type names and attribute names are strings of (a restricted set of) characters, similar to identifiers in programming languages.
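The terms above (element, attribute, DTD, tree) can be made concrete with a small sketch. The document below is hypothetical: the element type names `catalog`, `book`, and `title` and the attribute `isbn` are invented for illustration. The sketch uses Python's standard `xml.etree.ElementTree` module, which checks well-formedness but does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element names and the isbn attribute are
# invented for this illustration. The DTD is given as an internal subset.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (title)>
  <!ATTLIST book isbn CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
<catalog>
  <book isbn="0-000-00000-0">
    <title>An Example Title</title>
  </book>
</catalog>
"""

# ElementTree reads past the DTD without validating against it;
# only well-formedness is checked here.
root = ET.fromstring(DOCUMENT)


def describe(element, depth=0):
    """Return one line per element: its type name and its attributes."""
    lines = ["  " * depth + f"{element.tag} {element.attrib}"]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


print("\n".join(describe(root)))
```

The indentation of the printed lines mirrors the tree that the DTD declares: a `catalog` contains `book` elements, and each `book` contains a `title`.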

1.2.4 A short comparison of XML and HTML

The problem with data available in HTML format is that it is formatted for people to view, and not for computers to use. HTML consists of a pre-defined set of tags, primarily for viewing purposes. This makes it a language that is easy to learn and accessible, but since it only concentrates on the presentation, it is hard to reuse the data in HTML format.

This is where XML enters the picture. As its name indicates, XML is extensible, which means that you can define your own set of tags and make it possible for other parties (people or programs) to know and understand these tags. This makes XML much more flexible than HTML. In fact, because XML tags represent the logical structure of the data, they can be interpreted and used in various ways by different applications. See Figure 5.
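As a rough illustration of this difference, the sketch below holds the same hypothetical product data twice: once in presentation-oriented HTML and once in XML with invented tag names (`product`, `name`, `price`). Two "applications" then use the XML data in different ways. It is a minimal example using Python's standard library, not a recommended vocabulary.

```python
import xml.etree.ElementTree as ET

# Presentation-oriented markup: the tags say how to display the text,
# not what the text means. (Hypothetical fragment.)
HTML_FRAGMENT = "<p><b>Espresso machine</b><br/><i>149.00 USD</i></p>"

# Content-oriented markup: the invented tag names describe the data itself.
XML_FRAGMENT = """
<product>
  <name>Espresso machine</name>
  <price currency="USD">149.00</price>
</product>
"""

product = ET.fromstring(XML_FRAGMENT)
name = product.findtext("name")
price = float(product.findtext("price"))
currency = product.find("price").get("currency")

# One application renders the data for people to view ...
as_html = f"<p><b>{name}</b><br/><i>{price:.2f} {currency}</i></p>"

# ... while another reuses the same data in a calculation.
total_for_three = 3 * price

print(as_html)
print(total_for_three)
```

Recovering the price from the HTML fragment would require guessing that the italic text is a price; in the XML fragment the tag name states it.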

Much of the value of the Web comes from re-using data. For example, one of the great success stories of the Web is the search engine. Search engines work on the basis of a universal communications method (HTTP) and a universal markup language (HTML) to catalog Web pages. However, they work on very limited information, because only a tiny part of an HTML document is designed to be used by a search engine (new XML-based initiatives such as the Resource Description Framework will make search engines more powerful; see 1.2.7, “Metadata (RDF and PICS)” on page 21). Imagine how much more powerful search engines could be if the data that they search was stored in a simple, structured, re-usable format that concentrates on describing the data, and not on the presentation.

A number of standards and specifications make XML more usable for the World Wide Web. The XML standard itself does not contain information about linking documents together, referencing one document from another, using stylesheets, defining document access interfaces, and so on. The W3C is working on related specifications that make this technology more complete.

There are also related specifications that target particular application areas, and solve different tasks. XML, as a low level syntax for representing structured data, provides a solid basis for defining a wide variety of application-specific languages and standards. A number of XML-based specifications are now under development at W3C and other industrial organizations.

1.2.5 XML linking and addressing

There are specifications for linking and addressing XML documents. Linking means establishing relationships between objects; addressing means describing how to find linked objects. The XLink (XML Linking Language) specification describes constructs that may be inserted into XML resources to describe links between objects. It can describe simple links like those in HTML, as well as more sophisticated multi-ended, typed links. The basic form of address is a Uniform Resource Identifier (URI, RFC 2396), which is a more general form of resource location than the Web’s URL (Uniform Resource Locator, RFC 1738).

XPath (XML Path Language) is a language for addressing parts of an XML document. It operates on the abstract, logical structure of an XML document. The XPointer (XML Pointer Language) specification supports addressing into the internal structures of XML documents. It is an extension of XPath to address points, ranges, and nodes; to locate information by string matching; and to use addressing expressions in URI references as fragment identifiers.
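To give a feel for addressing by logical structure, here is a small sketch using Python's standard `xml.etree.ElementTree` module. Note the assumptions: ElementTree implements only a limited subset of XPath (and nothing of XPointer), and the `order` document with its element and attribute names is invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to demonstrate path expressions.
ORDER = """
<order id="42">
  <customer>Example Corp</customer>
  <item sku="A1"><qty>2</qty></item>
  <item sku="B7"><qty>5</qty></item>
</order>
"""
root = ET.fromstring(ORDER)

# All item children of the context node.
all_items = root.findall("./item")

# The item whose sku attribute has a given value.
b7 = root.find("./item[@sku='B7']")

# The second item child (positions start at 1).
second = root.find("./item[2]")

# Every qty element anywhere below the context node.
quantities = [qty.text for qty in root.findall(".//qty")]

print(len(all_items), b7.get("sku"), second is b7, quantities)
```

Each expression selects nodes by their place in the element tree, not by their character position in the file, which is what "operating on the logical structure" means in practice.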

1.2.6 Advanced type definitions

There have been several proposals to enhance the applicability of Document Type Definitions. In a situation where different types of DTDs can exist within the same application (or document), a simple character string may not be enough to specify tags used in a document. The XML Namespaces recommendation extends the data model to allow element type names and attribute names to be qualified with a URI.
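The following sketch shows why a bare character string can be ambiguous and how URI qualification resolves it. The two namespace URIs and the `table` elements are hypothetical. Python's `xml.etree.ElementTree` is used here because it exposes qualified names in the form `{uri}localname`.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define an element called "table".
# The namespace URIs are invented for this illustration.
DOC = """
<doc xmlns:lib="http://example.org/library"
     xmlns:furn="http://example.org/furniture">
  <lib:table>Table of contents</lib:table>
  <furn:table>Oak dining table</furn:table>
</doc>
"""
root = ET.fromstring(DOC)

# The parser reports each name qualified with its namespace URI,
# so the two "table" elements are no longer confused.
tags = [child.tag for child in root]

# A prefix-to-URI mapping lets an application ask for one of them.
# The prefix used for lookup need not match the one in the document.
ns = {"books": "http://example.org/library"}
library_table = root.find("books:table", ns)

print(tags)
print(library_table.text)
```

The prefix is only a local abbreviation; the URI is what identifies the vocabulary.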

Since the XML Document Type Definition is a restricted format (more restricted than the SGML DTD), there are limitations when it comes to defining complex relationships between data elements and their usage, especially when XML documents also use namespaces which might define elements conflicting with DTD declarations. To resolve these limitations, a new standard was proposed: XML Schema.

The purpose of a schema is to describe a set of document constructs to constrain and document the meaning, usage, and relationships of their constituent parts: datatypes, elements with their content, and attributes with their values. Schemas specify richer models for data typing and inheritance. Schemas strengthen the document modelling and validating capability of XML, and make it possible to express structural relationships between document types.

1.2.7 Metadata (RDF and PICS)

Metadata, or data about data, provides information about data objects. This information can be used to label and catalog data. Metadata helps in organizing, locating, and understanding data. It provides a means to improve the productivity of the data management process. W3C's metadata activity is concerned with ways to model and encode metadata, and is developing two metadata standards, RDF and PICS.

• Resource Description Framework (RDF, http://www.w3.org/RDF) integrates a variety of Web-based metadata activities, including site maps, content ratings, stream channel definitions, search engine data collection, digital library collections, and distributed authoring. It is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of Web resources. RDF also provides a framework in which independent communities can develop vocabularies that suit their specific needs. The descriptions of these vocabulary sets are called RDF Schemas. A schema defines the meaning, characteristics, and relationships of a set of properties, and this may include constraints on potential values and the inheritance of properties from other schemas. A well-known RDF Schema is the Dublin Core, developed by the library community. RDF uses the idea of XML Namespaces to allow RDF statements to refer to a particular RDF vocabulary.

• Platform for Internet Content Selection (PICS, http://www.w3.org/PICS) consists of a set of specifications which allows people to assign labels to digital information for describing metadata. Labels contain information about the content in simple, computer-readable form. This information can be used according to user’s settings for filtering out undesirable material or directing users to sites that may be of special interest to them. While PICS has general applicability to labelling pages for a variety of metadata purposes, the PICS specification was originally designed to allow parents and teachers to screen out materials unsuitable for children using the Internet.
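As an illustration of the RDF item above, the sketch below reads a tiny RDF description that uses two Dublin Core properties and lists the statements it contains. The resource URL and property values are made up, and the code handles only the simplest form of the RDF XML syntax (a `Description` element with literal-valued child properties); it is not a general RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical metadata about a hypothetical Web resource.
RDF_XML = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:dc="{DC_NS}">
  <rdf:Description rdf:about="http://www.example.org/report.html">
    <dc:title>Quarterly Report</dc:title>
    <dc:creator>A. Author</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""


def statements(rdf_xml):
    """Yield (resource, property, value) for each simple property element."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        resource = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            yield (resource, prop.tag, prop.text)


for statement in statements(RDF_XML):
    print(statement)
```

Each property name is qualified by the namespace of its vocabulary, which is how RDF uses XML Namespaces to say which schema a statement refers to.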

1.2.8 Domain-specific document definitions

There are specialized vocabularies (DTDs) for different application fields. These pre-defined DTDs define a common document structure for the given application field to ensure that different application vendors use the same element definitions for the same purposes. There are DTD repositories where you can find definition files for your application field. There are only a few standardized document definitions at the time of writing this book, but it is expected that the number of these standardized DTDs will increase greatly as the industry leaders and standards bodies agree on them in various application fields. Horizontal XML specifications, which are not tied to a single application field, contain information about several fields, for example, measurements, date and time, country codes, basic business forms, and descriptions of businesses and individuals.

Vertical XML specifications describe a specific application area like electronic payment, mathematics, or chemistry. For example, a specialized language for trading is the Trading Partner Agreement markup language (tpaML) proposed by IBM, which will be described in detail in 6.2.1, “Trading Partner Agreements” on page 185.

There are also collections of vocabularies like Open Buying on the Internet (OBI, http://www.openbuy.org), Open Applications Group (http://www.openapplications.org), Open Travel Alliance (OTA, http://www.opentravel.com), and RosettaNet (http://www.rosettanet.org).

The World Wide Web Consortium is also redefining the HTML standard on the basis of XML. This new document standard is called XHTML (Extensible HyperText Markup Language, http://www.w3.org/TR/xhtml1), which is now a W3C Recommendation. The XHTML 1.0 specification describes XHTML, a reformulation of HTML 4.0 as an XML 1.0 application. This aims to move HTML into the XML area while retaining its processing ability in standard Web applications. XHTML is intended to be used as a language for content that both conforms to XML and operates in HTML 4 conforming user agents.

1.2.9 XML in wireless applications

Wireless application environments are domains where XML has already been playing an important role. These applications have special requirements for document formats and rendering due to their limited bandwidth and display capabilities. There are several XML-related proposals to fit mobile devices into the World Wide Web.

The W3C User Interface domain contains a Mobile Access Activity (http://www.w3.org/Mobile/Activity), which is working on protocols and data formats that provide an effective way to access the Web from mobile devices. The activity covers presentation languages as well as resource description formats that can represent the capabilities of these devices.

The Composite Capability/Preference Profiles (CC/PP) specification is currently a W3C Note. It is a user-side framework based on XML and RDF that describes the capabilities, hardware, system software, applications, and user preferences used by someone to access the Web. This information might include the preferred language, sound on/off, images on/off, class of device (phone, PC, printer, or other), screen size, available bandwidth, version of HTML supported, and so on.

The Wireless Application Environment (WAE, http://www.wapforum.org) specification contains several documents that describe the application environment. It includes the WML, WBXML, WMLScript and its standard libraries, and finally, the Wireless Telephony Application Specification and its application interface specifications.

WML (Wireless Markup Language) is a markup language based on XML, and is intended for use in specifying content and user interface for narrowband devices, including cellular phones and pagers. WML includes four major functional areas:

• It includes support for text and image, formatting and layout commands.

• Deck/card organizational metaphor structures information in a collection of cards and decks.

• The inter-card navigation and linking supports the navigation between cards and decks.

• String parameterization and state management is possible in WML decks, using a state model.
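The deck/card metaphor and inter-card navigation listed above can be sketched as follows. The deck is a hypothetical, simplified one: it omits the XML declaration and the document type declaration that a real WML deck would carry, and it uses only the `wml`, `card`, `p`, and `a` elements. The helper function `links_from` is invented for this example and is not part of any WAP toolkit.

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical deck with three cards.
DECK = """
<wml>
  <card id="menu" title="Travel">
    <p>
      <a href="#flights">Flights</a>
      <a href="#hotels">Hotels</a>
    </p>
  </card>
  <card id="flights" title="Flights">
    <p>No flights booked.</p>
  </card>
  <card id="hotels" title="Hotels">
    <p>No hotels booked.</p>
  </card>
</wml>
"""

deck = ET.fromstring(DECK)

# A deck is a collection of cards; index them by their id attribute.
cards = {card.get("id"): card for card in deck.findall("card")}


def links_from(card_id):
    """Return the ids of the cards in this deck linked from the given card."""
    hrefs = [anchor.get("href") for anchor in cards[card_id].iter("a")]
    return [href[1:] for href in hrefs if href.startswith("#")]


print(sorted(cards))
print(links_from("menu"))
```

Because the whole deck is delivered in one transfer, moving between its cards needs no further network traffic, which suits narrowband devices.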

WAP Binary XML Content Format (WBXML, http://www.w3.org/TR/wbxml/), proposed by Ericsson, IBM, Motorola, and Phone.com, defines a compact binary representation of the Extensible Markup Language. It is designed to reduce the transmission size of XML documents without losing functionality or semantic information.

1.2.10 XML styling and transcoding

The Extensible Style Language (XSL) is a specification by the World Wide Web Consortium for applying formatting to XML documents in a standard way. XSL is based on the Document Style Semantics and Specification Language (DSSSL, ISO/IEC 10179), but it is simplified and designed for Web use, and also has document manipulation capabilities beyond styling. Actually, the XSL specification is divided into two main documents. The XSL Transformations (XSLT) specification describes how to transform one XML document into another, and an XML vocabulary specifies formatting semantics. An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.

There are also ways of transcoding XML documents into other document formats. Today’s typical application scenario for this type of document transformation is server-based transcoding. For example, a Web server can be extended to process XML files on the fly and present HTML files to its clients. The output can also be tailored to the client’s needs; for example, a WAP client on a wireless phone would prefer a different document format than a browser on a home PC. The IBM WebSphere product family offers such document transformation using WebSphere Transcoding.
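The server-based scenario can be sketched in a few lines. This is only an illustration of the idea: the transformation is hand-written Python rather than an XSL stylesheet, it is not how WebSphere Transcoding works, and the `flight` vocabulary, the `transcode` function, and the client labels `"wap"` and `"pc"` are all invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical source data held on the server in XML.
FLIGHT = """
<flight>
  <number>XY123</number>
  <origin>Helsinki</origin>
  <destination>Budapest</destination>
</flight>
"""


def transcode(xml_text, client):
    """Render the same XML data as WML for "wap" clients, else as HTML."""
    flight = ET.fromstring(xml_text)
    # Escape the text so that it stays well-formed in the output markup.
    fields = {child.tag: escape(child.text or "") for child in flight}
    if client == "wap":
        template = ('<wml><card id="flight" title="Flight">'
                    '<p>{number}: {origin}-{destination}</p></card></wml>')
    else:
        template = ('<html><body><h1>Flight {number}</h1>'
                    '<p>From {origin} to {destination}</p></body></html>')
    return template.format_map(fields)


print(transcode(FLIGHT, "wap"))
print(transcode(FLIGHT, "pc"))
```

The data is stored once; only the last step, choosing the output format, depends on the client.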

1.2.11 XML query languages

Query languages and tools help to access information in XML documents in an efficient way. While XSL allows information access by filtering the required elements, it is a very limited tool for this purpose. There are several proposals for extending the XSL standard with more powerful query and search capabilities.

At the time of writing this book, there was no standard technology for building and processing XML queries. The W3C created a draft recommendation on XML Query Requirements (W3C Working Draft, 31 January 2000, available at http://www.w3.org/TR/xmlquery-req), and there are several initiatives to formulate a standardized language.

The XML Query Requirements document identifies what properties an XML Query Language must, should, and may have. It includes general requirements, like language syntax, declarativity, protocol independence, and error conditions. The document also describes requirements for the XML Query Data Model, which relies on the XML Information Set. Finally, it discusses the details of query functionality, like operations, quantifiers, document part combination, aggregation, and handling document structures.

Two example initiatives for describing an XML query language are XML-QL (http://www.w3.org/TR/NOTE-xml-ql/) and XQL (http://www.w3.org/Style/XSL/Group/1998/09/XQL-proposal.html). The XML-QL proposal draws from database technology traditions. It is more concerned with large repositories, integration, creation of new views of existing data, and transforming data into common data-exchange formats. The XQL proposal (from the document community) is more concerned with integrating full-text and structured queries, describing the structured search, and creating multiple presentations from a single document.

1.2.12 Processing XML documents

At the heart of every XML application is an XML parser that processes an XML document, so that the document elements can be retrieved and transformed into data that can be understood by the application and task in hand. The other responsibility of the parser is to check the syntax and structure (validity and well-formedness) of the document.

Anyone has the freedom to implement a parser that can read and print an XML document. The XML 1.0 Recommendation defines how an XML processor should behave when reading and printing a document. There are several parser implementations, for example the IBM XML Parser for Java (now donated to the Apache Group, and continued under the name Xerces). The parser can be used as part of an application, which wants to extract data from or put its own data into XML format. To provide this functionality parsers specify an application programming interface (API). The XML Recommendation does not specify this API, therefore it is up to the parser’s designer to specify and implement this interface.

Currently, the following two APIs are widely used:

• Simple API for XML

• Document Object Model

Simple API for XML (SAX) was developed by David Megginson and a number of people on the xml-dev mailing list on the Web, because a need was recognized for a simple, common way of processing XML documents. As such, SAX 1.0 is not a W3C recommendation, but it is the de facto standard for interfacing with an XML parser, with many commonly available Java parsers supporting it. SAX is an event-driven, lightweight API for accessing XML documents and extracting information from them. It cannot be used to manipulate the internal structures of XML documents. As the document is parsed, the application using SAX receives information about the various parsing events. The logical structure of an application using the SAX API with the parser is shown in Figure 6.

The SAX driver can be implemented by the XML parser vendor, or as an add-on to the parser. That makes the application using the parser via SAX independent of the parser.
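The event-driven style can be sketched with the `xml.sax` package from the Python standard library. Two caveats: this package follows the later SAX 2 interfaces rather than the Java SAX 1.0 API discussed here, and the `library` document and the `TitleCollector` class are invented for the illustration.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every title element as parsing events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        # Called by the parser when it meets an opening tag.
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        # Called by the parser when it meets a closing tag.
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


# Hypothetical input document.
DOCUMENT = (b"<library>"
            b"<book><title>First</title><pages>100</pages></book>"
            b"<book><title>Second</title><pages>250</pages></book>"
            b"</library>")

handler = TitleCollector()
xml.sax.parseString(DOCUMENT, handler)
print(handler.titles)
```

The application never sees the document as a whole; it keeps only the state it needs (here, a flag and a buffer), and it depends on the handler interface rather than on a particular parser.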

The Document Object Model (DOM) API is a set of interfaces that must be implemented by a DOM implementation such as IBM’s XML Parser for Java. The interfaces, being originally described in the interface definition language (IDL), form a hierarchy (see Figure 7).

The root of the inheritance tree is Node, which defines the necessary methods to navigate and manipulate the tree structure of XML documents. The methods include getting, deleting, and modifying the children of a node, as well as inserting new children into it. Document represents the whole document, and the interface defines methods for creating elements, attributes, comments, and so on. Attributes of a Node are manipulated using the methods of the Element interface. DocumentFragment allows extracting parts of a document. Note that when a DOM application reads an XML document and an object representation is formed, that representation exists only in memory. Changing a DOM object in memory does not automatically modify the original file. That is something an application program has to do for itself.
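A minimal sketch of these interfaces, using `xml.dom.minidom` from the Python standard library (a lightweight DOM implementation, not the IBM parser mentioned above). The `order` document is hypothetical.

```python
from xml.dom import minidom

# Hypothetical source text; in a real application it would come from a file.
SOURCE = "<order><item>pen</item><item>ink</item></order>"

document = minidom.parseString(SOURCE)
root = document.documentElement

# Node interface: navigate to a child and delete it.
first_item = root.firstChild
root.removeChild(first_item)

# Document interface: create new nodes, then insert them with Node methods.
extra = document.createElement("item")
extra.appendChild(document.createTextNode("paper"))
root.appendChild(extra)

# Element interface: manipulate attributes.
root.setAttribute("status", "open")

# The tree has changed only in memory; the source text is untouched.
# Writing the result back is the application's own responsibility.
result = root.toxml()
print(result)
print(SOURCE)
```

Serializing with `toxml()` and saving the string is the explicit step that the paragraph above says an application has to perform for itself.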

Some important facilities that are missing from the DOM Level 1 Recommendation are being defined in DOM Level 2, which is currently a W3C Candidate Recommendation (10 December, 1999). The added functionality in Level 2 contains interfaces for creating a document, importing a node from one document to another, supporting XML Namespaces, associating stylesheets with a document, the Cascading Style Sheets object model, the Range object model, filters and iterators, and the Events object model.

There are certainly applications that could use either SAX or DOM to get the necessary functionality needed when processing XML documents. However, these two approaches to XML processing each have their strengths and weaknesses.

SAX provides a standardized and commonly used interface to XML parsers. It is ideal for processing large documents whose content and structure do not need to be changed. Because the application handles only the events it is interested in, it is typically small and has a small memory footprint. This also means that SAX is fast and efficient, and a good choice for application areas such as filtering and searching, where only certain elements are extracted from a possibly very large document.

Because the events must be handled as they occur, it is impossible for a SAX application, for example, to traverse backwards in the document that is under processing. It is also beyond SAX’s capabilities to create or modify the contents and internal structure of an XML document.

Because every element of an XML document is represented as a DOM object to the application using the DOM API, it is possible to make modifications to the original XML document. Deleting a DOM node means deleting the corresponding XML element and so on. This makes DOM a good choice for XML applications that want to manipulate XML documents, or to create new ones.

DOM is not originally an event driven API like SAX, even though the DOM Level 2 draft specifies events. To extract even a small piece of data from an XML document, the whole DOM tree has to be created. There is no way of creating lightweight applications using DOM. If the original XML document is large, the DOM application that manipulates the document requires a lot of memory. In practice, DOM is mostly used only when creating or manipulating XML documents is a requirement.

There are other initiatives to specify application interfaces to XML documents in various environments. There are native APIs for different programming platforms, like Pyxie for Python (http://www.pyxie.org), or the Java API for XML Parsing (JAXP, java.sun.com/xml), and XML components for various applications like DB2 XML Extender.

For more information about XML support in IBM products see Chapter 3, “XML in the IBM Application Framework for e-business” on page 59. For a more detailed description about DOM and SAX, and application examples, read “The XML Files: Using XML and XSL with WebSphere 3.0” by Luis Ennser, Christophe Chuvan, Paul Fremantle, Ramani Routray and Jouko Ruuskanen.

1.2.13 Organizations concerned with XML

There are standards bodies, civil organizations, and industrial groups and companies behind the specification of XML and related standards.

The ISO (International Organization for Standardization) handles several related standards, such as SGML (http://www.oasis-open.org/cover/sgml-xml.html), HyTime (http://www.hytime.org), DSSSL (http://www.jclark.com/dsssl), and Unicode (http://www.unicode.org).

The World Wide Web Consortium (W3C, http://www.w3.org), founded in 1994, is an international industry consortium, that issues Recommendations (as they call their standards) on XML, XSL, XPath, Namespaces, MathML, and other related technologies. It also issues Proposed Recommendations (Recommendations before the W3C Advisory Committee reviews them), Candidate Recommendations (published for external review), and Working Drafts (submitted for review by W3C members).

The largest industrial consortium promoting structured document management technology (SGML and XML) is the Organization for the Advancement of Structured Information Standards (OASIS, http://www.oasis-open.org). It is a non-profit, international consortium of users and suppliers (including IBM) whose products and services support SGML and XML. OASIS operates XML.ORG, a global XML industry Web site featuring an XML registry and repository that offers automated public access to XML schemas for electronic commerce, business-to-business transactions, and tools and application interoperability. The annual SGML/XML 'XX Conference and the corresponding SGML/XML Europe Conference are co-sponsored by OASIS (together with the Graphic Communications Association), as are other major SGML/XML events.

CommerceNet (http://www.commercenet.com) is a non-profit global membership organization whose mission is to “promote and advance interoperable electronic commerce to support emerging communities of commerce”. It runs several research projects about XML topics.

BizTalk (http://www.biztalk.org) is a Microsoft initiative whose goal is “driving the rapid, consistent adoption of XML to enable electronic commerce and application integration”. It defines the BizTalk Framework™, a set of guidelines for how to publish schemas in XML and how to use XML messages to easily integrate software programs. It runs independently from other industrial organizations like OASIS and CommerceNet.

1.2.14 Typical applications

In this section, we present three examples of how XML is being used in real life to bring benefits to people, businesses, and organizations. For more examples, see the XML section in the IBM developerWorks site at: http://www.ibm.com/developer/xml

XMLSolutions extends EDI applications to non-EDI partners

XMLSolutions XEDI Translator integrates existing electronic data interchange (EDI) systems with XML-based systems, providing both X12 and EDIFACT translations to XML. This technology can dramatically reduce the overall cost of electronic commerce systems, because it enables non-EDI trading partners to participate in EDI transactions, thus reducing trading costs. A majority of the Global 2000 companies have between 10,000 and 40,000 trading suppliers, and 80% of those have not implemented EDI trading. By using XMLSolutions XEDI Translator, this majority of the trading partners can also participate in electronic transactions. IBM WebSphere, WebSphere Studio, and VisualAge for Java are used to implement XMLSolutions products.

Navant Corporation develops Web of Knowledge using XML

The Web of Knowledge e-business platform has built-in capabilities for building enterprise portals, online communities, performing online education, and webcasting. It addresses several business challenges including business integration, employee education and collaboration, Intranet management, online commerce, and online Internet content. Web of Knowledge uses XML for its business object persistence, site content markup, message exchange, template-driven customizable interfaces, client-side data islands, and configuration. It uses XSL for server-side data transformation and filtering, and for client-side filtering. Using the XML standard guarantees that the data will be interoperable between business applications. Web of Knowledge is based on IBM’s WebSphere solution and DB2 database system as the preferred infrastructure to be used in support of the product.

SABRE and Wireless Markup Language

The SABRE Group is one of the major distributors of international travel services, offering electronic travel bookings through travel agents and on the Web worldwide. They are transforming their travel information into XML using a Java application, and then allowing mobile phone users worldwide to look up, reserve, and purchase travel from a mobile phone. The XML is automatically translated into Wireless Markup Language, which is a standard for building applications on mobile phones. The benefits of XML to this application are its extensibility, the speed of development, and the ability to build a standard repository in XML and translate it as needed into a particular environment, in this case the mobile phone.