XML Basics


Learning Objectives
    What You Will Be Able To Do Upon Completion

Reading
    Understanding XHTML 1.0
    XHTML Foundations
    XHTML in the Real World

Assignment 1
    1.1 The Power of XML
          Introduction
          XML as Both a Data Storage/Exchange Format and a GUI Format

    1.2 Web Development Then and Now
          The Rise and Fall of Dynamic HTML
          Separation of Code from Data

    1.3 Introduction to XML Syntax
          What is XML Markup?
          What is an XML document?

    1.4 Parts of an XML Document
          Introduction
          Elements
          Attributes
          Empty Elements

    1.5 More on XML Syntax
          Nesting
              Properly-nested elements
              Misnested elements
          Root Element
          Non-root Elements
          White space

    1.6 Requirements of Well-formed Documents

Learning Objectives

This unit covers the basics of XML. The next unit builds on them to compare XML languages with HTML and to show how XHTML fits into the picture.

Upon completion of this unit you will be able to:

understand, at an introductory level, how XML is used to share data more effectively between software applications
describe XML's role in relation to the web-based technologies you are already familiar with
explain how XML is used as document-based middleware
explain how XML is used as a graphical user interface format
define elements and attributes, and describe how they structure an XML document

Reading

Understanding XHTML 1.0 p.9-22
XHTML Foundations p.23-34
XHTML in the Real World p.35-48

Assignment 1

Download IE6 and start opening up XML files. Use the plus and minus signs to expand and collapse the different elements of the XML files.

XML File in Internet Explorer 6.0

1.1 The Power of XML

Introduction

The Web is an interwoven "network" comprised of stored data and software applications. Today's web sites must provide data to an increasingly wide variety of software applications. Literally everything that takes place over the Web involves some form of data exchange.

Examples include:

providing a username and password to a remote server when logging on.
sending and receiving e-mail.
uploading a document to the Web.
accessing a web page.

Of course, we are not really exchanging data directly. Software programs continuously send and receive data on our behalf. Although we can often understand information in an unfamiliar or unexpected format, machines cannot start processing information until they receive the data in the exact format they are "expecting."

XML's primary objective is to provide a standardized data storage format capable of being "understood" and processed by all software programs.

Traditionally, converting between proprietary file formats without losing data is an uphill battle that a developer must fight again every time software is replaced or upgraded. The Web's software development cycle is continually creating new and better kinds of software. The primary challenge for a web developer usually isn't installing and configuring the new software, but converting data from the old proprietary file storage format into the new one. If there were a single, standardized data storage format that could be implemented by all software applications, these conversion “headaches” would be the exception and not the rule.

XML markup provides a standardized key for data sharing between applications.

The requirements of software applications are in constant flux, so it only makes sense to store your data in a format that is easiest for software to access and process. This way, software may come and go, but your data will live on forever. Since it will be a few years before all software is XML-enabled, the current implementation strategy is to integrate XML documents into existing application systems in application "layers" that enable XML data to be imported and exported “on the way in” and “on the way out” of software applications.

Storing your content in XML means:

You will be able to share data more easily within your own organization.
You will be able to share data more easily with outside systems and applications.
Your data will never be tied to one particular OS, programming language, development platform or software application.
Your data can be easily integrated into the "next big thing."

XML documents can be exchanged between applications using the Web's existing messaging framework and used in a variety of ways on the client, server, or some combination of the two. Since XML documents are just plain text documents (like HTML documents), transporting them over the web to another application is a breeze.

XML documents can also be:

stored on an HTTP server and accessed via a URL (by an end user or an application).
attached to an email.
transported over FTP or telnet.
accessed over a file directory system at the OS level.
accessed over any protocol able to transport text documents.

XML is more than just a structured data storage format. XML application development involves new ways of constructing “virtual” software applications by connecting together existing software applications over the Web.

XML as Both a Data Storage/Exchange Format and a GUI Format

XML makes it easier to continually expand, scale, and add new elements to your web site, while still keeping your future options open.

In the same way that documents can have a structure, our web applications have a structure (or "infrastructure") that's also sometimes referred to as an "application architecture." A web site needs to continually expand and add new features and services in order to fulfill the needs of its audience. An XML development strategy is based on the notion of integrating "new" features into your site by simply tapping into the existing applications of other sites.

You want your web site's applications to be able to interface easily with other applications that may be larger or smaller in comparison. By thinking about software applications in terms of their smaller parts, it becomes easier to understand the individual pieces you have to work with. Smaller software applications can function seamlessly within a larger application's existing infrastructure. (An HTML parser is an example of a self-contained software application that also works as a component within the larger browser application.)

XML documents can provide a machine-readable description of the goods and services of a web site that simplifies the process of interfacing with the applications and services of others.

HTML documents are used as the graphical user interface (GUI or "goo-ey") for web-based applications to provide a "front-end" for the end-user to interact with. This HTML "front end" is a given for any web-based application, but what you decide to connect together on the back end is up to you. XML can be used to connect together a host of features and services on the "back-end."

Instead of re-inventing the wheel every time you want to add a service or a feature to your site, you can take advantage of the work and experience of others by connecting directly to their service from your own front end. In other words, your HTML-based front-end might connect to a back-end service that actually "lives" on a server other than your own. There's no way for an end user to know where the features of your website actually "live." This technique is useful because it enables you to build upon the work of others by incorporating existing services into your own front end, rather than writing all of your own features from scratch.

1.2 Web Development Then and Now

The Rise and Fall of Dynamic HTML

In the early days of the Web, developers were at the mercy of the browser companies to incorporate features that could be used on their web pages. What started as browser "innovations" often led to some kind of non-standard HTML "feature creep." We learned many lessons during this formative period, but most of the little lessons were part of the same big one: browser-specific applications are unpredictable and ultimately useless. We don't want to develop browser-specific applications anymore. We want the “mom and pop” that just dialed up using that old AOL version 3.0 disk to still be able to buy stuff on our web site. One of the important “givens” of the Web is that it is impossible to predict what brand and version of browser your web-site visitors will be using.

"Browser detection scripts" are often employed; these use a server-based scripting language to detect which browser version is in use and then serve the appropriate HTML page to the end user. However, even if you can detect a browser type successfully, maintaining two or more versions of your HTML content and browser-specific JavaScript will quickly become quite a juggling act. Another shortcoming of this method is that there is a greater potential for errors to be introduced, since both the “content” people and the “programming” people will be editing the files, and there is more than one version of each file. The content people, in the course of their work, could easily erase a character from the embedded JavaScript and disable the file. The developers could accidentally erase a sentence and damage the content.

At this point, there are so many different versions of the three major browsers (IE, Netscape, and AOL) that even a browser-detection script provides an application with only an educated guess about the brand and version of the end-user's browser. Almost all browser brands and versions are able to process HTML 4.0 and JavaScript 1.1 correctly, so it's a good idea to use them as a base development platform.

Separation of Code from Data

As an XML developer, it is important to develop the ability to isolate your data from your code whenever possible and keep this separation in mind when you are designing the structure of your applications. When the presentation and processing of information are embedded within the same document it becomes more difficult to manipulate that data on its own.

The first step to developing useful XML software applications is to modularize your software application into its smaller, logical components.

The outcome of the process of breaking every software process into its smaller components, or modules, is that you will isolate your data from the code used to access and process it.

The methods used to store, access, and process your data don't necessarily have to depend on each other. This is a far cry from only a few years ago when processing data intelligently meant depending on expensive proprietary software and storage formats, and requiring your business partners to do the same. Before you go on, review the following table, which summarizes what we've learned so far about how web sites and applications used to be designed and how we want to design them now, keeping our new XML perspective in mind.

Web Development Then and Now

Old Web	New Web
Browser-specific web sites	No Browser-specific web sites
Dynamic HTML (Non-accessible web sites)	No Dynamic HTML
Code (presentation and processing information) intermingled with data (stored content)	Data kept separate from its presentation and processing code (externalized into "modules")
Dedicated client/server software applications	No dedicated client/server applications: develop only for a "web-based" HTML front-end

Now that we've learned what XML can do, let's start to learn how to use it.

1.3 Introduction to XML Syntax

What is XML Markup?

All XML markup actually "does" is provide a simple format for naming and structuring text-based data.
XML documents are made up of character data (content) and markup (code).
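There are five "special characters" (<, >, &, ' and ") that an XML parser may interpret as markup rather than character data. To use one of them as literal content, write the corresponding predefined entity reference instead: &lt;, &gt;, &amp;, &apos; or &quot;.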
The angle brackets inform the parser which characters constitute structural markup (code) and which constitute character data that should be processed (content).
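
The handling of the five special characters can be sketched in code. The example below is a minimal illustration, not part of the original unit: it assumes Python's standard library (xml.sax.saxutils and xml.etree.ElementTree) as a stand-in for "an XML parser," and the <note> element name and sample sentence are made up for the demonstration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical sample content containing three of the special characters.
raw = 'Tom & Jerry say "1 < 2"'

# escape() replaces &, < and > by default; the extra mapping covers the quotes.
quote_entities = {'"': "&quot;", "'": "&apos;"}
escaped = escape(raw, quote_entities)

# With the special characters escaped, the text is safe to use as content.
parsed = ET.fromstring("<note>" + escaped + "</note>")

# Without escaping, the bare "&" and "<" are read as markup and parsing fails.
try:
    ET.fromstring("<note>" + raw + "</note>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

print(escaped)
print(parsed.text == raw)
print(unescaped_ok)
```

The parser turns the entity references back into the original characters, so the application reading the document sees the same text that was stored.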

What is an XML document?

XML documents can be used to represent any type of text document.

Let's think for a minute about the kinds of documents we are used to seeing every day: written letters, books, pamphlets, newspapers, magazines, etc. Conceptually, many of these paper-based documents transfer very easily to their digital counterparts, and any digital text document can be generated using XML.

TEXT	DIGITAL
Letter	Email
Book	E-book
Pamphlet	Web page
Newspaper	Online Newspaper
Magazine	Online Magazine

1.4 Parts of an XML Document

Introduction

Every XML document has a data model comprised of the elements and attributes that are required or allowed to structure its content (character data). We'll take a look at those elements and attributes in this lecture topic. In the same vein as the data model, each element has a content model made up of the elements and attributes that a particular element is allowed to contain. Don't worry if you don't understand this structure yet; read on.

Elements

Elements are the logical components of XML documents. When all of our documents are abstracted into smaller parts, we can manipulate their content from whichever perspective we require. The smaller parts of our larger documents can be represented in XML using "elements." A "header" element, for example, could be used to group together the "to", "from" and "subject" elements of an "email document."

Elements are one of the most commonly used types of markup: the bracketed items that are often referred to as "tags." Elements consist of words that serve as the "names" for your element "tags" and are surrounded on either side by "less than" (<) and "greater than" (>) characters. These start and end tags may be used to encapsulate character data (text), as in the following example.

<summary>Text goes in here</summary>

Besides character data, an element may also be made up of subelements. In the graphic below, the "book" element's content model consists of the "summary" subelement, while the "summary" element's model contains no subelements, only character data.

Element content vs. character data
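
The difference between element content and character data can also be seen programmatically. This is a rough sketch that assumes Python's xml.etree.ElementTree module as the parser; the <book> and <summary> names come from the example above, and the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# "book" contains a subelement; "summary" contains only character data.
book = ET.fromstring("<book><summary>Text goes in here</summary></book>")
summary = book.find("summary")

# Iterating over an element yields its subelements (element content).
book_children = [child.tag for child in book]
summary_children = [child.tag for child in summary]

# The .text property holds the character data directly inside an element.
print(book_children, book.text)
print(summary_children, summary.text)
```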

Attributes

Attributes provide a means of assigning "extra" information to elements in order to further describe properties of those elements.

Attribute-value pairs can be associated with elements by including them inside of an element's start tag.

An attribute-value pair used within the "book" element's start-tag.

<book year="1986">
<title>Old Yeller</title>
</book>
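
As a small sketch of how an application might read this attribute-value pair, the following assumes Python's xml.etree.ElementTree module; any XML parser would expose the same information in its own way.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring('<book year="1986"><title>Old Yeller</title></book>')

# Attributes are exposed as name/value pairs; values are always strings.
year = book.get("year")
all_attributes = dict(book.attrib)

# The subelement's character data is read separately from the attributes.
title = book.findtext("title")

print(year, all_attributes, title)
```

Note that the year arrives as the string "1986"; converting it to a number is left to the application.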

Empty Elements

If an element contains no subelements or character data, that element is said to be "empty." In most cases, an empty element will contain an attribute-value pair inside of a single tag that is "terminated" by a forward slash before its closing bracket. The slash before the ending bracket serves the same function as an end tag's forward slash.

An element containing nothing more than an attribute is still considered "empty" and "without content" because attribute values count as markup, not character data.

<book year="1986"/>

Technically, an empty element can also be expressed using element start and end tags.

<book year="1986"></book>
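
To illustrate that the two notations describe the same empty element, here is a minimal sketch assuming Python's xml.etree.ElementTree module. The helper name is_empty is hypothetical, and the element name is written in lowercase in both tags so that the start and end tags match.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<book year="1986"/>')
long_form = ET.fromstring('<book year="1986"></book>')

def is_empty(element):
    """Return True if the element has no subelements and no character data."""
    return len(element) == 0 and not element.text

# Both notations produce an empty element that still carries its attribute.
print(is_empty(short_form), is_empty(long_form))
print(short_form.get("year"), long_form.get("year"))

# Serialized again, the two parsed elements are indistinguishable.
same_output = ET.tostring(short_form) == ET.tostring(long_form)
print(same_output)
```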

1.5 More on XML Syntax

Nesting

Unlike HTML, XML requires every element to have both a start tag and an end tag (or to use the empty-element form), and its elements must be nested properly. Nesting refers to placing an element inside the content of another element ("nested" subelements). This means that a subelement's end tag must occur before its parent element's end tag.

Properly-nested elements

<book><title>My Life</title></book>

"Child" subelement tags cannot "overlap" with those of their parent elements. The example below would produce an error because the <title> subelement's closing tag occurs after the closing tag of its <book> parent.

Misnested elements

<book><title>My Life</book></title>

Caution! XML syntax is case-sensitive. In each of the examples below, the start and end tags are both variations on the string "n-a-m-e," but an XML parser treats names that differ in case as different names. The end tag therefore does not match the start tag, and the parser reports an error.

<name>Blaster</Name>
<Name>Blaster</name>
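
The nesting and case rules above can be checked with a few lines of code. This is a sketch under the assumption that Python's xml.etree.ElementTree module is used as the parser; the helper name is_well_formed is made up for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

nested = is_well_formed("<book><title>My Life</title></book>")
misnested = is_well_formed("<book><title>My Life</book></title>")

# Tag names that differ only in case do not match each other.
mixed_case_1 = is_well_formed("<name>Blaster</Name>")
mixed_case_2 = is_well_formed("<Name>Blaster</name>")
same_case = is_well_formed("<Name>Blaster</Name>")

print(nested, misnested, mixed_case_1, mixed_case_2, same_case)
```

Capital letters are not an error in themselves: <Name>Blaster</Name> is accepted because the start and end tags match exactly.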

Root Element

The very first element of a document is known as the root element.
The root element is the top-level element in the XML document hierarchy.
The root element contains all other elements. Each document can have only one root and all other elements must be nested within it.
For instance, the code below would produce a parsing error. An XML parser would treat the first <book> element as the root element because it is the first element it comes across.
When the parser then encountered a second <book> element after the root had already been closed, it would determine that the document has more than one top-level element and is therefore not well-formed.

<book>
<title>Tom Sawyer</title>
<author>Mark Twain</author>
</book>

<book>
<title>Tom Sawyer</title>
<author>Mark Twain</author>
</book>
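
The single-root rule can be demonstrated in code as well. The sketch below assumes Python's xml.etree.ElementTree module; the <library> wrapper element is a hypothetical name chosen for the illustration, not something the unit prescribes.

```python
import xml.etree.ElementTree as ET

one_book = (
    "<book>"
    "<title>Tom Sawyer</title>"
    "<author>Mark Twain</author>"
    "</book>"
)

# Two top-level <book> elements: the document has no single root.
two_roots = one_book + "\n" + one_book
try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False

# Nesting both books inside one (hypothetical) wrapper element fixes the problem.
library = ET.fromstring("<library>" + two_roots + "</library>")
book_count = len(library.findall("book"))

print(two_roots_ok, library.tag, book_count)
```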

Non-root Elements

Non-root elements (or subelements) may appear as many times as desired, as long as they are properly nested within the document's structure. Subelements are said to be "children" of the "parent" elements they are nested within. The logical structure of a document's components can also be represented by a tree:

the tree trunk is the root element
the branches are subelements that contain other subelements
the leaves are subelements that don't contain other subelements
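
The tree analogy maps directly onto how a parsed document can be walked. The following is a minimal sketch assuming Python's xml.etree.ElementTree module; the classify function and the <library> sample document are invented for this example.

```python
import xml.etree.ElementTree as ET

document = ET.fromstring(
    "<library>"
    "<book><title>Tom Sawyer</title><author>Mark Twain</author></book>"
    "</library>"
)

def classify(element, is_root=True):
    """Label each element as root (trunk), branch, or leaf, in document order."""
    if is_root:
        kind = "root"
    elif len(element) > 0:
        kind = "branch"   # a subelement that contains other subelements
    else:
        kind = "leaf"     # a subelement with no subelements of its own
    labels = [(element.tag, kind)]
    for child in element:
        labels.extend(classify(child, is_root=False))
    return labels

tree_labels = classify(document)
for tag, kind in tree_labels:
    print(tag, kind)
```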

White space

In XML, white space is not automatically collapsed into a single space as is the case in HTML. In some cases, white space is permitted but may cause confusion. For example, a white space outside the tags might be interpreted as a "space" character.

XML only allows white space in specific locations within a start or end tag. White space is not allowed before the element name in either the start tag or the end tag.

Example 1 below would produce an error (because white space is not allowed before an element name), but Example 2 would not (because white space is allowed between an element name and an attribute name, and before the tag's closing bracket).

Example 1

<name>Blaster</ name>

Example 2

<name date="July" >Blaster</name>

If you are just starting out with XML, it's best not to get too fancy with your use of white space.
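
The white space rules above can be tried out with a short script. This is an illustrative sketch that assumes Python's xml.etree.ElementTree module as the parser; is_well_formed is a hypothetical helper, and the "Blaster   Jones" sample text is made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Example 1: white space before the element name in the end tag.
example_1 = is_well_formed("<name>Blaster</ name>")

# Example 2: white space after the attribute, before the closing bracket.
example_2 = is_well_formed('<name date="July" >Blaster</name>')

# White space before the element name in a start tag is rejected as well.
space_in_start = is_well_formed("< name>Blaster</name>")

# White space inside character data is preserved, not collapsed as in HTML.
spaced = ET.fromstring("<name>  Blaster   Jones  </name>")

print(example_1, example_2, space_in_start)
print(repr(spaced.text))
```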

1.6 Requirements of Well-formed Documents

XML documents are structured specifically to be reshaped and re-purposed on demand. For this reason, it is very important for the beginning and end of each piece of data contained within an XML document to be clearly defined.

An XML document's syntax must be well-formed, so its separate pieces can be easily recognized by an XML parser.

XML's "well-formedness" requirements are:

All start tags must have matching end tags.
All elements must be "nested" properly.
Attribute values must be properly quoted.
Empty elements must be properly "terminated."
There must be exactly one root element.

The code fragment below provides examples of two of XML's "typical" well-formedness violations: misnested elements (the <phone> element opens inside <name> but closes outside it) and a misquoted attribute value (the value of the type attribute is missing its quotation marks).

<person><name type=customer> Tom Jones <phone> 555-1234</name> </phone></person>

Correctly-nested elements/Properly-quoted attribute values

<person><name type="customer"> Tom Jones </name><phone> 555-1234 </phone></person>
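
As a closing sketch, a parser can be used to find out whether a document is well-formed and, if not, what the first violation is. The example assumes Python's xml.etree.ElementTree module; the first_violation helper and the sample fragments (based on the person/name/phone markup above) are illustrative, and the wording of the error messages depends on the parser.

```python
import xml.etree.ElementTree as ET

def first_violation(text):
    """Return None if the text is well-formed, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as error:
        return str(error)

samples = {
    "missing end tag": "<person><name>Tom Jones</person>",
    "misnested": "<person><name>Tom Jones <phone>555-1234</name></phone></person>",
    "unquoted attribute": "<person><name type=customer>Tom Jones</name></person>",
    "two roots": "<name>Tom Jones</name><phone>555-1234</phone>",
    "corrected": '<person><name type="customer"> Tom Jones </name>'
                 "<phone> 555-1234 </phone></person>",
}

results = {label: first_violation(text) for label, text in samples.items()}
for label, message in results.items():
    print(label, "->", "well-formed" if message is None else message)
```

A parser stops at the first violation it meets, so a document with several problems has to be corrected and re-checked one error at a time.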

This work is licensed under a Creative Commons License.