CSE134A LECTURE NOTES

June 4, 2001
 
 

DOCUMENT TYPE DEFINITIONS (DTDs)

An XML document is well-formed if it satisfies the XML syntax rules.  If it also satisfies a DTD, then it is valid.
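
Any XML parser can check well-formedness.  The following is a minimal sketch using Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD.  The helper name is_well_formed and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text obeys the XML syntax rules, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<person><name><first>Ada</first><last>Lovelace</last></name></person>"
bad = "<person><name>Ada</person></name>"          # tags closed in the wrong order
no_name = "<person><job>professor</job></person>"  # syntax is fine, content is not

print(is_well_formed(good))
print(is_well_formed(bad))
print(is_well_formed(no_name))
```

The third document is well-formed, but it would not be valid against a DTD that requires a name element inside person.  Detecting that requires a validating parser.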

A DTD specifies application-specific syntax.  It cannot specify constraints like "this piece of data is a year after 2000" or even "this piece of data is a number."

In an XML document, the DTD to use is named in a document type declaration, which looks like a special tag.  For example:

<!DOCTYPE person SYSTEM "http://www.ucsd.edu/person.dtd">

In general DTDs can be thousands of lines long, but the basics are simple.  For example:
<!ELEMENT person (name, job*)>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | name | footnote | date)*>
<!ELEMENT image EMPTY>
#PCDATA means parsed character data.  In this type of free text, the special characters < and & must be written as &lt; and &amp; respectively.  Any XML parser translates these back before passing the text to any application using the parser.
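
The translation done by the parser can be seen in a small sketch, again assuming Python's standard xml.etree.ElementTree module.  The element content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The document text contains escaped characters ...
elem = ET.fromstring("<first>AT&amp;T &lt;research&gt;</first>")

# ... but the application sees the translated text.
print(elem.text)
```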

A section written <![CDATA[ text ]]> doesn't need escaped characters.  VoiceXML uses CDATA to include grammars for voice recognition.
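
As a rough illustration, the sketch below parses the same text once with escaped characters and once inside a CDATA section.  It assumes Python's standard xml.etree.ElementTree module; the element name grammar and its content are placeholders, not real VoiceXML.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring("<grammar>a &lt; b &amp;&amp; c</grammar>")
cdata = ET.fromstring("<grammar><![CDATA[a < b && c]]></grammar>")

# Both forms deliver the same characters to the application.
print(escaped.text)
print(cdata.text)
```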

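The number of appearances allowed for a nested element is indicated by * (zero or more) or ? (zero or one) or + (one or more).  With no suffix, the element must appear exactly once.  A comma means sequence, a vertical bar means choice, and parentheses indicate grouping.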
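
Content models behave like regular expressions over the names of child elements.  The sketch below makes this concrete for two of the declarations above.  It is not how a real validating parser works: the regular expressions were translated by hand, and the names CONTENT_MODELS and children_match are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of two content models into regular expressions
# over child names, each name followed by one space.
CONTENT_MODELS = {
    "person": r"(name )(job )*",           # (name, job*)
    "name": r"(first )(middle )?(last )",  # (first, middle?, last)
}

def children_match(elem):
    """Check the direct children of elem against its content model, if known."""
    model = CONTENT_MODELS.get(elem.tag)
    if model is None:
        return True  # this sketch records no model for the element
    sequence = "".join(child.tag + " " for child in elem)
    return re.fullmatch(model, sequence) is not None

ok = ET.fromstring(
    "<person><name><first>A</first><last>B</last></name>"
    "<job>x</job><job>y</job></person>"
)
wrong_order = ET.fromstring("<person><job>x</job><name/></person>")

print(children_match(ok), children_match(ok.find("name")))
print(children_match(wrong_order))
```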

If #PCDATA is one choice among others, the content of the element is said to be mixed.
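
To an application, mixed content appears as text interleaved with child elements.  A minimal sketch, assuming Python's standard xml.etree.ElementTree module, where the text before the first child is stored in .text and the text after a child is stored in that child's .tail.  The sample paragraph is made up.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring(
    "<paragraph>Lecture on <date>June 4, 2001</date> covered DTDs.</paragraph>"
)

print(repr(p.text))          # text before the first child element
print(repr(p[0].text))       # text inside the date element
print(repr(p[0].tail))       # text after the date element
print("".join(p.itertext())) # all the character data, in document order
```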
 
 

ATTRIBUTE DECLARATIONS

Each element named in a DTD can have one or more ATTLIST declarations.  An empty element can still have attributes.  For example
<!ATTLIST image source     CDATA       #REQUIRED
                width      NMTOKEN     #IMPLIED
                height     NMTOKEN     #IMPLIED
                format     CDATA       #FIXED "jpeg"
                alt        CDATA       "No caption provided."
                catalogno  ID          #REQUIRED
                owner      IDREF       "Unknown_owner"
>
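#REQUIRED means the attribute must be given a value in every occurrence of the element.  #IMPLIED means the attribute is optional, and no default value is provided.  #FIXED means the attribute always has the value given in the DTD; if the document specifies it, the value must be the same.  A literal value alone is a default for when the attribute is not given a value.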
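
Default values are supplied by the parser, not by the application.  The sketch below assumes Python's standard xml.parsers.expat module and a shortened version of the image declaration placed in an internal DTD subset; the file name campus.jpg is invented.  Expat is a non-validating parser, so it fills in defaults but would not complain about a missing #REQUIRED attribute.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE image [
  <!ELEMENT image EMPTY>
  <!ATTLIST image source CDATA   #REQUIRED
                  width  NMTOKEN #IMPLIED
                  alt    CDATA   "No caption provided.">
]>
<image source="campus.jpg"/>"""

seen = {}

def start_element(name, attrs):
    # attrs holds the specified attributes plus defaults taken from the DTD
    seen[name] = attrs

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(DOCUMENT, True)

print(seen["image"])
```

The alt attribute is reported with its default value even though the document does not mention it, while the #IMPLIED attribute width is simply absent.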

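CDATA means that the content of an attribute value can be arbitrary text inside quotation marks, while NMTOKEN means the content must be a name token: a string of name characters (letters, digits, periods, hyphens, underscores, colons) with no whitespace.  Unlike an XML name, a name token may begin with a digit, so a width of "100" is legal.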
 
 

IDENTIFIERS AND REFERENCES

If an ATTLIST declaration gives the type ID to an attribute, then each value for that attribute must be an XML name that is unique in the whole document.

Conversely, the type IDREF means that the attribute value must be an XML name that is given elsewhere in the document as the value of an ID attribute.
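
The two rules can be checked by hand in a few lines.  This is a sketch only: a validating parser learns which attributes have type ID or IDREF from the DTD, whereas here they are listed explicitly in ID_ATTRS and IDREF_ATTRS.  Those names, the function check_ids, and the sample catalog are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the DTD declares these attributes as ID and IDREF respectively.
ID_ATTRS = {"catalogno", "id"}
IDREF_ATTRS = {"owner"}

def check_ids(root):
    """Return a list of problems: duplicate IDs, then dangling IDREFs."""
    problems = []
    ids = set()
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr in ID_ATTRS:
                if value in ids:
                    problems.append("duplicate ID: " + value)
                ids.add(value)
    # Second pass, because a reference may precede the ID it points to.
    for elem in root.iter():
        for attr, value in elem.attrib.items():
            if attr in IDREF_ATTRS and value not in ids:
                problems.append("dangling IDREF: " + value)
    return problems

catalog = ET.fromstring("""
<catalog>
  <person id="elkan"/>
  <image catalogno="img1" owner="elkan"/>
  <image catalogno="img2" owner="nobody"/>
  <image catalogno="img1" owner="elkan"/>
</catalog>
""")

print(check_ids(catalog))
```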

IDs and references to IDs are used to establish many-to-many relationships between entities inside a document.  Consider converting a database of multiple tables into a single XML document...
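
One possible way to do such a conversion is sketched below, assuming Python's standard xml.etree.ElementTree module.  The tables, their rows, and the element and attribute names are all hypothetical.  Each row of an entity table becomes an element with an ID-style attribute, and each row of the linking table becomes an empty element with two IDREF-style attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical tables, invented for illustration.
students = [("s1", "Alice"), ("s2", "Bob")]
courses = [("c1", "CSE134A"), ("c2", "CSE130")]
enrollments = [("s1", "c1"), ("s1", "c2"), ("s2", "c1")]  # many-to-many

root = ET.Element("database")
for sid, name in students:
    ET.SubElement(root, "student", id=sid).text = name
for cid, title in courses:
    ET.SubElement(root, "course", id=cid).text = title
for sid, cid in enrollments:
    # One link element per pair; both attributes refer to IDs above.
    ET.SubElement(root, "enrollment", student=sid, course=cid)

print(ET.tostring(root, encoding="unicode"))
```

The key values begin with a letter because an ID value must be an XML name, so purely numeric database keys would need a prefix.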
 
 

XHTML

One simple application of XML is to redefine HTML with a DTD.  The result is called XHTML.  The changes needed to convert a page are extensive, because many HTML ambiguities and errors must be removed.  Fortunately, the conversion can be automated: the HTML Tidy program translates HTML to XHTML.

Recent browsers (IE 5.5 and Netscape 6.0) handle XHTML, though not all do so perfectly.  Unfortunately, application developers still have to cater to older browsers.
 
 

CASCADING STYLE SHEETS

A special tag beginning <? and ending ?> is a processing instruction.  These are considered to be markup, but not elements, so they can appear outside the root element.  Script code embedded in a page, e.g. PHP code written between <?php and ?>, is a special case.
<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
IE5 and later can display XML documents with or without a stylesheet.
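
A parser reports a processing instruction to the application as a target plus a data string.  The sketch below assumes Python's standard xml.dom.minidom module and a made-up person document; it shows the stylesheet instruction sitting outside the root element.

```python
from xml.dom import minidom

DOCUMENT = """<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
<person><name><first>Charles</first><last>Elkan</last></name></person>"""

doc = minidom.parseString(DOCUMENT)

# The document node has the processing instruction and the root as children.
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        print("processing instruction:", node.target, "|", node.data)
    elif node.nodeType == node.ELEMENT_NODE:
        print("root element:", node.tagName)
```

The parser does not interpret the data string.  Pulling out href and type is left to the application, here a browser.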
 
 



Copyright (c) by Charles Elkan, 2001.