CSE134A LECTURE NOTES

November 14, 2001
 
 

ANNOUNCEMENTS

We're returning the midterm today.  The mean was 39.4 out of 50, with standard deviation 5.5, range 26 to 49.

Project 3 is due today, and Project 4 is being distributed.

The CAPE and TA evaluations will be on Wednesday next week, November 21.  This is the day before Thanksgiving, but do urge all your friends to come to class so they can provide feedback.

See the article "Building a Large-scale E-commerce Site with Apache and mod_perl" by Perrin Harkins, October 2001.  This is a case study of the etoys.com site.  The company no longer exists, but the site was the third busiest e-commerce site before Christmas 1999 and Christmas 2000, after eBay and Amazon.  The article shows how to design a multi-tier architecture, similar to the Windows DNA architecture, and how to do component-based programming using a scripting language similar to PHP.

 

XML

XML is a human-readable notation for writing and exchanging structured information of all sorts.  XML stands for "eXtensible Markup Language."  It is really a meta-markup language, i.e. a language for defining other markup languages.  It is a language for "marking up" (i.e. indicating explicitly) syntax and (very slightly) for indicating semantics, i.e. meaning.  Important concepts: XML is not a presentation language (like HTML), not a programming language (like PHP), not a database language (like SQL), and not a communication protocol (like HTTP).  Instead, XML is a language for portable data.  It can be used as part of a high-level communication protocol, e.g. SOAP, which is discussed below.

XML can be used as a notation for a programming language, e.g. VoiceXML.
 
 

XHTML

One simple application of XML is to redefine HTML.  This is called XHTML.  The changes needed are extensive, because many HTML ambiguities and incorrectnesses must be removed.  Fortunately, all the changes can be automated: the HTML Tidy software will translate HTML to XHTML.

Recent browsers (IE 5.5 and Mozilla, i.e. Netscape 6.0) handle XHTML, though not all do so perfectly.  Unfortunately, application developers still have to cater to older browsers.
 
 

XML SYNTAX

A document is "well-formed" if it satisfies all the XML syntax rules.  No subsets or supersets of XML are allowed.

An XML document is a tree with exactly one root element, and no overlapping elements.  XML is case-sensitive, and names and text can use non-Western characters, since XML is based on Unicode.
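
As a minimal sketch of these rules, the following Python fragment uses the standard library parser xml.etree.ElementTree, which rejects any document that is not well-formed.  The helper name is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical sample documents, one per rule mentioned above.
samples = {
    "one root, properly nested": "<a><b>hi</b><c/></a>",
    "two root elements": "<a/><b/>",
    "overlapping elements": "<a><b></a></b>",
    "case mismatch in tags": "<Name>Alan</name>",
}

for label, text in samples.items():
    print(label, "->", is_well_formed(text))
```

Only the first sample is accepted.  The parser gives no partial result for the others, which is the practical meaning of "no subsets or supersets."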

Start tags are written <elementname ...> and end tags are written </elementname>.  Start tags can have attributes, which have the syntax name="value".  There is no XML-defined syntax inside attribute values, so nested elements are preferable for structured data.  Also, attribute names must be unique within each start tag.
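
A short sketch of what this means for an application, again assuming Python's xml.etree.ElementTree.  The professions attribute and the helper name accepts are invented for this example.

```python
import xml.etree.ElementTree as ET

def accepts(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

person = ET.fromstring('<person born="1912" died="1954" id="p342"/>')

# Attribute values always arrive as plain strings; the application converts them.
lifespan = int(person.get("died")) - int(person.get("born"))
print(person.attrib, lifespan)

# XML gives no structure inside a value, so a list packed into one
# attribute (hypothetical) must be split by application code.
packed = ET.fromstring('<person professions="mathematician, cryptographer"/>')
professions = [p.strip() for p in packed.get("professions").split(",")]
print(professions)

# Repeating an attribute name in one start tag is not well-formed.
print(accepts('<person born="1912" born="1913"/>'))
```

With nested <profession> elements instead, the parser itself would deliver the list, and no splitting convention would need to be agreed on.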

Tags are nested, and can appear inside free text.  <name/> is an empty-element tag, equivalent to <name></name>; HTML has no such syntax.

In free text, the special characters < and & must be written as &lt; and &amp;.  Any XML parser translates these back before passing the text to any application using the parser.
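
The round trip can be sketched in Python as follows.  The escape function is from the standard module xml.sax.saxutils; the element name code and the sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b && c"

# Writing: replace the special characters before placing text in a document.
escaped = escape(raw)
document = "<code>" + escaped + "</code>"
print(document)

# Reading: the parser translates the entities back for the application.
element = ET.fromstring(document)
print(element.text)

# Without escaping, the same text is not well-formed.
try:
    ET.fromstring("<code>" + raw + "</code>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False
print(unescaped_ok)
```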

<?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>                 optional XML declaration
<!DOCTYPE person SYSTEM "http://www.ucsd.edu/person.dtd">
        <person born="1912" died="1954" id="p342">
           <name>
             <first_name>Alan</first_name>
             <last_name>Turing</last_name>
           </name>
           <!-- Did the word computer scientist exist in Turing's day? -->        this is a comment
           <profession>computer scientist</profession>
           <profession>mathematician</profession>
           <profession>cryptographer</profession>
        </person>
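
To see the tree structure of this example, here is a minimal sketch that parses the element part of the document with Python's xml.etree.ElementTree.  The XML declaration and the DOCTYPE line are left out to keep the fragment self-contained, and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

document = """<person born="1912" died="1954" id="p342">
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <!-- this parser skips comments by default -->
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>"""

person = ET.fromstring(document)           # the root element of the tree

# Paths are relative to the element they are called on.
full_name = (person.findtext("name/first_name") + " "
             + person.findtext("name/last_name"))
professions = [p.text for p in person.findall("profession")]

print(person.tag, person.get("id"))
print(full_name)
print(professions)
```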
A CDATA section, written <![CDATA[ text ]]>, can contain < and & without escaping them.  VoiceXML uses CDATA to include grammars for voice recognition.
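
A small sketch of the equivalence, assuming Python's xml.etree.ElementTree.  The element name rule and its content are invented for the example.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<rule><![CDATA[ if (a < b && b < c) ]]></rule>")
with_entities = ET.fromstring("<rule> if (a &lt; b &amp;&amp; b &lt; c) </rule>")

# The application sees the same character data either way.
print(repr(with_cdata.text))
print(with_cdata.text == with_entities.text)
```

CDATA is a convenience for the person writing the document; it does not change what the parser hands to the application.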

A tag beginning <? and ending ?> is a processing instruction.  These are considered to be markup, but not elements, so they can appear outside the root element.  Script code, e.g. PHP code written inside <?php ... ?>, is a special case of this syntax.
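
The following sketch uses Python's xml.dom.minidom, which keeps processing instructions as nodes of their own, to show one sitting outside the root element.  The xml-stylesheet instruction and the file name style.css are illustrative assumptions.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    '<?xml-stylesheet type="text/css" href="style.css"?>'
    '<person/>'
)

# The document node has two children: the instruction, then the root element.
instructions = [n for n in doc.childNodes
                if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE]
elements = [n for n in doc.childNodes if n.nodeType == Node.ELEMENT_NODE]

for pi in instructions:
    print(pi.target, "|", pi.data)   # target name, then uninterpreted text
print(doc.documentElement.tagName)
```

The parser does not interpret the text after the target name; that is left to whichever application recognizes the target.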
 
 

DOCUMENT TYPE DEFINITIONS

An XML document is well-formed if it satisfies the XML syntax rules.  If it satisfies a document type definition (DTD) also, then it is valid.

A DTD specifies application-specific syntax.  It cannot specify constraints like "this piece of data is a year after 2000" or even "this piece of data is a number."  XML schemas can specify data types, but they are more complex and less widely used.
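
Since a DTD cannot express such constraints, the application has to check them itself after parsing.  A minimal sketch in Python, where the order element, its year attribute, and the function name are all hypothetical:

```python
import xml.etree.ElementTree as ET

def is_year_after_2000(value):
    """Check a constraint on a data value that a DTD cannot express."""
    try:
        return int(value) > 2000
    except (TypeError, ValueError):   # missing attribute, or not a number
        return False

for text in ('<order year="2001"/>', '<order year="1999"/>',
             '<order year="soon"/>', '<order/>'):
    order = ET.fromstring(text)
    print(text, "->", is_year_after_2000(order.get("year")))
```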

In an XML document, the DTD to use is given by a document type declaration, which looks like a special tag, for example

<!DOCTYPE person SYSTEM "http://www.ucsd.edu/person.dtd">
In general DTDs can be thousands of lines long, but the basics are simple.  For example:
<!ELEMENT person (name, job*)>
<!ELEMENT name (first, middle?, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT paragraph (#PCDATA | name | footnote | date)*>
<!ELEMENT image EMPTY>
#PCDATA means parsed character data, i.e. free text in which the special characters < and & must be written as &lt; and &amp; so that the parser can translate them.  If #PCDATA is one choice among others, the content of the element is said to be mixed.

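The number of appearances allowed for a nested element is indicated by * (zero or more), ? (zero or one), or + (one or more); no symbol means exactly one.  Commas indicate sequence, a vertical bar indicates choice, and parentheses indicate grouping.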
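
The notation is close to that of regular expressions.  As a rough sketch only (not a real validating parser), the Python fragment below translates an element-only content model into a regular expression over the names of an element's children.  The function names are invented, and #PCDATA, EMPTY, and attribute declarations are not handled.

```python
import re
import xml.etree.ElementTree as ET

def model_to_regex(model):
    """Translate an element-only DTD content model into a compiled regex."""
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model)
    pieces = []
    for token in tokens:
        if token == ",":
            continue                          # a sequence is concatenation
        if token == "(":
            pieces.append("(?:")
        elif token in (")", "|", "?", "*", "+"):
            pieces.append(token)              # same meaning as in a regex
        else:
            # An element name; the trailing space keeps "job" and "jobs" apart.
            pieces.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(pieces))

def children_match(element, model):
    """True if the child elements of element fit the content model."""
    names = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(names) is not None

good = ET.fromstring("<person><name/><job/><job/></person>")
bad = ET.fromstring("<person><job/><name/></person>")
print(children_match(good, "(name, job*)"))
print(children_match(bad, "(name, job*)"))
```

The first call succeeds and the second fails, because the model requires name to come before any job.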
 
 



Copyright (c) by Charles Elkan, 2001.