CSE134A LECTURE NOTES

April 16, 2001
 
 

ANNOUNCEMENTS

Some of the TA office hours are being converted into lab hours.  Check the discussion boards for announcements.  Note that the new Discus boards are in use now.

I will talk more about regular expressions in a later lecture.  Today I will give an introduction to relational databases and SQL.

For the project, it is very important that you adopt a well-organized and efficient strategy for doing information extraction.  Do not just use regular expressions developed by trial and error.  One part of your report should be an explanation of your  strategy, which should be as clear and simple as possible.  In the report, describe the capabilities and limitations of your strategy.  Which changes in the data sources could you handle, and which changes would break your strategy?
 
 

THREE-TIER ARCHITECTURE

The third tier is the database (db) server.  In general, the db server and the web server are different computers.  They communicate via a network, and specifically via Unix sockets.

The db client is a PHP script.  The script sends commands written in the SQL language.  The db server sends responses.  A response is often a record containing data.
 
 

CONNECTING TO THE MYSQL SERVER

// MySQL database connection information
$username = "php";
$password = "pw";
$hostname = "localhost";
$dbname = "mydb";

if (!($link=mysql_pconnect($hostname, $username, $password)))
        printerror("cannot connect to $hostname by $username");
if (!mysql_select_db($dbname, $link))
        printerror("cannot select $dbname database");

pconnect establishes a persistent connection, meaning essentially that many scripts can reuse the same connection, which is much more efficient than creating a new process on the db server.
 
 

TABLES AND RECORDS

Tables are divided into rows and columns.  A row is also called a record or tuple.  A column is also called a field or attribute.  A table is also called a relation

create table temp (
    rank     integer unsigned not null auto_increment primary key,
    k        integer unsigned not null,
    s        double not null,
    d        datetime not null,
    quantile double not null )

The db server takes care of maintaining the tables on disk, of allowing multiple db clients to read and write to them without data corruption, and of caching them in memory to increase read/write speed.
 
 

FIELD TYPES

Unlike in PHP, in MySQL and in all other relational databases, columns have fixed, predeclared types.  The most important types are: Modern database systems can handle very large fields, e.g. photographs.  BLOB stands for "binary large object."  BLOB values are case-sensitive while TEXT values are not (a leftover from the days before eight-bit ascii character codes).

The strange range for TIMESTAMP comes from a Unix-specific representation.
 
 

CREATING TABLES

drop table if exists temp

create table temp (
    rank     integer unsigned not null auto_increment primary key,
    k        integer unsigned not null,
    s        double not null,
    d        datetime not null,
    quantile double not null )
 
 

QUERIES

Queries allow data to be retrieved based on its content, not its storage location.  For example:
        select author, date from postings where date > '2000-11-11'

The answer from a query is called a "result set."  Conceptually, a result set is the same as a new table.

Result sets can be ordered:
        select author, date
            from postings
            where date > '2000-11-11'
            order by date

Each line that adds detail to what should be selected is called a "clause."  The example above has from, where, and order by clauses.  By default order is increasing.

We can compute summary information about result sets:
        select author, count(*) from postings where date > '2000-11-11'
                            group by author
 
 



Copyright (c) by Charles Elkan, 2001.