I will talk more about regular expressions in a later lecture. Today I will give an introduction to relational databases and SQL.
For the project, it is very important that you adopt a well-organized
and efficient strategy for doing information extraction. Do not just
use regular expressions developed by trial and error. One part of
your report should be an explanation of your strategy, which should
be as clear and simple as possible. In the report, describe the capabilities
and limitations of your strategy. Which changes in the data sources
could you handle, and which changes would break your strategy?
The db client is a PHP script. The script sends commands written
in the SQL language. The db server sends responses. A response
is often a record containing data.
if (!($link=mysql_pconnect($hostname, $username, $password)))
printerror("cannot connect
to $hostname by $username");
if (!mysql_select_db($dbname, $link))
printerror("cannot select
$dbname database");
pconnect establishes a persistent connection, meaning essentially
that many scripts can reuse the same connection, which is much more efficient
than creating a new process on the db server.
create table temp (
rank integer unsigned
not null auto_increment primary key,
k
integer unsigned not null,
s
double not null,
d
datetime not null,
quantile double not null )
The db server takes care of maintaining the tables on disk, of allowing
multiple db clients to read and write to them without data corruption,
and of caching them in memory to increase read/write speed.
The strange range for TIMESTAMP comes from a Unix-specific representation.
create table temp (
rank integer unsigned
not null auto_increment primary key,
k
integer unsigned not null,
s
double not null,
d
datetime not null,
quantile double not null )
The answer from a query is called a "result set." Conceptually, a result set is the same as a new table.
Result sets can be ordered:
select author, date
from postings
where date > '2000-11-11'
order by date
Each line that adds detail to what should be selected is called a "clause." The example above has from, where, and order by clauses. By default order is increasing.
We can compute summary information about result sets:
select author, count(*)
from postings where date > '2000-11-11'
group by author