The primary code and data files contained on the FOA CD-ROM are also available, in compressed form, via anonymous FTP at ftp://ftp.cs.ucsd.edu/pub/rik/foa/CD/. These include:
-rw-r-xr-x 1 rik fac   18062 Jul 20 14:37 index.html
-rw-r--r-- 1 rik fac 4994098 Sep 28 12:26 foa-data-ait.tar.gz
-rw-r--r-- 1 rik fac  110844 Sep 28 12:30 foa-data-aigen.tar.gz
-rw-r--r-- 1 rik fac  384038 Sep 28 12:25 foa-code.tar.gz
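For example, assuming wget and GNU tar are available on your machine (the archive names are taken from the listing above), the code archive can be fetched and unpacked with:

% wget ftp://ftp.cs.ucsd.edu/pub/rik/foa/CD/foa-code.tar.gz
% tar -xzf foa-code.tar.gz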
This file provides a brief overview of the contents of the CD-ROM accompanying the Finding Out About (FOA) textbook. Much of the CD consists of HTML versions of FOA and Keith van Rijsbergen's classic Information Retrieval book. The rest of the CD contains program source code and data sets designed to allow assignments and experiments related to topics in FOA.
Most of this code and data was developed specifically for use with classes or research related to FOA; the rest of this document describes this data and code in greater detail. However, because text classification (cf. FOA Section 7.4) is emerging as an important area of research (and because there was room on the CD :-), the Reuters classification data, developed by David Lewis, and the RAINBOW classification software, developed by Andrew McCallum, have also been included with the permission of their developers.
The top level directory of the CD should look something like:
drwxrwx--- 2 1383 ai  1024 Jun 26 17:48 FOA/
drwxrwx--- 2 1383 ai  1024 Jun 26 17:48 IR/
drwxrwx--- 5 1383 ai  1024 Jun  8 13:04 code/
drwxrwx--- 6 1383 ai  1024 Jun 26 15:22 data/
-rw-rw---- 1 rik  ai 18062 Jun 14 11:30 index.html
drwxrwx--- 2 1383 ai  1024 Jun 26 16:09 others/
FOA and IR contain the FOA and Information Retrieval texts, respectively. others contains the description of the TACHIR tool (developed by Massimo Melucci, Maristella Agosti, and Fabio Crestani, and used to translate van Rijsbergen's Information Retrieval text into HTML), the Reuters data, and the RAINBOW code. The following two sections describe the data directory, containing the AIT, EB5, AIGenealogy, and related data sets, and the code directory, containing FOA program code resources.
The data directory contains the following subdirectories:
drwxrwx--- 4 rik  ai 1024 Jun 26 16:39 EB5/
drwxrwx--- 2 1383 ai 1024 Jun  8 13:05 aigen/
drwxrwx--- 2 1383 ai 1024 Jun 26 17:59 ait/
drwxrwx--- 2 1383 ai 1024 Jun 26 16:41 ancil/
The "Artificial Intelligence Thesis" (AIT) corpus consists of approximately 5,000 Ph.D. and Master's dissertation abstracts written on the topic of artificial intelligence. Virtually every dissertation published within the last 30 years has been microfilmed by the UMI® Dissertation Services program run by Bell & Howell Information and Learning Company. Copies of these very important documents can be obtained from UMI® Dissertation Services at nominal cost. See FOA Section 2.4 for further details of this data.
ait[123].xml | The primary AI Thesis (AIT) corpus files, in XML format. |
docLength.txt | The length of each of the AIT documents. |
documents.d | A list of all the documents, with information about each one. |
files.d | A list of all the files used to make the inverted index. |
stop.wrd | The negative dictionary. |
invertedIndex.txt | The output from foa.mp2.CreateInvertedIndex. It is also the input to foa.mp3.SearchEngine. |
query.d | A series of eight "short" and two "long" queries (cf. FOA Section 3.6 for details of this distinction) used for testing. |
results.txt | The results from running the queries against the AIT corpus using the FOA search engine. |
other-result.txt | The document rankings for each query using a different search engine, for comparison. Each line contains a <docID, score> pair. |
rel.d | The relevance score of the document for each query. These have been computed using relevance assessment data collected from a large number of students using the RAVE tool (cf. FOA Section 4.4). Each line has a "queryNum," "docID" and "relevance score," where relevance score = (1 | 2), corresponding to "permissive" vs. strict relevance. (See FOA Section 4.4.4 for the definitions of these predicates.) |
rel-full.d | More complete data on how the relevance scores were created. Each line is of the form: 1 1 1830 2 [2, 3, 2, 2, 2, 0, 0, 0] where each line has "queryNum," "rank," "docID," "relevance score" and then a square-bracketed list of individuals' relevance assessments. The queryNum, docID and relevance score fields are consistent with the rel.d file, and rank is the relevance ranking of the document. Individual relevance assessments are encoded as 0=Not relevant (the default), 1=Possibly relevant, 2=Relevant, 3=Critically relevant. |
More detailed EB5 documentation, taken from Filippo Menczer's dissertation [UCSD, CSE Dept, 1998], is also provided, but there are three features of the EB5 corpus that are especially important to know immediately. First, the texts have all had stop words removed and (standard Porter) stemming applied as a condition of their public release. Second, these articles include hypertext links connecting them to one another, to the Propaedia classification taxonomy, and to the EB's Index. Finally, EB5 makes extensive use of directory structure to break sets of articles up by letter range. The resulting directories are consequently very deep, so any commands that recursively visit all directories (e.g., ls -lR) should be issued with caution.
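As a minimal precaution, you might list only the top levels of the tree first; for example, with a depth-limited find (GNU find syntax is assumed here):

% find /cdrom/data/EB5 -maxdepth 2 -type d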
The basic structure and most central features of these routines are discussed in FOA Chapter 2. An overview diagram of how the basic routines interact is also provided.
These modules were designed to support a series of programming assignments, called "machine problems" (MPs), associated with the course for which the FOA text was developed. This decomposition has the advantage of breaking the larger software development effort into a series of assignments students have consistently been able to accomplish within a single academic (10 week) quarter. In several places, references are made to intermediate files which, while not strictly necessary for an operational search engine, provide "checkpoint" results that allow intermediate work to be analyzed and graded.
There remain a few minor variations between the C and Java versions of the code, and these require slightly different foarc files. To run the match on the AIT corpus with the output from the Java version (the invertedIndex included on the CD), use the foarc-ait file. To run the match on the EB5 data, use the foa-eb5.match file. The match program does not require any command-line parameters.
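As a sketch of what such a run might look like (the location of foarc-ait is an assumption here; the /cdrom/etc directory holds the other example foarc files, and the C globals are read from a file named foarc):

% cp /cdrom/etc/foarc-ait ./foarc
% match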
The makefile provided will compile all of the files needed to run all of the parts of the search engine. Each part of the search engine should be run separately.
To run each part of the search engine you need to have your CLASSPATH environment variable set correctly. You can set it manually on the command line with the -cp option. Included in your classpath should be the data directory on this CD, or a copy of that directory. To run MP2 you also need to include two JAR files, Splay.jar and java_cup.jar. Note that these are to be found in the /cdrom/code/lib directory on this CD.
To run MP2 from the java directory type:
java -cp classes:../../data:../../lib/Splay.jar:../../lib/java_cup.jar foa.mp2.CreateInvertedIndex
MP2 creates four files: invertedIndex.data, termdocs.data, files.data and documents.data. The invertedIndex file is used by MP3. The termdocs.data file is only for analyzing what actually happened; it is not used again in the MP assignments. files.data is a list of the files that were parsed, and documents.data contains all of the documents that were encountered. This program will take about 1-5 minutes to run (depending on the speed of your computer): it is processing about 10 MB worth of AI theses and creating an inverted index from them.
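As a quick sanity check afterwards (this assumes, as in the command above, that the output lands in the data directory on your classpath; your copy may place it elsewhere), you can confirm the four files exist and are non-empty:

% ls -l ../../data/*.data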
To run MP3 from the java directory type:
java -cp classes:../../data foa.mp3.SearchEngine
MP3 acts as a search engine in a limited form. It reads in queries from a single file, along with the invertedIndex file created in MP2. It then processes the queries, ranks the documents by relevance for each query, and outputs the results.data file. That file contains a list of document IDs and relevance scores for each query.
To run MP4 from the java directory type:
java -cp classes:../../data foa.mp4.Evaluate
MP4 takes in the results.data file from MP3 and evaluates the rankings against user relevance feedback that has been collected previously. It outputs the results in a form that can be read in directly by GNUplot (available free on most platforms); src/plot_script.gp provides a simple GNUplot script example.
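Assuming GNUplot is installed and the script's file names match where your MP4 output landed, the plot can then be produced with:

% gnuplot src/plot_script.gp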
The basic idea, for running any part of the search engine, is that your classpath needs to include the class files, the data files, and any additional libraries you use.
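For example, rather than repeating the -cp option on every invocation, you might set the variable once per session (Bourne-shell syntax; the paths are the same as in the MP2 example above):

% CLASSPATH=classes:../../data:../../lib/Splay.jar:../../lib/java_cup.jar
% export CLASSPATH
% java foa.mp2.CreateInvertedIndex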
For more information about the individual packages and Java classes see the JavaDoc documentation.
This directory contains the files needed to run the C version of the Search Engine. Note that they depend on libraries found in the /cdrom/lib directory.
These are the files included in this directory:
dumbQ.[ch] | Implements a simple queue. |
foa.[ch] | Contains global variables that are set by a file called 'foarc'. |
foa_structs.h | Contains data structures that are used across multiple files. |
foa_utils.[ch] | Contains some methods that are used by multiple modules. |
makefile | Compiles all of the source files. It depends on the lib directory being in a certain spot relative to this directory. NOTE: the makefile puts the executables in the bin directory! |
match.c | Uses the inverted index and the queries to produce a hitlist of <document ID, score> pairs. |
parse_EB5.c | Parses the EB5 directory structure. |
parse_email.[ch] | A crude, kludgy, but fairly effective parser for e-mail. |
postdoc.c | Takes the docs.d and files.d files, and creates an inverted index file. |
x2foa.[ch] | Takes either e-mail or the EB5 data and creates the files.d and docs.d files needed for postdoc. |
x2_foa_utils.[ch] | Utilities used by the x2foa code. |
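Taken together, these programs form a pipeline from raw documents to ranked hitlists. A hypothetical end-to-end run over EB5 might look like the following (this sketch assumes postdoc and match, like x2foa, pick up their file paths from the foarc file; see the foarc notes below):

% x2foa ../data/EB5/BCD5    # writes files.d and docs.d
% postdoc                   # builds the inverted index from them
% match                     # ranks documents, emitting <docID, score> pairs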
This directory contains three separate resources. First, it contains the hash, splay, stem and stopper libraries used by the C routines. See FOA Section 2.5.3 for further details on these.
Second, the directory contains libfoa.a, a compiled library of the C routines. It was compiled using gcc on a Pentium-class machine running Debian Linux. This directory also contains the makefile that produced this binary, which should be easily modifiable for your platform.
Finally, it contains the JAR files that are used in the Java version of the FOA Search Engine (Splay.jar and java_cup.jar).
This directory contains the executables built from all of the files found in the /cdrom/src/C/ directory. NOTE: these binaries were compiled on a Pentium-class machine running the Debian Linux OS. If they do not run on your machine, you can compile all of the source files with the makefile that is provided in the /cdrom/src/C directory.
x2foa has been tested only on the EB5 corpus. It should work on your e-mail as well, but it might not.
To run it, make sure that the paths in the foarc file are correct; an example foarc file can be found at /cdrom/etc/foarc-eb5.x2foa.postdoc. This file is good for both x2foa and postdoc.
On the command line, give it the path to the EB5 data, something like:
% x2foa ../data/EB5/BCD5
Since x2foa has been tested only on EB5, the same is true of postdoc. You can use the same foarc file; again, make sure the paths in it are correct.
If you pipe the output from match to gen_hitlist, it will create a huge HTML table. Unless you have a super browser, you will need to chop the table up somehow, or it will kill your browser. (Some browsers don't like really big tables.)
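One way to keep the table out of your browser entirely is to redirect it to a file and inspect it with a pager first (the output filename here is arbitrary):

% match | gen_hitlist > eb5-hits.html
% less eb5-hits.html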
Last updated 1 July 00 by rik@cs.ucsd.edu