FOA CD ReadMe


The primary code and data files contained on the FOA CD-ROM are also available, in compressed form, via anonymous FTP at ftp://ftp.cs.ucsd.edu/pub/rik/foa/CD/. These include:

  -rw-r-xr-x   1 rik      fac         18062 Jul 20 14:37 index.html
  -rw-r--r--   1 rik      fac       4994098 Sep 28 12:26 foa-data-ait.tar.gz
  -rw-r--r--   1 rik      fac        110844 Sep 28 12:30 foa-data-aigen.tar.gz
  -rw-r--r--   1 rik      fac        384038 Sep 28 12:25 foa-code.tar.gz
        


Overview

This file provides a brief overview of the contents of the CD-ROM accompanying the Finding Out About (FOA) textbook. Much of the CD consists of HTML versions of FOA and Keith van Rijsbergen's classic Information Retrieval book. The rest of the CD contains program source code and data sets designed to allow assignments and experiments related to topics in FOA.

Most of this code and data was developed specifically for use with classes or research related to FOA; the rest of this document describes this data and code in greater detail. However, because text classification (cf. FOA Section 7.4) is emerging as an important area of research (and because there was room on the CD:), the Reuters classification data, developed by David Lewis, and the RAINBOW classification software, developed by Andrew McCallum, have also been included with the permission of their developers.

The top-level directory of the CD should look something like:

     drwxrwx---   2 1383     ai           1024 Jun 26 17:48 FOA/
     drwxrwx---   2 1383     ai           1024 Jun 26 17:48 IR/
     drwxrwx---   5 1383     ai           1024 Jun  8 13:04 code/
     drwxrwx---   6 1383     ai           1024 Jun 26 15:22 data/
     -rw-rw----   1 rik      ai          18062 Jun 14 11:30 index.html
     drwxrwx---   2 1383     ai           1024 Jun 26 16:09 others/
    

FOA and IR contain the FOA and Information Retrieval texts, respectively. others contains the description of the TACHIR tool (developed by Massimo Melucci, Maristella Agosti and Fabio Crestani, and used to translate van Rijsbergen's Information Retrieval text into HTML), together with the Reuters data and RAINBOW code. The following two sections describe the FOA data sets and the FOA source code.

FOA data sets

The data directory contains the following subdirectories:

     drwxrwx---   4 rik      ai           1024 Jun 26 16:39 EB5/
     drwxrwx---   2 1383     ai           1024 Jun  8 13:05 aigen/
     drwxrwx---   2 1383     ai           1024 Jun 26 17:59 ait/
     drwxrwx---   2 1383     ai           1024 Jun 26 16:41 ancil/
    

AIT corpus

The "Artificial Intelligence Thesis" (AIT) corpus consists of approximately 5,000 Ph.D. and Master's dissertation abstracts written on the topic of artificial intelligence. Virtually every dissertation published within the last 30 years has been microfilmed by the UMI® Dissertation Services program run by Bell & Howell Information and Learning Company. Copies of these very important documents can be obtained from UMI® Dissertation Services at nominal cost. See FOA Section 2.4 for further details of this data.

ait[123].xml The primary AI Thesis (AIT) corpus files, in XML format.
docLength.txt The length of each of the AIT documents.
documents.d A list of all the documents, with information about each one.
files.d A list of all the files used to make the inverted index.
stop.wrd The negative dictionary.
invertedIndex.txt The output from foa.mp2.CreateInvertedIndex. It is also the input to foa.mp3.SearchEngine.
query.d A series of eight "short" and two "long" queries used for testing (cf. FOA Section 3.6 for details of this distinction).
results.txt The results from running the queries against the AIT corpus using the FOA search engine.
other-result.txt The document rankings for each query using a different search engine, for comparison. Each line contains a <docID, score> pair.
rel.d The relevance score of each document for each query. These have been computed using relevance assessment data collected from a large number of students using the RAVE tool (cf. FOA Section 4.4). Each line has a "queryNum," "docID" and "relevance score," where relevance score = (1 | 2), corresponding to "permissive" vs. "strict" relevance. (See FOA Section 4.4.4 for the definitions of these predicates.)
rel-full.d More complete data on how the relevance scores were created. Each line is of the form:
    1	1	1830	2	[2, 3, 2, 2, 2, 0, 0, 0]
    
where each line has "queryNum," "rank," "docID," "relevance score" and then a square-bracketed list of individuals' relevance assessments. The queryNum, docID and relevance score fields are consistent with the rel.d file, and rank is the relevance ranking of the document. Individual relevance assessments are encoded as 0=Not relevant (the default), 1=Possibly relevant, 2=Relevant, 3=Critically relevant.
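
If you want to read rel-full.d programmatically, the following is a minimal parsing sketch, assuming whitespace-separated leading fields and a trailing square-bracketed list as in the example above; the class and method names are illustrative and not part of the FOA packages.

    import java.util.Arrays;

    // Illustrative parser for one rel-full.d line; not part of the FOA packages.
    public class RelFullLine {
        int queryNum, rank, docID, relScore;
        int[] assessments;  // individual judgments: 0=not .. 3=critically relevant

        static RelFullLine parse(String line) {
            RelFullLine r = new RelFullLine();
            int open = line.indexOf('[');  // the bracketed list starts here
            String[] head = line.substring(0, open).trim().split("\\s+");
            r.queryNum = Integer.parseInt(head[0]);
            r.rank     = Integer.parseInt(head[1]);
            r.docID    = Integer.parseInt(head[2]);
            r.relScore = Integer.parseInt(head[3]);
            String body = line.substring(open + 1, line.indexOf(']'));
            r.assessments = Arrays.stream(body.split(",\\s*"))
                                  .mapToInt(Integer::parseInt).toArray();
            return r;
        }
    }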

AI Genealogy

The AI Genealogy data builds on the AIT data set to include additional data provided by individuals over a number of years, towards the construction of an advisor/advisee "genealogy" connecting dissertations as parts of educational lineages; see FOA Section 6.4.1.

Ancillary data

This directory contains several data files that may be useful. ait[123].t contains slightly earlier, pre-XML versions of the basic AIT corpus. The other files contain data derived from AIT, collating the advisors, universities and classifications associated with the dissertations.

Encyclopedia Britannica (EB5)

The Encyclopedia Britannica has been one of the world's most admired texts for more than a century. We are very pleased that EB has agreed to release a portion of this corpus, EB5, which has been used for research in our lab for a number of years; see FOA Section 7.6 for an example of how the EB5 corpus was used to test InfoSpiders agents. EB5 corresponds to all encyclopedia entries classified under Section V of the EB Propaedia, corresponding to the topic of "Human Society."

More detailed EB5 documentation, taken from Filippo Menczer's dissertation [UCSD, CSE Dept., 1998], is also provided, but three features of the EB5 corpus are especially important to know immediately. First, the texts have all had stop words removed and (standard Porter) stemming applied as a condition of their public release. Second, these articles include hypertext links connecting them to one another, to the Propaedia classification taxonomy, and to the EB's Index. Finally, EB5 makes extensive use of directory structure to break sets of articles up by letter range. The resulting directories are consequently very deep, so any commands that recursively visit all directories (e.g., ls -lR) should be issued with caution.
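
Because the EB5 tree is so deep, a programmatic traversal can be friendlier than ls -lR. Here is a minimal Java sketch that simply counts the files below a directory named on the command line; the class name and the choice to count (rather than list) are illustrative only.

    import java.io.File;

    // Count the files under a directory tree without printing every entry.
    public class CountEB5 {
        static long count(File dir) {
            File[] entries = dir.listFiles();
            if (entries == null) return 0;  // unreadable directory
            long n = 0;
            for (File f : entries)
                n += f.isDirectory() ? count(f) : 1;
            return n;
        }

        public static void main(String[] args) {
            // e.g. java CountEB5 /cdrom/data/EB5
            System.out.println(count(new File(args[0])) + " files");
        }
    }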

FOA source code

The basic routines of the FOA search engine have been implemented twice, first using C and then using Java. We now consider the Java implementation our "reference" standard, but have tried to update the C version to be consistent.

The basic structure and most central features of these routines are discussed in FOA Chapter 2. An overview diagram of how the basic routines interact is also provided.

These modules were designed to support a series of programming assignments, called "machine problems" (MPs), associated with the course for which the FOA text was developed. This decomposition has the advantage of breaking the larger software development effort into a series of assignments students have consistently been able to accomplish within a single (10-week) academic quarter. In several places, references are made to intermediate files which, while not strictly necessary for an operational search engine, provide "checkpoint" results that allow analysis and grading of intermediate results.

There remain a few minor variations between the C and Java versions of the code, and these require slightly different foarc files. To run the match on the AIT corpus with the output from the Java version (the inverted index included on the CD), use the foarc-ait file. To run the match on the EB5 corpus, use the foa-eb5.match file. It does not require any parameters.

Java implementation

The makefile provided will compile all of the files needed to run all of the parts of the search engine. Each part of the search engine should be run separately.

To run each part of the search engine you need to have your CLASSPATH environment variable set correctly. You can set it manually on the command line with the -cp option. Included in your classpath should be the data directory on this CD, or a copy of that directory. To run MP2 you also need to include two JAR files, Splay.jar and java_cup.jar; these are found in the /cdrom/code/lib directory on this CD.

To run MP2 from the java directory type:

    java -cp classes:../../data:../../lib/Splay.jar:../../lib/java_cup.jar foa.mp2.CreateInvertedIndex
    

MP2 creates four files: invertedIndex.data, termdocs.data, files.data and documents.data. The invertedIndex file is used by MP3; the termdocs.data file is only for analyzing what actually happened and is not used again in the MP assignments. files.data is a list of the files that were parsed, and documents.data contains all of the documents that were encountered. This program will take about 1-5 minutes to run (depending on the speed of your computer); it is processing 10 MB worth of AI theses and creating an inverted index from them.
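
For a picture of what the inverted index holds, here is a minimal sketch of the underlying data structure, mapping each term to a postings list of <docID, term frequency> pairs. It illustrates the idea only; it is not the foa.mp2 implementation.

    import java.util.*;

    // Illustrative inverted index: each term maps to a postings list of
    // (docID, term frequency) pairs.  Not the foa.mp2 code.
    public class TinyInvertedIndex {
        private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

        void add(String term, int docID) {
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .merge(docID, 1, Integer::sum);  // bump tf for this doc
        }

        Map<Integer, Integer> lookup(String term) {
            return postings.getOrDefault(term, Collections.emptyMap());
        }
    }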

To run MP3 from the java directory type:

    java -cp classes:../../data foa.mp3.SearchEngine
    

MP3 acts as a search engine in a limited form. It reads queries from a single file, along with the invertedIndex file created in MP2. It then processes the queries, ranks the documents by relevance for each query, and outputs the results.data file. That file contains a list of document IDs and relevance scores for each query.
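
To make the ranking step concrete, here is an illustrative scoring sketch in the spirit of MP3: it accumulates a per-document score over the query's terms using a simple idf weight. The actual foa.mp3 weighting may differ; TinyInvertedIndex refers to the sketch shown above for MP2.

    import java.util.*;

    // Illustrative query scoring: accumulate a score per document over the
    // query terms, weighting each term by a simple idf.  The weighting
    // actually used by foa.mp3 may differ.
    public class TinyScorer {
        static Map<Integer, Double> score(List<String> queryTerms,
                                          TinyInvertedIndex index, int numDocs) {
            Map<Integer, Double> scores = new HashMap<>();
            for (String term : queryTerms) {
                Map<Integer, Integer> posting = index.lookup(term);
                if (posting.isEmpty()) continue;
                double idf = Math.log((double) numDocs / posting.size());
                for (Map.Entry<Integer, Integer> e : posting.entrySet())
                    scores.merge(e.getKey(), e.getValue() * idf, Double::sum);
            }
            return scores;  // sort by descending score for the ranked hitlist
        }
    }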

To run MP4 from the java directory type:

    java -cp classes:../../data foa.mp4.Evaluate
    

MP4 takes in the results.data from MP3 and evaluates those rankings against user relevance feedback (collected previously). It outputs the results in a form that can be read directly by GNUplot (available free on most platforms); src/plot_script.gp provides a simple GNUplot script example.
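
As a rough illustration of this evaluation step, the sketch below computes recall/precision points for one ranked hitlist against a set of relevant docIDs, printing two columns per line in a form GNUplot can plot directly; the measures MP4 actually computes may differ in detail.

    import java.util.*;

    // Illustrative recall/precision computation for one query: walk the
    // ranked hitlist and emit a (recall, precision) point at each rank
    // where a relevant document appears.
    public class TinyEvaluate {
        static void recallPrecision(List<Integer> ranked, Set<Integer> relevant) {
            int found = 0;
            for (int i = 0; i < ranked.size(); i++) {
                if (relevant.contains(ranked.get(i))) {
                    found++;
                    double recall    = (double) found / relevant.size();
                    double precision = (double) found / (i + 1);
                    System.out.printf("%.3f %.3f%n", recall, precision);
                }
            }
        }
    }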

The basic idea behind running any part of the search engine is the same: your classpath must include the class files, the data files, and any additional libraries you use.

For more information about the individual packages and Java classes see the JavaDoc documentation.

C implementation

This directory contains the files needed to run the C version of the Search Engine. Note that they depend on libraries found in the /cdrom/lib directory.

These are the files included in this directory:

dumbQ.[ch] implements a simple queue.
foa.[ch] contains global variables that are set by a file called 'foarc'.
foa_structs.h contains data structures that are used across multiple files.
foa_utils.[ch] contains some methods that are used by multiple modules.
makefile compiles all of the source files. It depends on the lib directory being in a certain spot relative to this directory. NOTE: the makefile puts the executables in the bin directory!
match.c uses the inverted index and the queries to produce a hitlist of <document ID, score> pairs.
parse_EB5.c parses the EB5 directory structure.
parse_email.[ch] A crude, kludgy, but fairly effective parser for e-mail.
postdoc.c takes the docs.d and file.d files, and creates an inverted index file.
x2foa.[ch] takes either e-mail or the EB5 corpus and creates the files.d and docs.d files needed for postdoc.
x2_foa_utils.[ch] Utilities used by the x2foa code.

Libraries

This directory contains three separate resources. First, it contains the hash, splay, stem and stopper libraries used by the C routines. See FOA Section 2.5.3 for further details on these.

Second, the directory contains libfoa.a, a compiled library of the C routines. It was compiled using gcc on a Pentium-class machine running Debian Linux. This directory also contains the makefile that produced this binary, which should be easily modifiable for your platform.

Finally, it contains the JAR files used by the Java version of the FOA Search Engine (Splay.jar and java_cup.jar).

Binaries

This directory contains the executables built from all of the files found in the /cdrom/src/C/ directory. NOTE: These binaries were compiled on a Pentium-class machine running the Debian Linux OS. If they do not run on your machine, you can compile all of the source files with the makefile provided in the /cdrom/src/C directory.

x2foa has been tested only on the EB5 corpus. It should work on your e-mail as well, but it might not.

To run it, make sure that the paths in the foarc file are correct; an example foarc file can be found at /cdrom/etc/foarc-eb5.x2foa.postdoc. This file is good for both x2foa and postdoc.

On the command line, give it the path to the EB5 data, something like:

       % x2foa ../data/EB5/BCD5
      

Since x2foa has been tested only on EB5, the same is true of postdoc. You can use the same foarc file; again, you need to make sure the paths are correct.

If you pipe the output from match to gen_hitlist it will create a huge HTML table. Unless you have a super browser you will need to chop the table up somehow or it will kill your browser. (Some browsers don't like really big tables.)
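
For example, assuming both executables are on your path and your foarc paths are set up (the output file name here is arbitrary):

       % match | gen_hitlist > hitlist.html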


Last updated 1 July 00 by rik@cs.ucsd.edu