CSE133 - Spring 2001
Machine Problem 2
Creating an Inverted Index

Richard K. Belew

 

Due: 22 April 2001 (Sunday, 11:59PM)


 In this assignment you will take the first step towards creating your own search engine, ``inverting'' all documents in the corpus. Inverting the corpus means that for each keyword you store a list of documents that contain that keyword. In a later assignment, when you create a system to parse queries and retrieve documents ranked by relevance from the corpus, the inverted index you create here will be vital. The inverted index allows you to quickly determine which documents contain the keywords given by user queries.

1. Preliminaries

1.
You are to work on this assignment in groups of three, with the groups assigned randomly. Work should be turned in (via the bundle script) under the User ID of the first (i.e., alphabetically first) member of the group.
2.
The data corpus to be used for this (and all subsequent MP's) will be the ``Artificial Intelligence Thesis'' (AIT) corpus. This data is spread across three files, ait1.xml, ait2.xml, and ait3.xml, all of which can be found in the class data directory, ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/data/. (As long as you are working on UCSD-ACS machines, DO NOT MAKE COPIES OF THESE DATAFILES; simply create symbolic links to them.) These files are slightly modified from the files distributed on the CD so that they will be easier to use with a standard XML parser. We will also create some smaller files with the same format for grading purposes.
3.
We recommend that you complete the stages of this assignment in the order listed. This will allow you to receive partial credit if you accomplish some parts but not everything. Note that this does not imply your program should make multiple passes; use simple method ``stubs'' in your first versions which you later flesh out as you complete subsequent stages.
4.
Your assignment will be graded according to a number of criteria, including:
correctness 
Does your program produce the output it should? Is it robust in the face of non-standard input?
efficiency 
Does your solution use appropriate algorithms and data structures that allow it to run as quickly, and in as little memory, as possible?
documentation
Is your code easy to understand? Have you clearly delineated what code you developed yourself, what other code you may have reused and its source? In particular, all functions implemented by you and your partners must make use of the JavaDoc format for commenting. Refer to http://java.sun.com/products/jdk/javadoc/writingdoccomments/ for detailed information on how to comment using the JavaDoc standard.
5.
We have provided several libraries for your use (see below). This Java source is provided in the directory ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/src/mp2/, and the corresponding compiled classes are at ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/classes/mp2/. While testing, you will need to point your CLASSPATH to the right files.

CLASSPATH is an environment variable used by Java to find .class files at runtime. Your CLASSPATH environment variable should be set to something like
setenv CLASSPATH .:/home/solaris/ieng9/cs133s/public/classes
(It is usually a good idea to have the current directory in your classpath.)
 

2. Inter-document parsing (40%)

2.0 XML Parser (30%)

The first step to this assignment is to understand how to parse the AIT corpus. Parsing the corpus means moving through the corpus text identifying document and field breaks, and then extracting textual fields for further processing.

In the case of the AIT corpus, we will be treating each thesis as a ``document.'' Here are a couple of sample AIT records:

<THESIS>
<NUMBER> 1 </NUMBER>
<ORDER> AAD91-05465 </ORDER>
<TITLE> A UNIFIED APPROACH TO ANALOGICAL REASONING </TITLE>
<AUTHOR> SHINN, HONG SHIK </AUTHOR>
<YEAR> 1989 </YEAR>
<INSTITUTION> GEORGIA INSTITUTE OF TECHNOLOGY (0078) </INSTITUTION>
<DESCRIPTORS> COMPUTER SCIENCE; ARTIFICIAL INTELLIGENCE </DESCRIPTORS>
<ADVISER> JANET L. KOLODNER </ADVISER>
<CLASSIFICATIONS> MACHINE LEARNING </CLASSIFICATIONS>
<ABSTRACT>
Experiential reasoning is the most basic form of intelligent activity,
consequently, in artificial intelligence, numerous computational models of
analogical or case-based reasoning have been attempted. ...
</ABSTRACT>
</THESIS>
<THESIS>
<NUMBER> 2 </NUMBER>
 ...
For this assignment we will only be indexing the TITLE and ABSTRACT fields. You are to write a parser that walks through all of the records found in the data files, identifies each document, and gathers the information for two files used in subsequent processing. You should not actually output these files yet: you will need them for the next assignment, but for now we will verify that you are generating their contents correctly by testing functions within your code.

You should not write a parser from scratch; make use of existing technology, specifically the JAXP interface for XML parsing in Java.

The code provided on the FOA CD uses a different parsing method. Part of this assignment is to get you familiar with parsing XML documents using a standard XML parser. There is a parser specification called SAX (Simple API for XML) that you will use to parse the AIT files; Sun's Javadoc documentation for it is available online. Here is a small example of how to use the SAX interface to parse an XML file.

        import java.io.File;
        import javax.xml.parsers.SAXParser;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.HandlerBase;
        import org.xml.sax.InputSource;

        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser sp = spf.newSAXParser();
        HandlerBase hb = new AITHandlerBase();  // my implementation that extends HandlerBase
        File file = new File("filename.xml");
        InputSource is = new InputSource("file:" + file.getAbsolutePath());
        sp.parse(is, hb);
Notice that by extending HandlerBase you do not need to implement all of the methods, just the methods that you want to override. Note: make sure to call your extended class AITHandlerBase.
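To make this concrete, here is a minimal sketch of what such a handler might look like. It is only an illustration under our own assumptions: the fields and the processField helper are invented, and a real handler will also need to pick up the NUMBER field, track sentence numbers, and so on.

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

public class AITHandlerBase extends HandlerBase {
    private StringBuffer text = new StringBuffer(); // accumulated character data
    private boolean inField = false;                // inside a TITLE or ABSTRACT?

    public void startElement(String name, AttributeList attrs) {
        if (name.equals("TITLE") || name.equals("ABSTRACT")) {
            inField = true;
            text.setLength(0);
        }
    }

    // The parser may deliver one element's text in several calls, so append.
    public void characters(char[] ch, int start, int length) {
        if (inField) text.append(ch, start, length);
    }

    public void endElement(String name) {
        if (name.equals("TITLE") || name.equals("ABSTRACT")) {
            inField = false;
            processField(text.toString());
        }
    }

    private void processField(String field) {
        // Sections 3 and 4: identify keywords and record postings here.
    }
}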

2.1 files.data (5%)

Do not output this file. You should call the method fileInfo from the FileInfo interface once for each file. This interface can be found on ieng9 at public/src/mp2/FileInfo.java. We will also provide a dummy implementation of this interface in the same location. You should create an instance of this class and call it as soon as you have the needed information for a file.

import mp2.*;
...
// call only once
FileInfo fi = new FileInfoImpl();
// call it every time you get enough information for a file
fi.fileInfo(...);

This information will be necessary when you actually retrieve documents in response to queries, but we do not want you to output the files.data file in the code that you turn in. The file number should start at 0. See the documentation of FileInfo.java for more information.

2.2 documents.data (5%)

Do not output this file. Again, this info will be necessary later, but for now you just need to call the method docInfo in the DocInfo interface for each document that you encounter. (Reminder: Each thesis is considered one document, including the title and the abstract.)

import mp2.*;
...
// call it only once
DocInfo di = new DocInfoImpl();
// call it every time you get enough information for a document
di.docInfo(...);

See the documentation of DocInfo.java for more information. Notice that the document number is given inside the XML files (the NUMBER field).

3. Identifying keywords (20%)

Having identified the textual passages of interest within each document, the next step is to identify potential keywords within them. You will also need to count the number of sentences in each thesis (document).
 
 

3.1 Trimming of punctuation (5%)

When detecting keywords you should assume that each piece of whitespace or punctuation separates tokens (i.e. the stuff inside these*parenthesis counts as,10 tokens). All keywords should consist of alphabetic characters only (so the 10 is a token but not a keyword). Some routines have been provided on the FOA CD that remove trailing and leading punctuation marks from tokens; they do not account for punctuation or whitespace in the middle of a token. Punctuation must be removed before checking for noise words and calling the stemmer routines.
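For concreteness, here is one way the tokenization rule above could be implemented. This is only a sketch under our own assumptions (the class and method names are invented, and lowercasing keywords is our choice), not a substitute for the FOA CD routines:

import java.util.ArrayList;
import java.util.List;

public class SimpleTokenizer {
    // Any character that is not a letter or digit separates tokens;
    // only purely alphabetic tokens are kept as keyword candidates.
    public static List tokenize(String text) {
        List keywords = new ArrayList();
        StringBuffer token = new StringBuffer();
        for (int i = 0; i <= text.length(); i++) {
            // Treat the end of the string as one final separator.
            char c = (i < text.length()) ? text.charAt(i) : ' ';
            if (Character.isLetterOrDigit(c)) {
                token.append(c);
            } else if (token.length() > 0) {
                String t = token.toString();
                if (isAlphabetic(t)) keywords.add(t.toLowerCase());
                token.setLength(0);
            }
        }
        return keywords;
    }

    private static boolean isAlphabetic(String s) {
        for (int i = 0; i < s.length(); i++)
            if (!Character.isLetter(s.charAt(i))) return false;
        return true;
    }
}

Running tokenize on ``the stuff inside these*parenthesis counts as,10 tokens'' separates the 10 out as a token but discards it as a keyword, since it is not alphabetic.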

3.2 Removing Noise Words (2.5%)

The second step is to remove very common ``noise words'' from the set of keywords you obtained by parsing the corpus. The technique we will use is to work from an extensional list of these words, sometimes called a ``negative dictionary.'' The list we will use is in /home/solaris/ieng9/cs133s/public/data/stop.wrd.

A routine is provided to do this:

private static boolean stop_Word(String word)
which returns true if the token passed to it is in the negative dictionary, and false if it is not.

Make sure that you read the documentation on that routine (i.e. what to do before you call it the first time).

3.3 Stemming (2.5%) 

The third step is to perform a morphological process of normalizing each keyword down to its essential root. This process is known as ``stemming.'' (cf. FOA§2.3.1).

Again, to allow you to focus your energies on other questions, for this assignment you have been provided with a Porter-style stemmer found in ieng9.ucsd.edu:/cs133s/public/src/foa/mp2/PorterStemmer.java. The class PorterStemmer has a single public static method:

public static String stem(String token)
Each non-noiseword token should have stemming performed on it.
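Putting Sections 3.1 through 3.3 together, the per-token pipeline looks roughly like this sketch, where trimPunctuation stands in for the FOA CD trimming routines (the name is ours) and we assume stop_Word is accessible from your class:

// For each raw token extracted from a TITLE or ABSTRACT:
String token = trimPunctuation(raw);              // Section 3.1 (hypothetical helper)
if (token.length() > 0 && !stop_Word(token)) {    // Section 3.2
    String term = PorterStemmer.stem(token);      // Section 3.3
    // record a posting for `term' (Section 4)
}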
 

3.4 Sentences (10%)

The heuristic that you will use to count sentences is the following: a sentence is one or more keywords followed by a period (.), a question mark (?), or an exclamation point (!), followed by some whitespace and a capital letter. All AIT titles count as one sentence even if there is no punctuation, and all abstracts end with a sentence even if there is no capital letter following the final end-of-sentence marker (.?!). Be aware that you need to record the sentence number for later use. Sentence numbers are local to each document and start at 0.
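A rough sketch of this heuristic appears below. It only counts boundaries; in your program you will instead want to advance a current-sentence counter while tokenizing. The method name is our own:

public static int countSentences(String text) {
    int sentences = 0;
    boolean sawKeyword = false;  // any word since the last boundary?
    for (int i = 0; i < text.length(); i++) {
        char c = text.charAt(i);
        if (Character.isLetter(c)) sawKeyword = true;
        if (sawKeyword && (c == '.' || c == '?' || c == '!')) {
            // A boundary requires whitespace and then a capital letter.
            int j = i + 1;
            while (j < text.length() && Character.isWhitespace(text.charAt(j))) j++;
            if (j > i + 1 && j < text.length() && Character.isUpperCase(text.charAt(j))) {
                sentences++;
                sawKeyword = false;
            }
        }
    }
    if (sawKeyword) sentences++;  // titles and final abstract sentences still count
    return sentences;
}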
 
 

4. Inverting the corpus (30%)

Having identified the tokens/keywords of interest, we are now in a position to collect the fundamental ``posting'' data: which tokens occur in each document. ``Inverting'' then refers to the process of collecting the information that lets us know which documents contain occurrences of these keywords.

This is the most memory-intensive aspect of the assignment, as you must read the entire corpus and maintain all posting information. Once all documents are read, you are to output the inverted index file invertedIndex.data.

You should output a file called invertedIndex.data. It will be needed in the next assignment, but we will be testing it based on the interface InvertedIndex.

We are providing an output module called OutputInvertedIndex.class that will assist in the formatting of the inverted index file. You need to create an instance of OutputInvertedIndex, and then use the function

public void outputNextLine(String keyword, int docFreq, int totFreq, Vector v)

to produce invertedIndex.data, which should look like the example below:
;;keyword  docfreq  totfreq freq1
aa 3 10 7 -1 1635 : 1 4 4 30 31 31 56 ; -2 2 -1 58 : 10 24 ; -2 1 -1 2004 : 20 ; -2
aashto 3 5 2 -1 953 : 1 7 ; 3966 : 2 2 ; -2 1 -1 3287 : 10 ; -2

The first line reads as follows: aa appeared in 3 documents, 10 times in total. In document 1635 it appears 7 times, in sentences 1, 4 (twice), 30, 31 (twice), and 56. In document 58 it appears twice, in sentences 10 and 24. In document 2004 it appears once, in sentence 20. Try to explain the second line yourself.

NOTE: in the invertedIndex.data, the keywords have to be sorted. Also, for each frequency, the document numbers have to be sorted, and for each document, the sentence numbers have to be sorted.

docfreq and totfreq are the number of documents containing the keyword and the total number of times it occurs across all of them, respectively. Following these data is a series of frequencies, each with a list of the document IDs and sentence IDs in which the keyword occurred with that frequency, ordered by decreasing frequency. (The document lists are delimited by two negative numbers, -1 and -2, that you can think of as open- and close-parentheses, respectively.)

See the documentation of DocumentFreqVector and SentenceVector for more information.

Producing this file means you will have to keep track of, for each keyword: the documents it occurs in, the number of times it occurs in each of those documents, and the sentence numbers of each occurrence.

Obviously you may not have to track each one of these values explicitly. Some of the values are dependent on other values and can be arrived at via simple arithmetic.
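One possible in-memory arrangement (our assumption, not a required design) keeps, for each term, a sorted map from document number to the list of sentence numbers; the document frequency, total frequency, and per-document frequencies all fall out of this single structure:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class Postings {
    // term (String) -> TreeMap of document number (Integer) -> sentence
    // numbers (List of Integer). TreeMaps keep their keys sorted, which
    // matches the output ordering requirements above.
    private TreeMap index = new TreeMap();

    public void add(String term, int doc, int sentence) {
        TreeMap docs = (TreeMap) index.get(term);
        if (docs == null) { docs = new TreeMap(); index.put(term, docs); }
        Integer docKey = new Integer(doc);
        List sentences = (List) docs.get(docKey);
        if (sentences == null) { sentences = new ArrayList(); docs.put(docKey, sentences); }
        sentences.add(new Integer(sentence)); // in order, if parsing is sequential
    }

    // docFreq is docs.size(); the frequency within one document is its
    // sentence list's size; totFreq is the sum of those sizes.
}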

5. Morphologically conflated keywords

Another useful form of analysis concerns the various tokens that are ``conflated'' (combined) into the same keyword element. Termdocs.data should contain a listing of all keywords in order of decreasing total frequency. (Terms with the same frequency should be listed alphabetically.) After each term, give a list of the word tokens that stem to this term.

You will need to output this file. It will be tested through the InvertedIndex interface as well.

A Java class file, TermDocsOutput, has been provided to help you format your output of this data file. To make use of this class, create an instance of it and call the function

/** @param term            The next index term
 *  @param preStemmedTerms An array of all word tokens that stem to `term'
 */
public void outputNextLine(String term, String[] preStemmedTerms)
The output from this class will look like this:
common -1 commonality commonly -2
special -1 specially specialty -2
stem -1 stems -2
word -1  -2
(The lists are again delimited by two negative numbers, -1 and -2, that you can think of as open- and close-parentheses, respectively.)

The keywords (terms) have to be sorted by decreasing total frequency, and the token list for each term has to be sorted alphabetically.
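The term ordering can be expressed as a single comparator, as in this sketch (TermEntry is a hypothetical holder class, not part of the provided code):

import java.util.Arrays;
import java.util.Comparator;

class TermEntry {  // hypothetical: one index term and its total frequency
    String term;
    int totFreq;
    TermEntry(String term, int totFreq) { this.term = term; this.totFreq = totFreq; }
}

public class TermDocsOrder {
    public static void sort(TermEntry[] entries) {
        Arrays.sort(entries, new Comparator() {
            public int compare(Object a, Object b) {
                TermEntry x = (TermEntry) a, y = (TermEntry) b;
                if (x.totFreq != y.totFreq) return y.totFreq - x.totFreq; // decreasing frequency
                return x.term.compareTo(y.term);                          // ties: alphabetical
            }
        });
    }
}

For the pre-stemmed token lists themselves, keeping a sorted set (e.g. java.util.TreeSet) per term makes the alphabetical ordering automatic.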

Note that Termdocs.data is never going to be used by the search engine we are building to actually answer queries. Rather, Termdocs.data is an analytic tool: examining this file helps the search engine designer see the effects of the morphological algorithms being used. As mentioned above, stemming is never a perfect process; perhaps you can find occurrences where this process could be improved.

6. What to turn in?

Create a directory with the following files in it:
CreateInvertedIndex.java
This needs to be the class that implements the interface InvertedIndex. Don't forget to call the methods in the interfaces that the TAs provide inside your code; that is how part of your grade will be determined. When executed, CreateInvertedIndex should generate invertedIndex.data and termDocs.data.
AITHandlerBase.java
Your extension to the HandlerBase class.
*.java
Any other Java classes used by your code
README.TXT
A brief summary of your work. Be sure to mention each group member's contribution, as well as any other notes you would like to pass along about the compilation, execution, or output of your program.
FIRST, use tar to combine any and all Java source files into a file called mp2-src.tar; this must include CreateInvertedIndex.java and AITHandlerBase.java. Then use the bundleMP2 script to turn in your files.

7. Grading policy

The final grade will be based on correctness, efficiency, documentation, and your contribution to your team.

  • 1 XML parsing: 30 points
  • 2 files information: 5 points
  • 3 documents information: 5 points
  • 4 Identifying keywords: 20 points
  • 5 Creating InvertedIndex: 30 points
  • 6 Creating TermDocs: 10 points

    You will get an efficiency bonus or penalty ranging from 0.8 to 1.2 (your program will be considered very inefficient if it cannot create a nearly correct InvertedIndex and Termdocs). You will also get a participation rate ranging from 0 to 1 and a documentation rate ranging from 0.8 to 1.

    The final score will be the sum of items 1 through 6, multiplied by the three rates above.


    Last updated 4/9/2001 by rik@cs.ucsd.edu