CSE133 - Spring 2001
Machine Problem 2
Creating an Inverted Index
Due: 22 April 2001 (Sunday, 11:59PM)
In this assignment you will take the first step towards creating
your own search engine, ``inverting'' all documents in the corpus. Inverting
the corpus means that for each keyword you store a list of documents that
contain that keyword. In a later assignment, when you create a system to
parse queries and retrieve documents ranked by relevance from the corpus,
the inverted index you create here will be vital. The inverted index allows
you to quickly determine which documents contain the keywords given in a query.
You are to work on this assignment in groups of three, with the group assigned
to you randomly. Work should be turned in (via the bundle script) under the User
ID of the first (i.e., alphabetically first) member of the group.
The data corpus to be used for this (and all subsequent MP's) will be the
``Artificial Intelligence Thesis'' (AIT) corpus. This data is spread across
the three files ait1.xml, ait2.xml, and ait3.xml; all of these can be found
in the class data directory, ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/data/.
(As long as you are working on UCSD-ACS machines, DO NOT MAKE COPIES
OF THESE DATAFILES, simply create symbolic links to them.) These
files are slightly modified from the files distributed on the CD so that
they will be easier to use with a standard XML parser. We will
also create some smaller files with the same format for grading purposes.
We recommend that you complete the stages of this assignment in the order
listed. This will allow you to receive partial credit if you accomplish
some parts but not everything. Note that this does not imply your
program should make multiple passes; use simple method ``stubs'' in your
first versions, which you later flesh out as you complete subsequent stages.
Your assignment will be graded according to a number of criteria, including:
Does your program produce the output it should, and is it robust in the face
of non-standard input?
Does your solution use appropriate algorithms and data structures that
allow it to run as quickly, and in as little memory, as possible?
Is your code easy to understand? Have you clearly delineated what code
you developed yourself, what other code you may have reused and its source?
In particular, all functions implemented by you and your partners must
make use of the JavaDoc format for commenting. Refer to http://java.sun.com/products/jdk/javadoc/writingdoccomments/
for detailed information on how to comment using the JavaDoc standard.
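For example, a JavaDoc comment on a helper method might look like the following (the class and method here are hypothetical, shown only to illustrate the comment format):

```java
public class TokenUtils {
    /**
     * Lowercases a raw token so that keyword comparisons are
     * case-insensitive. (A hypothetical helper, shown only to
     * illustrate the JavaDoc comment format.)
     *
     * @param token the raw token extracted from the corpus
     * @return the lowercased form of the token
     */
    public static String normalize(String token) {
        return token.toLowerCase();
    }
}
```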
We have provided several libraries for your use (see below). This Java
source is provided in the directory ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/src/mp2/,
and the corresponding compiled classes are at ieng9.ucsd.edu:/home/solaris/ieng9/cs133s/public/classes/mp2/.
While testing, you will need to point your CLASSPATH to the right files.
CLASSPATH is an environment variable used by Java to find .class files
at runtime. Your CLASSPATH setting should be something like
setenv CLASSPATH .:/home/solaris/ieng9/cs133s/public/classes
(It is usually a good idea to have the current directory in your classpath.)
2. Inter-document parsing (40%)
2.0 XML Parser (30%)
The first step to this assignment is to understand how to parse the AIT
corpus. Parsing the corpus means moving through the corpus text identifying
document and field breaks, and then extracting textual fields for further processing.
In the case of the AIT corpus, we will be treating each thesis abstract
as a ``document.'' Here are a couple of sample AIT records:
<NUMBER> 1 </NUMBER>
<ORDER> AAD91-05465 </ORDER>
<TITLE> A UNIFIED APPROACH TO ANALOGICAL REASONING </TITLE>
<AUTHOR> SHINN, HONG SHIK </AUTHOR>
<YEAR> 1989 </YEAR>
<INSTITUTION> GEORGIA INSTITUTE OF TECHNOLOGY (0078) </INSTITUTION>
<DESCRIPTORS> COMPUTER SCIENCE; ARTIFICIAL INTELLIGENCE </DESCRIPTORS>
<ADVISER> JANET L. KOLODNER </ADVISER>
<CLASSIFICATIONS> MACHINE LEARNING </CLASSIFICATIONS>
Experiential reasoning is the most basic form of intelligent activity,
consequently, in artificial intelligence, numerous computational models of
analogical or case-based reasoning have been attempted. ...
<NUMBER> 2 </NUMBER>
For this assignment we will only be indexing TITLE and ABSTRACT
fields. You are to write a parser that walks through all of the records
found in the data files, identifies each document, and gathers the information
for two files which will be used in subsequent processing. You should not
actually output the files yet. You will need them for the next assignment, but
for now we will test that you are generating them correctly by testing functions
within your code.
You should not write a parser from scratch; you should make use of existing
technology, specifically the jaxp interface for XML parsing in Java.
The code provided on the FOA CD uses a different parsing method. Part
of this assignment is to get you familiar with parsing XML documents using
a standard XML parser. There is a parser specification called SAX (Simple
API for XML) that you will use to parse the AIT files. The javadoc documentation
from sun can be found here.
Here is a little example of how to use the SAX interface to parse an XML file:
SAXParserFactory spf = SAXParserFactory.newInstance();
SAXParser sp = spf.newSAXParser();
HandlerBase hb = new AITHandlerBase(); // my implementation that extends HandlerBase
File file = new File("filename.xml");
InputSource is = new InputSource("file:" + file.getAbsolutePath());
sp.parse(is, hb); // walk the file, firing callbacks on hb
Notice that by extending HandlerBase you do not need to implement all of
the methods, just the methods that you want to override. Note: make
sure to call your extended class AITHandlerBase.
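A minimal sketch of what AITHandlerBase might look like. The buffering strategy and the public lastField field are our assumptions for illustration, not part of the provided code; a real handler would pass each extracted field on to the indexing pipeline.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

public class AITHandlerBase extends HandlerBase {
    private final StringBuffer text = new StringBuffer();
    public String lastField; // exposed here only so the sketch is easy to exercise

    public void startElement(String name, AttributeList attrs) {
        text.setLength(0); // start collecting this element's character data
    }

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // SAX may deliver text in several chunks
    }

    public void endElement(String name) {
        if (name.equals("TITLE") || name.equals("ABSTRACT")) {
            lastField = text.toString().trim();
            // a real handler would hand lastField to the indexing code here
        }
    }
}
```

Because SAX is event-driven, the parser calls these methods as it walks the file; you never see the whole document at once, which keeps memory usage low on a corpus this size.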
2.1 files.data (5%)
Do not output this file. You should call the method fileInfo
from the FileInfo interface once for each file. This interface can
be found on ieng9 at public/src/mp2/FileInfo.java
We will also provide a dummy implementation of this interface that can be found
in the same location.
You should create an instance of this class and call it as soon as you have the
needed information for a file.
// create the instance only once
FileInfo fi = new FileInfoImpl();
// call fileInfo every time you have enough information for a file
This info will be necessary when you actually go to retrieve documents when queried, but we do not want you to output the files.data file in the code that you turn in. The file number should start from 0.
See the documentation of FileInfo.java for more information.
2.2 documents.data (5%)
Do not output this file. Again, this info will be necessary later,
but for now you just need to call the method docInfo in the DocInfo interface for each document that you encounter. (Reminder: Each thesis is considered one document, including the title and the abstract.)
// create the instance only once
DocInfo di = new DocInfoImpl();
// call docInfo every time you have enough information for a document
See the documentation of DocInfo.java for more information. Notice that the document number is inside the XML files.
3. Identifying keywords (20%)
Having identified the textual passages of interest within each document,
the next step is to identify potential keywords within these. You will
also need to count the number of sentences in each thesis (document).
3.1 Trimming of punctuation (5%)
When detecting keywords you should assume that each piece of whitespace
or punctuation separates tokens (i.e. the stuff inside these*parenthesis
counts as,10 tokens). All keywords should consist only of alphabetic characters
(so the 10 is a token, but not a keyword).
Some routines have been provided on the FOA CD that remove the trailing
and leading punctuation marks from tokens. They do not account for punctuation
or whitespace in the middle of a token.
Punctuation must be removed before
checking for noise words and calling stemmer routines.
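Under these rules, tokenization amounts to splitting on any run of non-alphanumeric characters and then keeping only purely alphabetic tokens as keyword candidates. A sketch (the Tokenizer class is our own; the provided FOA routines may differ in detail):

```java
import java.util.ArrayList;
import java.util.List;

public class Tokenizer {
    /** Splits text on whitespace/punctuation and keeps only purely
     *  alphabetic tokens as keyword candidates (so "10" is a token
     *  but never a keyword). */
    public static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        for (String t : text.split("[^A-Za-z0-9]+")) {
            if (t.length() > 0 && t.matches("[A-Za-z]+"))
                result.add(t.toLowerCase());
        }
        return result;
    }
}
```

For example, `tokens("counts as,10 tokens")` yields the three keywords counts, as, and tokens, dropping the non-alphabetic "10".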
3.2 Removing Noise Words (2.5%)
The second step is to remove very common ``noise words'' from the set of
keywords you obtained by parsing the corpus. The technique we will use
is to work from an extensional list of these words, sometimes called a
``negative dictionary.'' The list we will use is in
A routine is provided to do this:
private static boolean stop_Word(String word)
which returns TRUE if the token passed to it is in the negative
dictionary, and FALSE if the word is not.
Make sure that you read the documentation on that routine (i.e. what
to do before you call it the first time).
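The internals of the provided stop_Word routine aren't shown here, but the usual approach is a hash set loaded once from the negative dictionary, which is likely why something must be done before the first call. A sketch under that assumption (class, method names, and the hard-coded word list are ours, not the provided code):

```java
import java.util.HashSet;
import java.util.Set;

public class StopWords {
    private static Set<String> dict;

    /** Loads the negative dictionary; must be called once before any
     *  lookups. Hard-coded here for illustration; the real list would
     *  be read from the dictionary file. */
    public static void init() {
        dict = new HashSet<String>();
        dict.add("the"); dict.add("of"); dict.add("and"); dict.add("a");
    }

    /** Returns true if the token is in the negative dictionary. */
    public static boolean stopWord(String word) {
        return dict.contains(word.toLowerCase());
    }
}
```

A HashSet gives constant-time lookups, which matters since this check runs once per token across the whole corpus.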
3.3 Stemming (2.5%)
The third step is to perform a morphological process of normalizing each
keyword down to its essential root. This process is known as ``stemming.''
Again, to allow you to focus your energies on other questions, for this
assignment you have been provided with a Porter-style stemmer found in
The class PorterStemmer has a single public static method:
public static String stem(String token)
Each non-noiseword token should have stemming performed on it.
The heuristic that you will use to count sentences is the following: A
sentence is one or more keywords followed by a period (.), a question mark
(?), or an exclamation point (!), followed by some whitespace and a capital
letter. All AIT titles will count as one sentence even if there is no punctuation,
and all Abstracts will end with a sentence even if there is no capital
letter following the final end of sentence marker (.?!). Be aware that you need to record the sentence number for later use. Sentence numbers should be local to the document and start from 0.
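The heuristic above can be sketched as follows (a hypothetical SentenceCounter, not provided code): scan for an end-of-sentence marker, skip the whitespace after it, and count a boundary only if a capital letter follows; then add one for the final sentence, which needs no trailing capital.

```java
public class SentenceCounter {
    /** Counts sentences using the assignment's heuristic: a sentence
     *  ends at '.', '?' or '!' followed by whitespace and a capital
     *  letter; the text always closes its final sentence. */
    public static int count(String text) {
        text = text.trim();
        if (text.length() == 0) return 0;
        int sentences = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '?' && c != '!') continue;
            int j = i + 1;
            while (j < text.length() && Character.isWhitespace(text.charAt(j))) j++;
            // boundary only if some whitespace then a capital letter follows
            if (j > i + 1 && j < text.length() && Character.isUpperCase(text.charAt(j)))
                sentences++;
        }
        return sentences + 1; // final sentence needs no capital after it
    }
}
```

Note this correctly gives one sentence for a title with no punctuation at all, matching the rule for AIT titles.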
4. Inverting the corpus (30%)
Having identified the token/keywords of interest, we are now in a position
to collect the fundamental ``posting'' data: what tokens occur in each
document. ``Inverting'' then refers to the process of collecting information
that lets us know what documents contain occurrences of these keywords.
This is the most memory-intensive aspect of the assignment, as you must
read the entire corpus and maintain all posting information. Once all documents
are read, you are to output the inverted index file invertedIndex.data.
It will be needed in the next assignment, and we will also be testing it
through the interface InvertedIndex.
We are providing an output module called OutputInvertedIndex.class
that will assist in the formatting of the inverted index file. You need
to create an instance of OutputInvertedIndex, and then use the
public void outputNextLine(String keyword, int docFreq, int totFreq, Vector v)
function to produce invertedIndex.data, which should look like
the example below:
;;keyword docfreq totfreq freq1
aa 3 10 7 -1 1635 : 1 4 4 30 31 31 56 ; -2 2 -1 58 : 10 24 ; -2 1 -1 2004 : 20 ; -2
aashto 3 5 2 -1 953 : 1 7 ; 3966 : 2 2 ; -2 1 -1 3287 : 10 ; -2
The first line should be explained as:
aa appeared in 3 documents, 10 times in total. In document 1635, it appears 7 times, in sentences 1, 4 (twice), 30, 31 (twice), and 56. In document 58, it appears twice, in sentences 10 and 24. In document 2004, it appears once, in sentence 20.
Try to explain the second line yourself.
In invertedIndex.data, the keywords have to be sorted.
Also, for each frequency, the document numbers have to be sorted,
and for each document, the sentence numbers have to be sorted.
docfreq and totfreq are the number of documents containing
the keyword, and the total number of times it occurs across all of them,
resp. Following these data is a series of frequencies and lists of document IDs and sentence IDs in which the keyword occurred with that frequency, ordered by decreasing frequency. (The document lists are delimited by two negative numbers, -1 and -2, that you can think of as open- and close-parentheses, resp.)
You can find information on DocumentFreqVector and SentenceVector in their documentation.
Producing this file means you will have to keep track of:
the list of keywords;
the number of documents each keyword appears in;
the number of times each keyword appears in each document and its sentences;
the total number of times each keyword appears in the entire corpus.
Obviously you may not have to track each one of these values explicitly.
Some of the values are dependent on other values and can be arrived at
via some simple arithmetic.
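One plausible way to hold the posting data (a sketch; the Postings class and its methods are ours, not provided code) is a sorted map from keyword to a sorted map from document ID to that document's sentence list. The sorted maps give the required keyword and document ordering for free, and docfreq, totfreq, and per-document frequency all fall out by simple arithmetic rather than separate bookkeeping:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class Postings {
    // keyword -> (document id -> sentence ids for each occurrence)
    private final TreeMap<String, TreeMap<Integer, List<Integer>>> index =
            new TreeMap<String, TreeMap<Integer, List<Integer>>>();

    /** Record one occurrence of a keyword in a given document and sentence. */
    public void add(String keyword, int docId, int sentence) {
        TreeMap<Integer, List<Integer>> docs = index.get(keyword);
        if (docs == null) {
            docs = new TreeMap<Integer, List<Integer>>();
            index.put(keyword, docs);
        }
        List<Integer> sentences = docs.get(docId);
        if (sentences == null) {
            sentences = new ArrayList<Integer>();
            docs.put(docId, sentences);
        }
        sentences.add(sentence);
    }

    /** Number of documents containing the keyword (docfreq). */
    public int docFreq(String keyword) {
        TreeMap<Integer, List<Integer>> docs = index.get(keyword);
        return docs == null ? 0 : docs.size();
    }

    /** Total occurrences of the keyword across the corpus (totfreq). */
    public int totFreq(String keyword) {
        int total = 0;
        TreeMap<Integer, List<Integer>> docs = index.get(keyword);
        if (docs != null)
            for (List<Integer> s : docs.values()) total += s.size();
        return total;
    }
}
```

Since this structure must hold postings for the entire corpus at once, it is worth measuring its memory footprint during your time/space profiling.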
Another useful form of analysis concerns the various tokens that are
``conflated'' (combined) into the same keyword element. Termdocs.data
should contain a listing of all keywords in order of decreasing
total frequency. ( Terms with the same frequency should be listed alphabetically
within frequency.) After each term, give a list of the word tokens that
stem to this term.
You will need to output this file. It will be tested through the InvertedIndex interface as well.
A Java class file, TermDocsOutput, has been provided to help
you format your output of this data file. To make use of this class, create
an instance of it and call the function
/** @param term The next index term
 *  @param preStemmedTerms An array of all terms that stem to `term' */
public void outputNextLine(String term, String[] preStemmedTerms)
The output from this class will look like this:
common -1 commonality commonly -2
special -1 specially specialty -2
stem -1 stems -2
word -1 -2
(The lists are again delimited by two negative numbers, -1 and -2, that you
can think of as open- and close-parentheses, resp.)
The keywords (terms) have to be sorted by decreasing total frequency, and within each term's token list the tokens have to be sorted alphabetically.
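That ordering can be sketched with a comparator (a hypothetical helper, not the provided TermDocsOutput class): sort by total frequency descending, falling back to alphabetical order on ties.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class TermDocsOrder {
    /** Orders terms by decreasing total frequency, breaking ties
     *  alphabetically, as Termdocs.data requires. */
    public static List<String> order(final Map<String, Integer> totFreq) {
        List<String> terms = new ArrayList<String>(totFreq.keySet());
        Collections.sort(terms, new Comparator<String>() {
            public int compare(String a, String b) {
                int byFreq = totFreq.get(b) - totFreq.get(a); // descending frequency
                return byFreq != 0 ? byFreq : a.compareTo(b); // alphabetical ties
            }
        });
        return terms;
    }
}
```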
Note that Termdocs.data is never going to be used by the search
engine we are building to actually provide answers to queries. Rather,
Termdocs.data is an analytic tool. Examining this file helps the
search engine designer see the effects of the morphological algorithms
being used. As mentioned above, stemming is never a perfect process.
Perhaps you can find occurrences where this process could be improved.
6. What to turn in?
Create a directory with the following files in it:
FIRST, use tar
to combine any and all Java source files into a file called
this must include
CreateInvertedIndex.java and AITHandlerBase.java.
Then use bundleMP2.
This needs to be the class that implements the interface InvertedIndex.
Don't forget to call the methods in the interfaces that the TAs provide
inside of your code; that is how part of your grade will be
determined. When executed,
CreateInvertedIndex should generate invertedIndex.data,
Your extension to the HandlerBase class.
Any other Java classes used by your code
A brief summary of your work. Be sure to mention:
just how far you got;
what problems remain;
any external code libraries you exploited;
reports on any time/space profiling;
how you and your partners divided up the work;
as well as any other notes you would like to pass along about the
compilation, execution, or output of your program.
7. Grading policy
The final grade will be based on your correctness, efficiency, documentation and
your contribution to your team.
1 XML parsing: 30 points
2 files information: 5 points
3 documents information: 5 points
4 Identifying keywords: 20 points
5 Creating InvertedIndex: 30 points
6 Creating TermDocs: 10 points
You will get an efficiency bonus or penalty ranging from 0.8 to 1.2 (your program will be considered very inefficient if it cannot create a nearly correct InvertedIndex and Termdocs).
You will also get a participation rate ranging from 0 to 1 and a documentation rate ranging from 0.8 to 1.
The final score will be the sum of items 1 through 6, multiplied by the three rates above.
Last updated 4/9/2001 by firstname.lastname@example.org