Term Project for CSE 100:

Text Searches of Scripture

by Cynthia Bailey, clbailey@ucsd.edu    (class login: cs100wac)



Links:

Data Structure Implementation:  (diagram)
Specification:
Rationalization:

The reason I chose a hashtable is that I wanted something that was simple to use, but still very fast. Java's built-in utility Hashtable was perfect since it handled most of the mundane details automatically, but because the default collision resolution did not give me the control I needed over the data, I decided to do my own. A linked-list open hash was a natural solution since it is simple to implement and has variable length. Variable length was very important because there is a very large range in the numbers of words that will collide for a given three starting characters. (e.g., there is only one word that starts with "zar", but many words that start with "the" or "ste")

The BitSets were great because, again, they were such a natural fit and thus very simple to use. The problem if determining from a list of words, each which has a list of verse numbers, which words are in the same verse was quite a challenge, and without some tricky algorithm work could quickly become a gigantic use of time. Since the bitsets' intersection operation is a bitwise one, it is really fast, and also a perfect match for the problem. As mentioned above, the intersection also worked well for book restrictions as was the union for partial word searches. (ASIDE: A feature that could be added in the future would be to allow the user to disallow words in the verse and/or make a distinction between requiring a word be present and simply listing it as a possibility. Aside from some user interface busywork like parsing what the user wrote, the implementation of these would be fairly trivial using bitset operations.)

Summary of hashtable/bitset combination: ELEGANT! FAST!* (* and uses lots of space)

The array was picked because it has basically no overhead, is really fast, and I know the number of elements it needs beforehand since it is read out of the same file every time.

Limitations:

The hashtable/bitset implementation, although I still believe to be a very good choice, has some limitations:



(Semi-)Formal Space and Time Analysis: (more rationalization!)

As mentioned above, I concede that the space (memory) usage of my program suffered in many ways at the hands of time interests. Also, one of the main problems with my implementation was that the interface does not include a search by literal strings, which was because of the bitset's large space consumption. So, I would like to present a defense of this interface concession in terms of a formal anaylsis of the space and time costs of bitset of verses vs. list of word locations in an array:

SPACE:

TIME:

Lets say a user enters a single word for a search and does not give any restrictions on location. In this case, both bitset and array would take about the same time. This is because in both cases, the word will be looked up in the hashtable, and then a full enumeration of its verses will be given to the user, which will linear in both cases. However, if a user enters a whole list of 3, 5, or even 10 keywords, the situation changes dramatically. The following is a comparison of an array indexed by word (that would allow literal string searches), to a bitset addressed by verse (that does not) in the example of a search of multiple keywords: (i.e., continuting to show why I disallowed literal string searches in order to use the bitset)


Questions and comments:

Send questions and comments to clbailey@ucsd.edu (Or ask any time during the presentation)