Text Searches of Scripture
by Cynthia Bailey, clbailey@ucsd.edu (class login: cs100wac)
Specification:
Rationalization:The reason I chose a hashtable is that I wanted something that was simple to use, but still very fast. Java's built-in utility Hashtable was perfect since it handled most of the mundane details automatically, but because the default collision resolution did not give me the control I needed over the data, I decided to do my own. A linked-list open hash was a natural solution since it is simple to implement and has variable length. Variable length was very important because there is a very large range in the numbers of words that will collide for a given three starting characters. (e.g., there is only one word that starts with "zar", but many words that start with "the" or "ste")
The BitSets were great because, again, they were such a natural fit and thus very simple to use. The problem if determining from a list of words, each which has a list of verse numbers, which words are in the same verse was quite a challenge, and without some tricky algorithm work could quickly become a gigantic use of time. Since the bitsets' intersection operation is a bitwise one, it is really fast, and also a perfect match for the problem. As mentioned above, the intersection also worked well for book restrictions as was the union for partial word searches. (ASIDE: A feature that could be added in the future would be to allow the user to disallow words in the verse and/or make a distinction between requiring a word be present and simply listing it as a possibility. Aside from some user interface busywork like parsing what the user wrote, the implementation of these would be fairly trivial using bitset operations.)
Summary of hashtable/bitset combination: ELEGANT! FAST!* (* and uses lots of space)
The array was picked because it has basically no overhead, is really fast, and I know the number of elements it needs beforehand since it is read out of the same file every time.
Limitations:
The hashtable/bitset implementation, although I still believe to be a very good choice, has some limitations:
- Space considerations: The hashtable takes up a lot of space, period. The bitset actually saved space for many of the words, in comparison to having a list of the occurences in the form of, for example, an array or a Vector of integers representing the verse or word numbers where it ocurred. Such a representation would have been extremely costly for words like "the", which occured in almost all the verses and over 19,000 times as an individual word. However, as the Zipfian distribution predicts, words like "the" are the minority, and there are many words that occur only once or twice. So, in the aggregate, the bitset was also a waste of space in comparison to other possible implementations. (see complete analysis below)
- Literal string search: Since the bitset's smallest (and only) addressing resolution is by verse, searching for literal strings (the words must occur in the order specified) is not possible. The bitset could in theory be addressed by word instead of verse, but it already took up quite a bit of space in memory, and addressing by word would have pushed it over the edge into an impossibility (at least on ieng9). However, if the user enters the 3 or 4 words as if they were a literal string, it is unusual that there will be very many other verses that are returned, other than the ones they wanted, so they can simply ignore those few.
- Partial word searches must be by at least 3 characters, and they must be the first 3: Since the hashtable for sticky hash mode intentionally collides all words based on their first 3 letters, doing a partial word search by the first one or two characters is not supported. As was the case with literal string searches, this problem in theory all but disappears in practice. After all, how many people (for a legitimate purpose, i.e. not just trying to break my program) are ever going to want to search out all instances of words starting with "c" or "ch"? I decided on the number three because that would produce what I thought to be just the right number of collisions. There are lots of fundamentally and symantically unrelated words that start with the same one or two letters, but three will allow the flexibility, yet restrictiveness that is about right. Also, because I have to do a linear search on the words in the linked list, a search like "th" "st" could have taken a really long time. I wanted to minimize that, since smart users that entered something like "start*" would also suffer. Lastly, on the point that it must be the first three letters, this is because making a hash of all possible 3-letter substrings of a word would have been a huge waste of space. I say waste because, going back to the "normal" use consideration, a user will usually use a partial-word search when they just want to allow different endings for the same word, for example, allowing different tenses of the same verb. Since most English words have such morphological changes at the end, it would be more rare to want to search the end of a word and find variable beginnings.
As mentioned above, I concede that the space (memory) usage of my program suffered in many ways at the hands of time interests. Also, one of the main problems with my implementation was that the interface does not include a search by literal strings, which was because of the bitset's large space consumption. So, I would like to present a defense of this interface concession in terms of a formal anaylsis of the space and time costs of bitset of verses vs. list of word locations in an array:
SPACE:
Lets say a user enters a single word for a search and does not give any restrictions on location. In this case, both bitset and array would take about the same time. This is because in both cases, the word will be looked up in the hashtable, and then a full enumeration of its verses will be given to the user, which will linear in both cases. However, if a user enters a whole list of 3, 5, or even 10 keywords, the situation changes dramatically. The following is a comparison of an array indexed by word (that would allow literal string searches), to a bitset addressed by verse (that does not) in the example of a search of multiple keywords: (i.e., continuting to show why I disallowed literal string searches in order to use the bitset)
Send questions and comments to clbailey@ucsd.edu (Or ask any time during the presentation)