
Things that are changing

AltaVista was, in 1995, arguably the first search engine offered for general use, and so AltaVista's history is especially interesting. At that time, AltaVista was developed by Digital Equipment Corporation primarily to demonstrate just how powerful their new Alpha architecture was, especially its then-novel 64-bit addressing and the consequently vast data spaces. Indexing all the WWW's pages and providing a useful service to many was simply good publicity.

Since that time Digital Equipment Corporation has been acquired by Compaq, and AltaVista spun off to CMGI. As searching newly authored pages on the WWW has become increasingly profitable, similar search technology has been applied to existing, traditionally published corpora to form the next generation of DIGITAL LIBRARIES [Fox98] [Paepcke98]. It is amazing how closely they resemble the vision H.G. Wells had, as early as 1938, of what a ``World Encyclopedia'' might mean [Wells38]!

As the Internet has reached a mass audience and these new search engine users begin to FOA in earnest, important new data is becoming available about just how these real users (as opposed to most IR experimental subjects, cf. Section §4.3.1) behave. Silverstein et al. report on their analysis of approximately one billion $(10^9)$ queries issued against the AltaVista search engine during six weeks in August and September, 1998 [Silverstein99]. One important qualification on this preliminary study is that no attempt was made to discriminate ``real,'' human-generated queries from automatic queries generated by robots. Still, several features of this study are significant.

First, fully 15% of the queries were entirely empty; they contained no keywords! Two-thirds of these empty queries were generated within AltaVista's ``advanced query'' interface. Clearly, good interface design and user education remain fundamental issues for effective search engine design.

Second, WWW searches use very short, simple queries, averaging only 2.3 keywords/query (the zero-length queries are not included in this average). Only 12.6% of queries used more than three keywords. Of course, the fact that AltaVista's interface does not easily support longer, RelFbk queries (cf. Section §3.6) keeps these from occurring. Most users also avoid query syntax and issue simple queries: only 20% of queries used any of AltaVista's query operators ({\tt +, -, and, or, not, near}); half of these used only one operator.
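
The length and operator statistics above are straightforward to compute from a raw query log. The following is only a minimal sketch, not the procedure Silverstein et al. used: it assumes whitespace tokenization, treats the words and, or, not, near and a leading + or - as operators, excludes operator words from the keyword count, and uses non-empty queries as the denominator for everything except the empty-query fraction. The function names are hypothetical.

```python
# Assumed operator vocabulary, taken from the operators listed in the text.
OPERATOR_WORDS = {"and", "or", "not", "near"}


def uses_operator(query):
    """True if any token is an operator word or carries a leading + or -."""
    for token in query.split():
        if token.lower() in OPERATOR_WORDS:
            return True
        if len(token) > 1 and token[0] in "+-":
            return True
    return False


def keyword_count(query):
    """Count whitespace-separated tokens that are not operator words."""
    return sum(1 for t in query.split() if t.lower() not in OPERATOR_WORDS)


def query_stats(queries):
    """Summarize a list of raw query strings."""
    total = len(queries)
    nonempty = [q for q in queries if q.strip()]
    n = len(nonempty)
    lengths = [keyword_count(q) for q in nonempty]
    return {
        "fraction_empty": (total - n) / total if total else 0.0,
        "mean_keywords": sum(lengths) / n if n else 0.0,
        "fraction_over_three": sum(1 for k in lengths if k > 3) / n if n else 0.0,
        "fraction_with_operator": sum(1 for q in nonempty if uses_operator(q)) / n if n else 0.0,
    }


if __name__ == "__main__":
    sample = ["", "altavista", "cheap flights paris", "+jazz -smooth",
              "cats and dogs playing outside", "  "]
    print(query_stats(sample))
```

Whether operator words should count as keywords, and which denominator each percentage uses, are choices the text does not settle; changing either shifts the reported averages.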

These findings are especially significant because they paint a very different picture of the ``typical user'' than IR has traditionally held. When IR systems were first developed, the target audience was primarily reference librarians, SEARCH INTERMEDIATES who helped library patrons find what they were seeking from sophisticated systems such as DIALOG. These librarians were specially educated, in particular in the subtleties of Boolean query operators and other sophisticated techniques for constructing exactly the right ``magic bullet'' query for a particular corpus. IR system design and theory therefore generally assumed that queries were fairly rich, structured expressions. At least at the moment, these assumptions do not seem to hold for most Web searching.

But despite the relatively simple form of most queries, the third interesting fact is that Web queries are rarely repeated. Even folding case and ignoring word order, only one third of queries appeared more than once in the billion queries; only 14% occurred more than three times. Evidence for wide query novelty is especially striking given that, at least at this juvenile stage of Internet usage, by far the dominant query topic is {\tt SEX}. Not only was {\tt SEX} the most common token, but sex-related terms made up 17 of the 25 most frequent query terms. {\tt MP3} and {\tt CHAT} were the most popular non-sex-related tokens, but their frequency was approximately a third of that of {\tt SEX}. These statistics are especially significant in the face of new services such as AskJeeves, which focus on providing especially relevant answers for a restricted set of anticipated queries.
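
Counting repeats after ``folding case and ignoring word order'' amounts to mapping each query to a canonical form before tallying. The sketch below is one plausible reading, not the study's actual code: it lowercases, splits on whitespace, and sorts the tokens, and it assumes the reported fractions are taken over distinct normalized queries. The names normalize and repetition_fractions are hypothetical.

```python
from collections import Counter


def normalize(query):
    """Fold case and ignore word order: lowercase, then sort the tokens."""
    return " ".join(sorted(query.lower().split()))


def repetition_fractions(queries, thresholds=(1, 3)):
    """For each threshold t, return the fraction of distinct normalized
    queries that occur more than t times (empty queries are skipped)."""
    counts = Counter(normalize(q) for q in queries if q.strip())
    distinct = len(counts)
    if distinct == 0:
        return {t: 0.0 for t in thresholds}
    return {
        t: sum(1 for c in counts.values() if c > t) / distinct
        for t in thresholds
    }


if __name__ == "__main__":
    log = ["New York", "york new", "NEW YORK", "new york", "mp3", "MP3", "chat"]
    print(repetition_fractions(log))
```

Here "New York" and "york new" collapse to the same key, so the sample log contains three distinct queries, two of which repeat.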

Finally, Silverstein et al. attempted to analyze query sessions. Knowing just when a query is part of a session is notoriously difficult, especially when some queries are being generated by robots; this study used a combination of server-set cookies and a five-minute time window to capture coherent searches by the same user. It appears that 78% of query sessions involve only a single query, and that an average session involves only two queries! These data are preliminary, but provide an interesting contrast to the power-law, Zipfian distribution of Web surfing behavior reported by Huberman et al. [Huberman98] (cf. Section §3.2.2).
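
The cookie-plus-time-window heuristic can be illustrated with a short sketch. It assumes a log of (cookie, timestamp in seconds, query) tuples and starts a new session whenever more than five minutes pass between consecutive queries from the same cookie; the text does not say exactly how the window was applied, so this gap-based reading, along with the names split_sessions and session_summary, is an assumption.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 5 * 60  # the five-minute window, read here as a maximum gap


def split_sessions(log, gap=SESSION_GAP_SECONDS):
    """Group (cookie, timestamp, query) records into per-user sessions."""
    by_cookie = defaultdict(list)
    for cookie, timestamp, query in log:
        by_cookie[cookie].append((timestamp, query))

    sessions = []
    for events in by_cookie.values():
        events.sort(key=lambda e: e[0])  # order each user's queries by time
        current = []
        last_time = None
        for timestamp, query in events:
            # A long pause closes the current session and opens a new one.
            if last_time is not None and timestamp - last_time > gap:
                sessions.append(current)
                current = []
            current.append(query)
            last_time = timestamp
        if current:
            sessions.append(current)
    return sessions


def session_summary(sessions):
    """Return (fraction of single-query sessions, mean queries per session)."""
    n = len(sessions)
    if n == 0:
        return 0.0, 0.0
    single = sum(1 for s in sessions if len(s) == 1) / n
    mean = sum(len(s) for s in sessions) / n
    return single, mean


if __name__ == "__main__":
    log = [("a", 0, "jazz"), ("a", 100, "jazz clubs"),
           ("a", 1000, "mp3"), ("b", 50, "chat")]
    print(session_summary(split_sessions(log)))
```

A robot that reuses one cookie and issues queries at a steady rate would appear here as a single very long session, which is one reason the session figures should be treated as preliminary.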

The primary extension of the search engine technology developed so far in this text is the CRAWLING function that must harvest web pages prior to their indexing. The design of web crawlers is now one of the most active areas of computer science research, and we provide only a few basic references here.



FOA © R. K. Belew - 00-09-21