Department of Computer Science and Engineering CSE 250A
University of California at San Diego Winter 2001

Team Project 1

DUE MONDAY JANUARY 29, BEFORE 4:30 PM.



The purpose of this assignment is to select and apply a heuristic search algorithm for finding information in real-time on the web.

The task is to design and implement a web crawler that finds web pages containing a desired type of content.  For example, one specific objective might be to find mp3 files containing music by Led Zeppelin.  Another specific objective might be to find resumes of software engineers posted less than a week ago.  You should choose an interesting objective, but also make your software reusable for other objectives.

Your principal research contribution should be an intelligent search algorithm that uses heuristic knowledge to decide which links to explore next, attempting to find as many pages with the desired content as quickly as possible, while exploring as few unnecessary other pages as possible.

For an overview of the issues involved in this project, see Efficient Web Spidering with Reinforcement Learning by Jason Rennie and Andrew McCallum, ICML'99.  Note that the algorithm ultimately proposed by this paper is a simple approximation of reinforcement learning that resembles a standard non-adaptive heuristic search method.  For this project, you should not attempt to implement any learning or adaptiveness.

See also Focused Crawling Using Context Graphs by M. Diligenti, F. Coetzee, S. Lawrence, C.L. Giles, M. Gori, VLDB'2000.  In general, whenever you are reading technical papers, and especially when you are only looking for relevant ideas to use and/or cite, remember that you do not need to understand the whole paper.  Almost always, most of what you need is in just part of the paper.  Usually, the most important and useful ideas are near the beginning and are less technical.  Often you do not need to understand the details of someone else's technical paper.  However, remember that all the details in any paper you write must be exactly correct.

You should not attempt to build a web crawler from scratch.  Instead, you should reuse existing software as much as possible while adding your own heuristic search algorithm.  Here are recommended starting points:

You may use commercial software if you have access to it, for example the IBM Web Crawler.  However if you do so, you must have documented authorization to discuss the details of your work and your experimental results in a paper available to the public.

Be sure that your web crawler does not cause excessive load on any servers or network connections.  For a discussion of the systems programming issues involved in creating a web crawler, see the papers of the Mercator project.  For a very different approach to crawling the web, see the InfoSpiders project of Filippo Menczer (UCSD CSE Ph.D., 1998).

Some of the issues to consider in your work and to discuss in your paper include the following:

You should also address explicitly and discuss in your paper high-level systems issues.  In particular, how can you maximize the number of pages visited per second by your crawler?  What is the effect of a low or high number of parallel threads trying to load web pages?  What is the effect of a short or long timeout for pages that are slow to load?  Do not get into low-level systems programming issues.

The most important part of your paper should be a description of the results of experiments performed with your software.  Use the papers mentioned above as inspiration for the design of your experiments.  Make sure that the experiments are easy to understand, so that the implications of the results obtained are as clear as possible.  The experiments may be preliminary and small in scale, but they should be designed so that you could perform similar definitive experiments if time and resources were available.  For more information on how to design experiments and how to report their results, see these Final Project Report instructions written by Dale Schuurmans of the University of Waterloo.

All your work should be described in a well-written report that follows all the CSE 250A guidelines.  As an appendix to your report, you should attach a printout of your software, which should be adequately commented and documented.  Reports will be evaluated with these grading criteria.

The authors of one or more of the best reports submitted will be asked to revise their report(s) for distribution to all CSE 250A students.  These authors will receive extra credit of 25% of the total credit available for the project, and they will be encouraged to submit a paper for publication, if appropriate.