Our ability to identify 'relevant' documents demands that we appreciate the CONTEXT within which they were originally written. The intended meaning of any document depends on the author's anticipated audience and on the words specially selected to communicate with it. The cacophony now generated by modern Web search engines, as documents written for vastly different purposes and audiences are pulled into a single hit list, makes us acutely aware of the context that has been stripped away.
Our group has begun to focus on CITATIONS as an especially rich and ubiquitous source of data by which one document is anchored into a fabric composed of many others. The critical role of citation in the scientific and legal literatures has led to especially deep analysis of 'bibliometric' patterns found there; we build on these results. Much of this work has focused on features of individual documents, such as counting the number of references to a cited work and using this as an indication of "impact."
We are especially interested in going beyond such "first order" features of individual documents to LEARN RELATIONS among documents. We hypothesize that sufficient textual clues exist, surrounding the citation source and in the cited document, to allow several useful semantic features of the citation to be inferred. In the legal context, the obvious target for citation type categories is Shepard's. We propose to begin with the more unambiguous "history" labels, since syntactic facts arising from the dates, jurisdictions, and standing of the two opinions can be applied. More interesting, however, are the "treatment" labels. Previous work (with Dan Rose) considered a reduced-dimensional space within which a large number of the labels might be compressed. If so, this would reduce the technical problems of inducing them considerably.
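To make the "syntactic facts" idea concrete, the sketch below shows one way such facts might constrain the candidate history labels before any text is examined. The record fields, label vocabulary, and rules here are illustrative assumptions, not drawn from Shepard's actual schema or from any implementation described above.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record for a single opinion; the field names and the
# integer court-level encoding are illustrative assumptions.
@dataclass
class Opinion:
    decided: date
    jurisdiction: str   # e.g. "US-9th-Cir"
    court_level: int    # e.g. 1 = trial, 2 = appellate, 3 = supreme

def possible_history_labels(citing: Opinion, cited: Opinion) -> set:
    """Return the history-style labels that purely syntactic facts
    (dates, jurisdictions, court standing) leave open.  The label
    vocabulary is a simplified stand-in, not Shepard's own."""
    labels = set()
    # A later, higher court in the same jurisdiction can affirm,
    # reverse, or modify the earlier opinion.
    if (citing.decided > cited.decided
            and citing.jurisdiction == cited.jurisdiction
            and citing.court_level > cited.court_level):
        labels |= {"affirmed", "reversed", "modified"}
    return labels
```

The point of the sketch is that an earlier or lower-standing opinion can be ruled out as a source of direct history entirely from metadata, narrowing the inference problem the textual clues must then solve.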