

For some time the study of evolution has demanded an especially interdisciplinary approach, so it is no wonder that profound difficulties arise when scientists trained in paradigms as varied as biology, psychology and even computer science [REF1123] attempt to communicate with one another. Now, of course, theories of evolution are increasingly informed by huge volumes of concrete data generated by the Human Genome Project and related efforts. Serendipitously, molecular biology is also one of the first (but certainly not the last) disciplines to undergo a qualitative change because of the WWW. The nearly simultaneous growth of the WWW and of genomic databases has meant that computational biology has grown up as a science with a very advanced notion of publication. Beyond formal publication channels, and even beyond informal email and discussion groups, the genomic databases at the heart of molecular biology today may point to forms of communication among scientists that are arguably, like image-based WWW traffic, POST-VERBAL.

The flood of biological sequence data -- nucleic acids, proteins, and now gene expression networks and metabolic pathways -- into sequence databases, together with the related flood of molecular biology literature, represents an unprecedented opportunity to investigate how concepts learned automatically from various data sets relate to the words and phrases scientists use to describe them. Learning this linkage -- between molecular biology concepts and the genomic data relating to them -- can be described as annotating the data. It is now possible to learn many of these correspondences automatically, guided by the relevance feedback (RelFbk) of practicing scientists, as a natural by-product of their browsing through genome data and the publications related to it. RelFbk provides a key additional piece of information to learning algorithms, beyond the statistical correlations that may exist within the genome data or the textual corpora treated independently: it captures the fact that a scientist who deeply understands both the sequence data and the journal articles does (or does not) believe that a particular sequence and a particular keyword/concept share a common referent. Once sequences are posted, their annotations are often constructed automatically, based on HOMOLOGOUS relations to other sequences found in the databases. A different variety of ``sequence search engines,'' specially developed to look for similarities among sequences rather than among documents, becomes the basis for retrievals. These retrievals can and often do connect the work of one scientist to that of another without a single verbal expression passing between them.
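The role that RelFbk plays in learning sequence-to-keyword links can be illustrated with a minimal sketch. It is not the learning algorithm the text has in mind; it only tallies votes, assuming each browsing judgment arrives as a (sequence, keyword, relevant-or-not) triple. The class name `AnnotationLearner`, its methods, and the sequence identifiers and keywords in the usage are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


class AnnotationLearner:
    """Hypothetical tally of relevance feedback on (sequence, keyword) pairs."""

    def __init__(self) -> None:
        # Net evidence that a sequence and a keyword share a common referent.
        self.weights: dict[tuple[str, str], float] = defaultdict(float)

    def feedback(self, sequence_id: str, keyword: str, relevant: bool,
                 step: float = 1.0) -> None:
        """Record one scientist's judgment: positive raises, negative lowers."""
        self.weights[(sequence_id, keyword)] += step if relevant else -step

    def annotations(self, sequence_id: str, threshold: float = 0.0) -> list[str]:
        """Keywords whose net evidence for this sequence exceeds the threshold."""
        return sorted(
            keyword
            for (seq, keyword), weight in self.weights.items()
            if seq == sequence_id and weight > threshold
        )


# Usage with made-up identifiers and keywords.
learner = AnnotationLearner()
learner.feedback("seq1", "kinase", relevant=True)
learner.feedback("seq1", "kinase", relevant=True)
learner.feedback("seq1", "histone", relevant=False)
print(learner.annotations("seq1"))
```

A real system would presumably combine such judgments with the statistical correlations already present in the sequence data and the text, rather than rely on the votes alone.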

The figure sketches the basic relations. On the bottom are the most fundamental classes of molecular data, namely gene and protein sequences. On the top is a set of scientific documents, such as those found in MEDLINE. The primary relations connecting the raw genetic data and the textual corpora are ANNOTATION links that scientists have (manually) established between articles and sequences. These links are both significant and useful. They are significant because they help to establish the construction of the genome as a piece of the scientific enterprise, linking it to the traditions of academic publication. They are useful to the many scientists who, for example, are interested in a particular gene or protein and want to find out all that others might know about it. But annotation is not done consistently by all participating scientists, nor has a precise semantics been established for what exactly an annotation should mean. The Entrez interface to MEDLINE makes it convenient for a user with a particular sequence in mind to find its corresponding publication, and vice versa. Together with the MeSH thesaurus of medical terms (cf. Section §6.3), these features make the National Library of Medicine's resource one of the most advanced on the WWW.

In addition to expediting the searches of scientists and doctors, the identification of significant patterns in one modality (i.e., in text or in sequence data) can be used to suggest hypotheses in the other, similar to the suggestions made by Swanson (cf. Section §6.5.3). Also shown in the figure are $\mathcal{S}im$ arcs relating ``similar'' data. In the case of genetic or protein sequence data, these similarity measures are typically based on a notion of ``edit distance'' computed by string-matching tools such as BLAST and FastA, but the investigation of new methods for this problem is one of the most active areas within machine learning (cf. [Glasgow96]). The investigation of inter-document similarities has been an important problem within the field of information retrieval (IR) for many decades. Most document similarity measures are based on correlations between the ``keywords'' that pairs of documents contain, but other methods (e.g., those based on a bibliometric analysis of shared entries in the documents' bibliographies) have also received considerable attention.
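The three kinds of $\mathcal{S}im$ measure mentioned above can be made concrete with a small sketch. The choices below are assumptions made for illustration: plain Levenshtein distance stands in for sequence similarity (BLAST and FastA are heuristic alignment tools with their own scoring schemes and do not compute this exact quantity), cosine similarity over keyword counts stands in for keyword correlation, and a count of shared references stands in for bibliometric analysis. All function names are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # delete char_a
                            curr[j - 1] + 1,      # insert char_b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def keyword_cosine(doc_a: list[str], doc_b: list[str]) -> float:
    """Cosine of the angle between two keyword-count vectors."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    dot = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def shared_references(refs_a: set[str], refs_b: set[str]) -> int:
    """Number of bibliography entries two documents have in common."""
    return len(refs_a & refs_b)


# Usage with made-up data.
print(edit_distance("ACGT", "AGT"))
print(keyword_cosine(["gene", "protein", "gene"], ["gene", "sequence"]))
print(shared_references({"ref1", "ref2"}, {"ref2", "ref3"}))
```

Note that edit distance is a dissimilarity (smaller means more alike), whereas the two document measures grow with similarity, so they would need to be put on a common scale before being drawn as the same kind of arc.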


FOA © R. K. Belew - 00-09-21