Rapidly advancing technologies are changing the nature of research in almost every field of science, and many other fields as well. For example, cheap sensors can produce unprecedented volumes of information about ecological sites, cheap mass storage allows unprecedented amounts of such data to be preserved, and computer power allows it to be analyzed in unprecedentedly complex ways. A major difficulty in realizing the promise of all this arises from the need to integrate data from multiple sources, often formatted in incompatible ways, and even worse, represented using incompatible assumptions, some of which may be highly implicit. For example, data from one source may be based on weekly samples and measured in meters, while data from another source is sampled every 10 days, and measured in feet; an implicit assumption for the first source might be that missing data points are filled in with interpolated values, whereas the second source just omits them; they may also have been sampled at different times (e.g., noon vs. midnight). An example of a significant implicit variable is the elapsed time between taking a sample and analyzing it in the lab; if these elapsed times are sufficiently different at different sites, and if some substances decay rapidly, then measurements will have to be recalibrated in order to be compared meaningfully. Furthermore, the data to be analyzed may be stored in a variety of databases having different data models, formats, and platforms. Similar problems can arise in many other fields, including textual analysis, computer integrated manufacturing, molecular biology, and data mining.
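To make the flavor of such mismatches concrete, here is a tiny hypothetical sketch in the OBJ-style algebraic notation used later on this page (Section 8); the feet-to-meters factor is standard, while the operation names and the midpoint rule for filling a missing sample are invented purely for illustration:

    obj RECONCILE is
      protecting FLOAT .
      *** convert a measurement reported in feet to meters
      op ft2m : Float -> Float .
      var F : Float .
      eq ft2m(F) = F * 0.3048 .
      *** fill a missing sample by averaging its two neighbors
      op midpoint : Float Float -> Float .
      vars A B : Float .
      eq midpoint(A, B) = (A + B) / 2.0 .
    endo

    *** e.g., re-express a 10-foot reading in meters
    red ft2m(10.0) .

Even this trivial fragment makes the reconciliation policy explicit, rather than leaving it buried in ad hoc scripts of the kind discussed below.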
The XML language for semi-structured data is rapidly gaining acceptance, and has been proposed as a solution for data integration problems, because it allows flexible coding and display of data, using metadata to describe the structure of data (e.g., a DTD or Schema). Although this is important, it is less useful than often thought, because such metadata can only define the syntax of a class of documents. Moreover, even if an adequate semantics were available for each document class, this would still not support the integration of data that is represented in different ways, because it would give no way to translate among the different datasets. In addition to dealing with datasets that appear in computer readable documents and databases, users may also want to compare the results of simulation packages with empirical datasets; these may involve still other formats and implicit assumptions. Very complex workflows can easily arise in contemporary scientific research, as well as in industrial and commercial practice.
The research described on this webpage is intended to address such problems of data integration, through the construction of a general tool called SCIA, the use of ontologies, and the development of supporting theory, as described below. We are also interested in the critical exploration of the limitations of such tools and methods.
A promising approach is to develop tools that go beyond syntax by using semantic metadata. But despite some optimistic projections to the contrary, the representation of meaning, in anything like the sense that humans use that term, is far beyond current information technology. As explored in detail in fields such as Computer Supported Cooperative Work (CSCW), understanding the meaning of a document often requires a deep understanding of its social context, including how it was produced, how it is used, its role in organizational politics, its relation to other documents, its relation to other organizations, and much more, depending on the particular situation. Moreover, all these contexts may be changing at a rapid rate, as may the documents themselves, and the context of the data is also often both indeterminate and evolving. Another complication is that the same document may be used in multiple ways, some of which can be very different from others.
These complexities mean that it is unrealistic to expect any single semantics to adequately reflect the meaning of the documents of some class for every purpose. Most attempts to deal with these problems in existing literature and practice are either ad hoc ("We just wrote some Perl scripts") or else are what we may call "high maintenance" solutions, involving complex infrastructure, such as commercial relational databases, high volume data storage centers, and ontologies written in specialized languages to describe semantics. Solutions of the first kind are typically undocumented and cannot be reused, whereas solutions of the second kind require considerable effort from highly skilled computer professionals, which can be frustrating for application experts, due to the difficulty of discovering, communicating, formalizing and especially updating, all the necessary contextual information. For this reason, many application scientists prefer to avoid high maintenance solutions, and do the data integration themselves in an ad hoc manner (often using graduate students or other assistants).
One approach is to provide tools to make data integration using semantic metadata much easier for application scientists to do themselves. Section 3 below describes SCIA, a GUI tool that metadata integration engineers can use to generate mappings between a virtual master database and local databases, from which end-user queries can be answered. A second, more flexible, approach is described in Section 8, using an ultra high level programming language based on equational logic.
It is important to know what role data integration actually plays in particular research settings, in order to design tools that will be useful in actual practice. In particular, data integration problems can have significant social dimensions. For example, we have already mentioned ethnographic studies indicating that scientists often prefer simple tools that closely match current needs, rather than high maintenance general purpose tools, of the kind computer scientists might prefer.
Although ontologies are promising for certain applications, many difficult problems remain, in part due to the essentially syntactic nature of ontology languages (e.g. OWL), the computationally intractable nature of highly expressive ontology languages (such as KIF), and the difficulty of interoperability among the many existing ontology languages, as well as among the ontologies written in those languages. Difficulties of another kind stem from the unrealistic expectations engendered by the many exaggerated claims made in the literature.
The goal of research in Data, Schema and Ontology Integration and Information Integration in Institutions is to provide a rigorous foundation for information integration that is not tied to any specific representational or logical formalism, by using category theory to achieve independence from any particular choice of representation, and using institutions to achieve independence from any particular choice of logic. The information flow and channel theories of Barwise and Seligman are generalized to any logic by using institutions; following the lead of Kent, this is combined with the formal concept analysis of Ganter and Wille, and the lattice of theories approach of Sowa. We also draw on the early categorical general systems theory of Goguen as a further generalization of information flow, and on Peirce to support triadic satisfaction.
It is unreasonable to expect fully automatic tools for information integration; in particular, it is difficult to find correct schema matches, especially where there are n-to-m matches, semantic functions, conditions and/or diverse data models; it may not even be clear what correctness means in such situations. To help address this, we have developed a tool called SCIA for XML Schema and DTD matching, which finds those "critical points" where user input is maximally useful, does as much as is reasonable automatically, identifies new critical points, and iterates these steps until convergence is achieved. Critical points are determined using path contexts and a combination of matching algorithms; a convenient GUI provides hints and accepts user input at critical points; and view generation supports data transformation and queries. Tests with various datasets show that critical points and path contexts can significantly reduce total user effort.
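The shape of this interaction loop can be suggested by a small sketch in the OBJ-style notation used later on this page; every sort and operation name here is invented, and the sketch records only the structure of the iteration, not how critical points are actually computed:

    obj MATCH-LOOP is
      sort State .                      *** a partial match between two schemas
      op refine : State -> State .      *** one automatic matching pass
      op ask : State -> State .         *** user input at the current critical points
      op converged : State -> Bool .    *** true when no critical points remain
      op iterate : State -> State .
      var S : State .
      eq iterate(S) = if converged(S) then S else iterate(refine(ask(S))) fi .
    endo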
Semantic models are needed for mappings of XML DTDs and XML Schemas, relational and object oriented schemas, and even spreadsheets and structured files, all with integrity constraints. We have developed a theory of abstract schemas and abstract schema morphisms, which provides a semantics for n-to-m matches with semantic functions and/or conditions over diverse data models. The theory provides semantic foundations for our schema mapping tool.
Ontologies, in the sense of formal semantic theories for datasets (not the sense of academic philosophy), are increasingly being proposed, and even used, to support the integration of information that is stored in heterogeneous formats, especially in connection with the world wide web, but also for other, less chaotic, forms of distributed database. In particular, ontologies have been proposed as a key to the success of the so called "semantic web."
Formally speaking, an ontology is a theory over a logic. Although this may sound straightforward, ontologies unfortunately are proliferating almost as quickly as the datasets that they are meant to describe. Therefore, integrating datasets whose semantics are given by different ontologies will require that their ontologies be integrated first. This task is greatly complicated by the fact that many different languages are in use for expressing ontologies, including OWL, Ontologic, Flora, KIF, and RDF, each of which has its own logic. Therefore, to integrate ontologies, it may be necessary first to integrate the logics in which they are expressed. Moreover, dataset integration will also have to take account of the fact that the schemas describing structure are also often expressed in different languages, reflecting different underlying data models, e.g., relational, object oriented, spreadsheet, and formatted file.
This tangle of questions can be approached using the theory of institutions, which provides an axiomatization of the notion of logical system, based on Tarski's idea that the notion of satisfaction is central. One can then define theories over an institution, and theory morphisms can be used for translating ontologies over a given logic. The further notion of institution morphism is needed for translating between different logical systems (see the paper Institution Morphisms, by Joseph Goguen and Grigore Rosu), and morphisms of theories over different institutions are accommodated by Diaconescu's Grothendieck institution construction, as discussed in the papers Data, Schema and Ontology Integration and Information Integration in Institutions. Some limitations of ontologies are discussed in Ontology, Ontotheology, and Society. We intend to extend SCIA to handle ontology integration, and to take advantage of ontologies in dataset integration.
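For readers not already familiar with this formalism, the basic definition (due to Goguen and Burstall) can be sketched as follows, in standard notation: an institution consists of a category \(\mathbf{Sign}\) of signatures and signature morphisms, a functor \(Sen \colon \mathbf{Sign} \to \mathbf{Set}\) giving the sentences over each signature, a functor \(Mod \colon \mathbf{Sign}^{op} \to \mathbf{Cat}\) giving the models of each signature, and for each signature \(\Sigma\) a satisfaction relation \({\models_\Sigma} \subseteq |Mod(\Sigma)| \times Sen(\Sigma)\), such that for every signature morphism \(\varphi \colon \Sigma \to \Sigma'\), every \(\Sigma'\)-model \(M'\), and every \(\Sigma\)-sentence \(e\),

    \[ M' \models_{\Sigma'} Sen(\varphi)(e) \;\Longleftrightarrow\; Mod(\varphi)(M') \models_{\Sigma} e . \]

A theory over an institution is a pair \((\Sigma, E)\) with \(E \subseteq Sen(\Sigma)\), and a theory morphism \((\Sigma, E) \to (\Sigma', E')\) is a signature morphism \(\varphi \colon \Sigma \to \Sigma'\) such that \(E' \models_{\Sigma'} Sen(\varphi)(e)\) for every \(e \in E\); in these terms, an ontology over a fixed logic is a theory, and translating between such ontologies amounts to giving a theory morphism.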
Another research project in our group, called algebraic semiotics, is also useful here; it has the goal of developing a scientific understanding of basic issues of usability, representation and coordination that arise in interface design and related areas, especially the visualization of scientific data, and the organization of complex information using multimedia resources; there is also some focus on distributed cooperative work and on semiotics. For details, see the Short Overview of Algebraic Semiotics and the Brief Annotated Bibliography given there, as well as the User Interface Design homepage, and the UCSD course CSE 271.
Data integration is a good topic for combining our interests in algebraic semantics, user interface design with algebraic semiotics, and the sociology of science and technology (see the Sociology of Technology page, and the UCSD course CSE 275 for further information on our approach to these areas, which can help reveal what users of data integration services really need).
A different approach is to use the extremely high level BOBJ language (supplemented with standard programs for functions like reading data and doing statistical calculations). BOBJ is an algebraic programming and specification language, a recent member of the OBJ family, having a modern implementation in Java. It supports very high level abstractions, and very powerful parameterized modularization. Its user-definable syntax and semantics allow users to quickly define both the syntax and the semantics of complex data structures, and its pattern matching makes it easy to define translations among data structures. In addition, its high level of abstraction makes it easier to write code, and its modularization makes it easier to reuse code; these points are especially important because our research shows that programs of this kind typically evolve through many iterations. An additional advantage of this approach is that users can also define their own application-oriented query languages, instead of being forced to use SQL (though SQL will still be needed to access some legacy databases, and it is used for illustration in the sample program given below). Unsurprisingly, BOBJ executes more slowly than conventional languages (since it is interpreted rather than compiled), but it should easily be adequate for the proposed applications.
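As a small illustration of this style (a hypothetical sketch, not taken from the sample code discussed below), the following OBJ-style module defines two simple bibliographic record formats and a pattern-matching translation from one to the other; all sorts, operations, and data here are invented for the example:

    obj BIB-TRANSLATE is
      protecting NAT .
      sorts Name RecA RecB .
      *** a few sample identifiers, just for this illustration
      ops goguen lin institutions scia : -> Name .
      *** two hypothetical bibliographic record formats
      op entry : Name Name Nat -> RecA .   *** author, title, year
      op rec : Name Name -> RecB .         *** title, author (year omitted)
      *** the translation is defined by pattern matching on the constructors
      op toB : RecA -> RecB .
      vars A T : Name .
      var Y : Nat .
      eq toB(entry(A, T, Y)) = rec(T, A) .
    endo

    *** a sample reduction, in the same spirit as the test cases in data1.obj
    red toB(entry(goguen, institutions, 1992)) .

In a real integration the record formats would of course be richer and the translation equations more numerous, but the basic pattern, and the ease of modifying it as formats evolve, remain the same.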
To test this, we are writing programs to see if particular problems can be solved with relatively little effort, using concise, modular, reusable code. In the future, we hope to test the feasibility of our approach with larger case studies on real problems in ecology, using datasets developed by the Long Term Ecological Research (LTER) project, and stored in facilities administered by the San Diego Supercomputer Center (SDSC). We hope to support light-weight, user-produced data integration programs.
BOBJ is a logical programming language, in the sense that it (unlike Prolog) is rigorously based on a logic, in this case, three variants of order sorted equational logic, called loose, initial, and hidden. This means that any program written in BOBJ is a precise mathematical description of what it does, thus facilitating semantic integration and verification. For a general introduction to BOBJ, see the BOBJ entry in the OBJ family homepage. For more detail and examples, see the BOBJ language homepage.
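To give the flavor of the first two variants (a schematic illustration, unrelated to any particular application), a theory in OBJ-style notation has loose semantics, describing a whole class of models, while an object has initial semantics, denoting a single model up to isomorphism:

    th MONOID is
      *** loose: any monoid whatsoever is a model
      sort M .
      op e : -> M .
      op _*_ : M M -> M [assoc id: e] .
    endth

    obj NAT-ADD is
      *** initial: the natural numbers with addition, up to isomorphism
      sort Nat .
      op 0 : -> Nat .
      op s_ : Nat -> Nat .
      op _+_ : Nat Nat -> Nat .
      vars M N : Nat .
      eq 0 + N = N .
      eq (s M) + N = s (M + N) .
    endo

Hidden (behavioral) specifications, the third variant, are used for encapsulated objects with local state.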
Some initial experiments have been with bibliographies, since there are many of these on the web using XML. The sample BOBJ code (written by Kai Lin) is in the file data1.obj. Despite its small size (especially if we exclude the LIST module, which merely defines a standard data structure), this code does a lot of work: it accepts and parses a user query (in a subset of SQL), translates it into two database queries (also in SQL), integrates the answers, and then returns the result to the user. The last two blocks, beginning "red query", are not code at all, but rather test cases, illustrating how to use the code. The output from running these two test cases is given in the file named out.
Unfortunately, it would take considerable space to explain this code in detail to readers who do not already have some background in OBJ, while readers who do have such a background (and who also know a little SQL) would find such an explanation largely redundant. Some tutorial material on algebraic semantics and OBJ may be found in the author's version of the UCSD graduate course CSE 230, on programming languages. A more recent, somewhat improved, version of this code may be found at data2.obj.