SCIA Data Integration Tool
SCIA is a computer system written by Jenny Wang in joint work with Joseph Goguen. It is an interactive but highly automated tool for data integration. It is being used in the large NSF SEEK (Science Environment for Ecological Knowledge) project to generate mappings between data sets with different schemas, possibly of different kinds, and to match ports in scientific workflows in the Kepler system. SCIA currently supports matching XML Schemas, DTDs, and relational schemas, and it is being extended to other data models, as well as to ontologies.

It is unreasonable to expect fully automatic tools for information integration. In particular, it is difficult to find correct schema matches, especially where there are n-to-m matches, semantic functions, conditions, and/or diverse data models; it may not even be clear what correctness means in such situations. Most related tools only match nodes of schemas, so that separate, very time-consuming hand editing is required for n-to-m matches, semantic functions, and conditions. SCIA instead integrates these tasks in a single convenient GUI, and also generates executable views for data translation. Whereas other tools try to do all matching automatically, and fully succeed only for relatively simple matches, SCIA finds critical points, which are the most difficult matches over large subschemas; these are where user input is maximally useful. After receiving user input (often in the form of graphically accepting generated suggestions), SCIA does as much as is reasonable automatically, identifies the next critical point, and iterates until the system and the user are both satisfied. Critical points are determined using a complex combination of matching algorithms that draw on element names, paths, data types, semantic types, natural language descriptions, constraints, and (innovatively) path contexts (i.e., hierarchical structure). Experimental results using real-world data from bibliographic, biological, ecological, and business datasets show that critical points can significantly reduce total user effort, and that path contexts can improve both accuracy and efficiency.
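
The idea of combining several kinds of evidence and asking the user only where the evidence is ambiguous can be sketched roughly as follows. This is a minimal illustration, not SCIA's actual algorithm: the weights, the `margin` threshold, the reading of "critical point" as "the two best candidates score almost equally", and all function names are assumptions made for the example. Only three of the evidence sources listed above (element names, data types, and path contexts) are modeled, each very crudely.

```python
from difflib import SequenceMatcher

# Illustrative weights only; how SCIA really combines its matchers is not given here.
WEIGHTS = {"name": 0.5, "type": 0.2, "context": 0.3}


def text_sim(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(src, tgt):
    """Combine name, data type, and path-context evidence for one pair.

    Each element is (path, data_type), where path is a tuple of element
    names from the schema root down to the element itself.
    """
    (src_path, src_type), (tgt_path, tgt_type) = src, tgt
    name = text_sim(src_path[-1], tgt_path[-1])
    dtype = 1.0 if src_type == tgt_type else 0.0
    # Path context: compare the ancestors, i.e. the hierarchical position.
    context = text_sim("/".join(src_path[:-1]), "/".join(tgt_path[:-1]))
    return (WEIGHTS["name"] * name
            + WEIGHTS["type"] * dtype
            + WEIGHTS["context"] * context)


def find_critical_points(sources, targets, margin=0.1):
    """Return source elements whose two best candidates score within `margin`.

    Assumption: such near-ties stand in for the matches where user
    input would help most.
    """
    critical = []
    for src in sources:
        top = sorted((score(src, tgt) for tgt in targets), reverse=True)[:2]
        if len(top) == 2 and top[0] - top[1] < margin:
            critical.append(src)
    return critical


if __name__ == "__main__":
    # Hypothetical bibliographic schemas, invented for this example.
    sources = [
        (("library", "book", "title"), "string"),
        (("library", "book", "author", "name"), "string"),
    ]
    targets = [
        (("catalog", "publication", "title"), "string"),
        (("catalog", "publication", "writer", "name"), "string"),
        (("catalog", "publisher", "name"), "string"),
    ]
    for path, _ in find_critical_points(sources, targets):
        print("ask the user about:", "/".join(path))
```

In an interactive loop, each answer from the user would fix one such match, after which the remaining elements would be rescored and the next critical point identified.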

It is an apparently little-recognized scandal that the database community has never defined some of the most basic concepts of that field, such as database and schema, although it has of course produced beautiful theories of specific kinds of databases and their schemas, especially for the relational case. However, general definitions are needed to support a heterogeneous data integration tool like SCIA. It should also be noted that scientific data is more often found in spreadsheets and structured files, without any explicit metadata, than in relational databases with well-specified schemas or ontologies. We have therefore developed a theory of abstract schemas and abstract schema morphisms, which gives a semantics for n-to-m matches with semantic functions and/or conditions over diverse data models. This theory provides the semantic foundation for SCIA; see Data, Schema, Ontology, and Logic Integration for the technical details.
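
To make "n-to-m matches with semantic functions and/or conditions" concrete, here is a small sketch that treats a match as a record holding n source fields, m target fields, a function between them, and an optional condition. It is only an illustration under assumptions: the `Match` class, the `translate` helper, and the field names are invented for the example. They are neither SCIA's internal representation nor the abstract schema morphisms of the theory.

```python
from dataclasses import dataclass
from typing import Callable


def always(record):
    """Default condition: the match applies to every record."""
    return True


@dataclass(frozen=True)
class Match:
    sources: tuple                      # n source field names
    targets: tuple                      # m target field names
    function: Callable[..., tuple]      # semantic function: n values -> m values
    condition: Callable[[dict], bool] = always


def translate(record, matches):
    """Apply every match whose condition holds and collect the target fields."""
    out = {}
    for match in matches:
        if match.condition(record):
            values = match.function(*(record[name] for name in match.sources))
            out.update(zip(match.targets, values))
    return out


# Hypothetical matches, invented for this example.
MATCHES = [
    # 2-to-1 match with a semantic function (concatenation).
    Match(("first", "last"), ("full_name",),
          lambda first, last: (f"{first} {last}",)),
    # 1-to-1 matches with conditions: convert only Fahrenheit readings.
    Match(("temp",), ("temp_c",), lambda t: ((t - 32) * 5 / 9,),
          condition=lambda r: r["unit"] == "F"),
    Match(("temp",), ("temp_c",), lambda t: (t,),
          condition=lambda r: r["unit"] == "C"),
    # 1-to-2 match: split an ISO date into year and month.
    Match(("date",), ("year", "month"),
          lambda d: tuple(int(part) for part in d.split("-")[:2])),
]

if __name__ == "__main__":
    record = {"first": "Ada", "last": "Lovelace",
              "temp": 212.0, "unit": "F", "date": "2004-05-17"}
    print(translate(record, MATCHES))
```

Keeping the condition separate from the function means the same source field can map to the same target in different ways depending on the data, as the two temperature matches show.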


To the Information Integration, Databases, and Ontologies page, where references can be found
Maintained by Joseph Goguen