Emiran Curtmola

Has joined Teradata
ecurtmola -at- cs ucsd edu
Old contact info:

Computer Science and Engineering
University of California San Diego
9500 Gilman Drive
La Jolla, CA 92093-0404

Phone: +1 (858)-534-9913
Fax: +1 (858)-534-7029

About me

I am part of the database group in Computer Science at UCSD, where I work with Alin Deutsch and Yannis Papakonstantinou. I am also affiliated with the CNS center.

Research Interests
My research lies primarily in the foundational aspects of databases, at the intersection with information retrieval and distributed information systems. My current focus is on query optimization, unstructured data management, search (XML full-text, algorithms and systems), XML technologies, web-scale data integration and exchange, Semantic Web, distributed and P2P computing, and data privacy.

Professional Service
  · Program committee member: EDBT 2010, DEXA 2009, 2010, DTA 2009

  · External conference reviewer: VLDB 2008, VLDB PhD Workshop 2009

  · Teaching Assistant at UC San Diego:
    Database System Applications (CSE132B) - Fall 2003, Spring 2008, Spring 2009
    Server-side Web Applications (CSE135) - Spring 2009

  · Teaching Assistant at Polytechnic University of Bucharest, Romania:
    Data Structures and Algorithm Analysis, Fundamentals of Computer Graphics, Switching Theory and Logical Design, Numerical Calculus

Internships
  · IBM Almaden Research Center, USA, 2007-2008
    Mentors: Fatma Özcan, Andrey Balmin

  · AT&T Labs Research, USA, 2004-2006
    Mentors: Sihem Amer-Yahia, Divesh Srivastava

  · Infineon Technologies AG, Germany, 2002-2003
    Mentors: Raik Brinkmann, Hermann Ilmberger

Background
  · Ph.D. from University of California San Diego
    Computer Science and Engineering Department
    Thesis: Democratic Community-based Search with XML Full-Text Queries

  · M.S. from University of California San Diego
    Computer Science and Engineering Department

  · B.S. from Polytechnic University of Bucharest, Romania
    Computer Science and Engineering Department

Papers in Conferences and Workshops

  • WikiAnalytics: Ad-hoc Querying of Highly Heterogeneous Structured Data
         International Conference on Data Engineering. ICDE 2010 Demonstration
          Andrey Balmin and Emiran Curtmola

    Abstract

    Searching and extracting meaningful information from highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions fall at two extremes. Some are based on rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user, and which require up-front data integration. Others employ keyword search queries over relational databases and semistructured data, which are too imprecise to specify the user's intent exactly.

    To address these limitations, we propose an alternative search paradigm in order to derive tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigation along conceptual dimensions that describe the records. Therefore, we cluster documents based on fields and values that contain the query keywords. We build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit.

    We describe WikiAnalytics, a system that facilitates data extraction from the Wikipedia infobox collection. WikiAnalytics provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.
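
    The clustering step described in this abstract (grouping records by the fields in which the query keywords hit) can be illustrated with a minimal sketch. This is not the WikiAnalytics implementation; the record layout, the function name `cluster_by_hit_fields`, and the sample data are all hypothetical, and the sketch covers only single-keyword clustering, not the full navigational lattice.

```python
from collections import defaultdict


def cluster_by_hit_fields(records, keyword):
    """Group record ids by the field in which `keyword` occurs.

    `records` maps a record id to a dict of field -> value. This layout is an
    assumption made for illustration, not the system's actual data model.
    """
    clusters = defaultdict(set)
    needle = keyword.lower()
    for rec_id, fields in records.items():
        for field, value in fields.items():
            # A "hit" here is a case-insensitive substring match.
            if needle in str(value).lower():
                clusters[field].add(rec_id)
    return dict(clusters)


# Hypothetical heterogeneous records: the same keyword hits different fields.
records = {
    "r1": {"name": "Lima", "country": "Peru"},
    "r2": {"title": "Peru travel guide", "author": "A. Writer"},
    "r3": {"name": "Cusco", "country": "Peru"},
}
clusters = cluster_by_hit_fields(records, "peru")
```

    In this toy example the keyword falls into two clusters, one for records where it names a country and one where it appears in a title, which is the kind of ambiguity a user would resolve by navigation.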
  • Search Driven Analysis of Heterogenous XML Data
          Conference on Innovative Data Systems Research. CIDR 2009
          Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, and Fatma Özcan

    Abstract

    Analytical processing on XML repositories is usually enabled by designing complex data transformations that shred the documents into a common data warehousing schema. This can be very time consuming and costly, especially if the underlying XML data has a lot of variety in structure, and only a subset of attributes constitutes meaningful dimensions and facts. Today, there is no tool to explore an XML data set, discover interesting attributes, dimensions and facts, and rapidly prototype an OLAP solution.

    In this paper, we propose a system, called SEDA (Search, Explore, Discover and Analyze), that enables users to start with simple keyword-style querying, and interactively refine the query based on result summaries. SEDA then maps query results onto a set of known, or newly created, facts and dimensions, and derives a star schema and its instantiation to be fed into an off-the-shelf OLAP tool, for further analysis.
  • XTreeNet: Democratic Community Search
          International Conference on Very Large Data Bases. VLDB 2008 Demonstration
          Emiran Curtmola, Alin Deutsch, Dionysios Logothetis, K.K. Ramakrishnan, Divesh Srivastava, and Kenneth Yocum

    Abstract

    We describe XTreeNet, a distributed query dissemination engine which facilitates democratization of publishing and efficient data search among members of online communities with powerful full-text queries. This demonstration shows XTreeNet in full action. XTreeNet serves as a proof of concept for democratic community search by proposing a novel distributed infrastructure in which data resides only with the publishers owning it. Expressive user queries are disseminated to publishers. Given the virtual nature of the global data collection (i.e., the union of all local data published in the community), our infrastructure efficiently locates the publishers that hold documents matching a specified query, processes the complex full-text query at the publisher, and returns all relevant documents to the querier.
  • SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data
          International Conference on Very Large Data Bases. VLDB 2008 Demonstration
          Andrey Balmin, Latha Colby, Emiran Curtmola, Quanzhong Li, Fatma Özcan, Sharath Srinivash, and Zografoula Vagena

    Abstract

    Keyword search in XML repositories is a powerful tool for interactive data exploration. Much work has recently been done on making XML search aware of relationship information embedded in XML document structure, but without a clear winner in all data and query scenarios. Furthermore, due to its imprecise nature, search results cannot easily be analyzed and summarized to gain more insights into the data. We address these shortcomings with SEDA: a system for Search, Exploration, Discovery, and Analysis of XML Data. SEDA is based on a paradigm of search and user interaction to help users start with simple keyword-style querying and perform rich analysis of XML data by leveraging both the content and structure of the data. SEDA is an interactive system that allows the user to refine her query iteratively to explore the XML data and discover interesting relationships.

    SEDA first employs a top-k algorithm to compute the most relevant top-k answers fast, and returns tuples of nodes ranked by relevance. SEDA provides several novel data structures and techniques for efficient top-k computation over graph-structured XML data. SEDA also computes all the contexts in which the query terms are found and all the connection paths that connect the query terms in the XML data. These two summaries enable the user to refine her query by disambiguating the contexts and connections relevant to her query. With the user feedback, the system has enough information to compute all query results, not just the top-k. From the complete results, SEDA automatically deduces a star schema, which is then instantiated with the query results and augmented with additional values required for a well-defined data cube. The tables computed at this step are input into an OLAP engine for further analysis.
  • A Platform for Search in the Big Web 2.0
          SIGMOD 2007 PhD Workshop on Innovative Database Research. IDAR 2007
          Emiran Curtmola

    Abstract

    The recent explosion in the amount and variety of information being generated from many different places, under different types of social interaction between users, has made search a hot topic for many research communities. While traditional web search focused on simple keyword search and on references between pages, getting the right information at the right time is becoming steadily harder, creating a critical need for expressive, efficient, relevant and flexible search tools.

    We study the search in large-scale social systems by capturing logically the natural way people search and discover information: the relevance of keywords relative to the document structure, the importance of references between pages and the associations generated by the online social context. We argue that the key for successful search is to provide a strong theoretical basis to enable the development of theory and practical optimization algorithms. We are the first to show how to transfer the well-established relational world expertise into keyword search. The thesis of this research is to build a prototype based on this formalism and to demonstrate how we can leverage it to address these search challenges.
  • Flexible and Efficient XML Search with Complex Full-Text Predicates
          ACM SIGMOD International Conference on Management of Data. SIGMOD 2006
          Sihem Amer-Yahia, Emiran Curtmola, and Alin Deutsch

    Abstract

    Recently, there has been extensive research that generated a wealth of new XML full-text query languages, ranging from simple Boolean search to combining sophisticated proximity and order predicates on keywords.

    While computing least common ancestors of query terms was proposed for efficient evaluation of conjunctive keyword queries by exploiting the document structure, no such solution was developed to evaluate complex full-text queries. We present efficient evaluation algorithms based on a formalization of full-text XML queries in terms of keyword patterns and an algebra which manipulates pattern matches. Our algebra captures most existing languages and their varying semantics and our algorithms combine relational query evaluation techniques with the exploitation of document structure to process queries with complex full-text predicates.

    We show how scoring can be incorporated into our framework without compromising the algorithms' complexity. Our experiments show that considering element nesting dramatically improves the performance of queries with complex full-text predicates.
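
    The abstract refers to computing least common ancestors of query terms as the earlier technique for conjunctive keyword queries. A minimal sketch of that idea follows, assuming each matching node is identified by a Dewey-style path (a tuple of child positions from the root). That encoding and the function name are assumptions for illustration, and the sketch does not reproduce the paper's algebra or algorithms.

```python
def lowest_common_ancestor(dewey_ids):
    """Return the longest common prefix of the given Dewey paths.

    Each path is a tuple of child positions from the document root, so the
    common prefix identifies the deepest element containing all the nodes.
    """
    prefix = []
    # zip stops at the shortest path, which bounds the depth of any ancestor.
    for components in zip(*dewey_ids):
        if len(set(components)) != 1:
            break
        prefix.append(components[0])
    return tuple(prefix)


# Two keyword matches under the same element at path (1, 2).
ancestor = lowest_common_ancestor([(1, 2, 1), (1, 2, 3, 1)])  # -> (1, 2)
```

    An empty result means the matches share only the document root.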
  • Rewriting Nested XML Queries Using Nested Views
          ACM SIGMOD International Conference on Management of Data. SIGMOD 2006
          Nicola Onose, Alin Deutsch, Yannis Papakonstantinou, and Emiran Curtmola

    Abstract

    We present and analyze an algorithm for equivalent rewriting of XQuery queries using XQuery views, which is complete for a large class of XQueries featuring nested FLWR blocks, XML construction and join equalities by value and identity. These features pose significant challenges which lead to a fundamental extension of prior work on the problems of rewriting conjunctive and tree pattern queries. Our solution exploits the Nested XML Tableaux (NEXT) notation which enables a logical foundation for specifying XQuery semantics. We present a tool which inputs XQuery queries and views and outputs an XQuery rewriting, thus being usable on top of any of the existing XQuery processing engines. Our experimental evaluation shows that the tool scales well for large numbers of views and complex queries.
  • GalaTex: A Conformant Implementation of the XQuery Full-Text Language
          International Workshop on XQuery Implementation, Experience and Perspectives. XIME-P 2005
          Emiran Curtmola, Sihem Amer-Yahia, Philip Brown, and Mary Fernández

    Abstract

    We describe GALATEX, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search capabilities. XQuery Full-Text provides composable full-text search primitives such as simple keyword search, Boolean queries, and keyword-distance predicates. GALATEX is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GALATEX is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation and identify some performance challenges, possible solutions, and their interactions with XQuery implementations.

Selected Posters

  • Querying XML Peers
          Center for Networked Systems. CNS Research Review 2008
          Emiran Curtmola, Alin Deutsch, Yannis Papakonstantinou, K.K. Ramakrishnan, and Divesh Srivastava

    Abstract

    As the web evolves, it is becoming easier to form communities based on shared interests, and to create and publish data on a wide variety of topics. With this "democratization of information creation" comes the natural desire to make one's data accessible for querying within the community and also be able to query the global collection that is the union of all local data collections of others within the community.

    In order to fully deliver on the promise of free data exchange, any community-supporting infrastructure needs to enforce the key requirement of being resistant to censorship by third parties, be they of governmental, corporate, or of other special interest nature. Censorship resistance precludes some obvious approaches that reuse and build on existing centralized technologies, e.g., search engines, hosted online communities, etc.

    We propose a distributed censorship-resistant enabling infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion. The infrastructure prevents third parties from pinpointing which publisher advertises what data (without extensively colluding with or attacking community members).

    Given the virtual nature of the global data collection, we study the challenging problem of efficiently locating publishers in the community that contain data items matching a specified query. We propose a distributed index structure, UQDT, that is organized as a union of Query Dissemination Trees (QDTs), and realized on an overlay (i.e., logical) network infrastructure. Each QDT has data publishers as its leaf nodes, and overlay network nodes as its internal nodes; each internal node routes queries to publishers, based on a summary of the data advertised by publishers in its subtree.

    We experimentally evaluate design tradeoffs, and demonstrate that UQDT can maximize throughput by preventing any overlay network node from becoming a bottleneck.
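
    The routing idea behind a single Query Dissemination Tree, as described in this abstract, can be sketched as follows. This is a toy illustration rather than the UQDT design: the `Node` class, the use of plain term sets as summaries, and the example community are all assumptions, and the sketch ignores distribution, load balancing, and censorship resistance.

```python
class Node:
    """A node in a toy query dissemination tree.

    Leaves stand for publishers; internal nodes stand for overlay nodes whose
    summary is the union of the terms advertised in their subtree.
    """

    def __init__(self, name, children=None, terms=None):
        self.name = name
        self.children = children or []
        self.summary = set(terms or [])
        for child in self.children:
            self.summary |= child.summary

    def is_publisher(self):
        return not self.children


def route(node, query_terms):
    """Return names of publishers whose advertised terms cover the query."""
    if not query_terms <= node.summary:
        return []  # Prune: nothing in this subtree can match.
    if node.is_publisher():
        return [node.name]
    matches = []
    for child in node.children:
        matches.extend(route(child, query_terms))
    return matches


# Hypothetical community with three publishers behind two overlay nodes.
p1 = Node("p1", terms=["xml", "search"])
p2 = Node("p2", terms=["privacy"])
p3 = Node("p3", terms=["xml", "privacy"])
root = Node("root", children=[Node("n1", children=[p1, p2]),
                              Node("n2", children=[p3])])

targets = route(root, {"xml", "privacy"})  # -> ["p3"]
```

    Because an internal summary is a union, it may admit a query that no single publisher below it can answer; the check is repeated at each level, so such queries are pruned further down the tree.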
  • GalaTex: A Conformant Implementation of the XQuery Full-Text Language
          International World Wide Web Conference. WWW 2005
          Emiran Curtmola, Sihem Amer-Yahia, Philip Brown, and Mary Fernández

    Abstract

    We describe GalaTex, the first complete implementation of XQuery Full-Text, a W3C specification that extends XPath 2.0 and XQuery 1.0 with full-text search. XQuery Full-Text provides composable full-text search primitives such as keyword search, Boolean queries, and keyword-distance predicates. GalaTex is intended to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research problems such as scoring full-text query results, optimizing XML queries over both structure and text, and evaluating top-k queries on scored results. GalaTex is an all-XQuery implementation initially focused on completeness and conformance rather than on efficiency. We describe its implementation on top of Galax, a complete XQuery implementation.
  • Implementation and Open Research Issues in XML Full-Text Search
          New York Area DB/IR Day 2005
          Emiran Curtmola, Sihem Amer-Yahia, and Alin Deutsch

    Abstract

    The recent increase in the number of large XML repositories has created the need to search both the structure and the text content of XML documents. While the current XML query languages XPath 2.0 and XQuery 1.0, which are the W3C recommended standards for querying XML documents, operate on structured XML data, they are limited in expressing full-text queries. Recently, the W3C has been working on XQuery Full-Text, a language that extends XPath and XQuery with fully composable full-text search primitives such as phrase matching, Boolean connectives, keyword-distance, stemming and thesauri.

    In this poster, I will describe the data model and the query semantics as well as different query evaluation strategies for XQuery Full-Text. I will also discuss the architecture of GalaTex, the first conformant implementation of XQuery Full-Text, which uses Galax as a complete XQuery processor. GalaTex is initially focused on completeness and conformance rather than on efficiency. However, its main benefit is to serve as a reference implementation for XQuery Full-Text and as a platform for addressing new research ideas in XML full-text search. I will discuss ideas on optimizing XML queries over both structure and text, providing a logical framework for evaluating top-K answers based on score pruning, and full-text query equivalence.

    A demonstration of GalaTex is provided at GALATEX and will also be available along with this poster.

Project Demos

  • XTreeNet - Democratic Community Search (work in progress)

  • SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data

  • GalaTex - XQuery Full-Text extension of XPath and XQuery Languages

  • REFORM - A System for Rewriting XML Nested Queries Using Nested Views