About me
Emiran Curtmola is currently a Teradata Database Query Optimizer Engineer.
He earned his Ph.D. in Computer Science from UC San Diego.
There, he was part of the UCSD Database
group and was affiliated with the CNS center, where he collaborated with
Alin Deutsch and Yannis Papakonstantinou.
Research Interests
My research lies primarily in foundational aspects of Databases at
the intersection with information retrieval and distributed
information systems. My current focus is on query
optimization, unstructured data management, search (XML
full-text, algorithms and systems), XML technologies,
web-scale data integration and exchange, Semantic Web, distributed and
P2P computing, and data privacy.
Professional Service
· Program committee member:
EDBT 2010,
DEXA (2009, 2010, 2011),
DTA (2009, 2010),
ADC (2011)
· External conference reviewer:
VLDB'08, VLDB PhD Workshop 2009
· Teaching Assistant at UC San Diego:
Database System Applications (CSE132B)
- Fall 2003, Spring 2008, Spring 2009
Server-side Web Applications (CSE135)
- Spring 2009
· Teaching Assistant at Polytechnic University of Bucharest, Romania:
Data Structures and Algorithm Analysis, Fundamentals of
Computer Graphics, Switching Theory and Logical Design,
Numerical Calculus
Internships
· IBM Almaden Research Center, USA, 2007-2008
Mentor: Fatma Özcan, Andrey Balmin
· AT&T Labs Research,
USA, 2004-2006
Mentor: Sihem Amer-Yahia, Divesh Srivastava
· Infineon Technologies AG, Germany, 2002-2003
Mentor: Raik Brinkmann, Hermann Ilmberger
Background
· Ph.D. from University of California San Diego
Computer Science and Engineering Department
Thesis: Democratic Community-based Search with XML Full-Text Queries [Abstract]
Publication Abstract
As the web evolves, it is becoming easier to form online communities
based on shared interests, and to create and publish data on a wide
variety of topics.
With this democratization of information creation, it is natural to
query, in an ad-hoc and expressive fashion, the global
collection that is the union of all local data collections of others
within the community. In order to publish and locate documents of
interest while fully delivering on the promise of free data exchange,
any community-supporting infrastructure needs to enforce the key
requirement to preserve privacy of the association of content
providers with potential sensitive information.
This privacy-preserving publishing requirement prevents censorship,
harassment, or discrimination of users by third parties.
It also precludes some obvious approaches that reuse and build on
existing centralized technologies including search engines and
hosted online communities.
This dissertation facilitates democratization of data publishing and
efficient search with powerful full-text queries over the community
global collection by means of a novel distributed framework that
disseminates queries in online communities.
We address two challenging issues that arise in this context: the
design of distributed access methods to publishers and the evaluation
of expressive queries (i.e., XML full-text) locally at each
publisher.
First, given the virtual nature of the global data collection, we
study the problem of efficiently discovering publishers in the
community that contain documents matching a user query. We call such
peers relevant publishers.
We propose a novel distributed infrastructure in which data resides
only with the publishers owning it. The infrastructure disseminates
user queries to publishers, who answer them at their own discretion,
under data-location anonymity constraints. That is, the query
forwarding infrastructure prevents leaking information about which
publishers are capable of answering a certain query.
Second, once queries reach relevant publishers, we study how they
efficiently process the incoming queries over their local
repositories.
Given that the commonly used data model for information exchange on the
Web is semi-structured (e.g., XML), we propose algorithms for
the evaluation and optimization of expressive XML queries that
integrate structured and full-text search, including the W3C XQuery
Full-Text standard.
· M.S. from University of California San Diego
Computer Science and Engineering Department
· B.S. from Polytechnic University of Bucharest, Romania
Computer Science and Engineering Department
Selected Talks
Papers in Conferences and Workshops
-
WikiAnalytics: Disambiguation of Keyword Search Results on Highly Heterogeneous Structured Data [Abstract]
In International Workshop on the Web and Databases.
WebDB 2010
Andrey Balmin
and
Emiran Curtmola
WikiAnalytics
IBM Research Report RJ10466, May 2010
Andrey Balmin
and
Emiran Curtmola
Publication Abstract
Wikipedia infoboxes are an example of a seemingly structured,
yet extraordinarily heterogeneous dataset,
where any given record has only a tiny fraction of all possible fields.
Such data cannot be queried using traditional means without a
massive a priori integration effort,
since even for a simple request the result values span many record types
and fields.
On the other hand, the solutions based on the keyword search
are too imprecise to exactly capture the user's intent.
To address these limitations, we propose
the WikiAnalytics system, which utilizes
a novel search
paradigm in order to derive tables of precise and complete results from
Wikipedia infobox records.
The user starts with a keyword search that finds a superset of the result records,
and then browses the clusters of the records deciding which are and are not relevant.
WikiAnalytics uses three categories of clustering features based on record
types, fields, and values that matched query keywords, respectively.
Since the system cannot predict which combination of features will be important
to the user, it efficiently generates all possible clusters of records by all sets of features.
We utilize a novel data structure, universal
navigational lattice (UNL), that compactly encodes all possible clusters.
WikiAnalytics provides a
dynamic and intuitive interface that lets the user explore the
UNL and construct homogeneous structured tables, which can
be further queried and aggregated using
the conventional tools.
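As a rough illustration of what "all possible clusters of records by all sets of features" means, the following minimal sketch groups keyword hits by every non-empty subset of three features. It is a brute-force enumeration written for this page, not the compact UNL structure from the paper; the sample records, the `HITS` list, and the `build_clusters` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical keyword hits for the query "Paris"; each hit is described by
# the record type, the field that matched, and the matched value.
HITS = [
    {"id": 1, "type": "city", "field": "name", "value": "Paris"},
    {"id": 2, "type": "city", "field": "twin_city", "value": "Paris"},
    {"id": 3, "type": "person", "field": "birth_place", "value": "Paris"},
    {"id": 4, "type": "person", "field": "name", "value": "Paris Hilton"},
]
FEATURES = ("type", "field", "value")


def build_clusters(hits, features=FEATURES):
    """Group hits by every non-empty subset of features.

    Returns {feature_subset: {feature_values: sorted record ids}}.
    This enumerates the clusters explicitly; it makes no attempt to
    encode them compactly.
    """
    lattice = {}
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            groups = defaultdict(list)
            for hit in hits:
                key = tuple(hit[f] for f in subset)
                groups[key].append(hit["id"])
            lattice[subset] = {k: sorted(v) for k, v in groups.items()}
    return lattice


lattice = build_clusters(HITS)
```

In this sketch a user who only cares about cities would pick the cluster `lattice[("type",)][("city",)]` and could then refine it by field or value.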
-
Load-Balanced Query Dissemination in Privacy-Aware Online Communities [Abstract]
In ACM SIGMOD International Conference on Management of Data.
SIGMOD 2010
Emiran Curtmola,
Alin Deutsch,
K.K. Ramakrishnan,
and
Divesh Srivastava
Censorship-resistant Publishing
Technical Report CS2010-0956, UC San Diego, March 2010
Emiran Curtmola,
Alin Deutsch,
K.K. Ramakrishnan,
and
Divesh Srivastava
Publication Abstract
As the web evolves, it is becoming easier to form communities based on
shared interests, and to create and publish data on a wide variety of
topics. With this democratization of information creation comes the
natural desire to make one's data accessible for querying within the
community and also be able to query the global collection that is the
union of all local data collections of others within the community. In
order to fully deliver on the promise of free data exchange, any
community-supporting infrastructure needs to enforce the key
requirement to preserve privacy of the association of content
providers with potential sensitive published information.
This privacy preserving publishing requirement prevents censorship,
harassment, or discrimination of users by third parties. It also
precludes some obvious approaches that reuse and build on existing
centralized technologies, e.g., search engines, hosted online
communities, etc.
We propose a novel privacy-preserving distributed
infrastructure in which data resides only with the publishers owning
it. The infrastructure disseminates user queries to publishers, who
answer them at their own discretion. The infrastructure enforces a
publisher k-anonymity guarantee, which
prevents leakage of information about which publishers are capable of
answering a certain query.
Given the virtual nature of the global data collection, we study the
challenging problem of efficiently locating publishers in the community
that contain data items matching a specified query. We propose a
distributed index structure, UQDT, that is organized as a union of
Query Dissemination Trees (QDTs), and realized on an overlay (i.e.,
logical) network infrastructure. Each QDT has data publishers as its
leaf nodes, and overlay network nodes as its internal nodes; each
internal node routes queries to publishers, based on a summary of the
data advertised by publishers in its subtrees. We experimentally
evaluate design tradeoffs, and demonstrate that UQDT can maximize
throughput by preventing any overlay network node from becoming
a bottleneck.
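The routing idea behind a single Query Dissemination Tree can be sketched as follows. This is a simplified illustration under several assumptions, not the UQDT implementation: summaries are plain sets of advertised terms, queries are conjunctive sets of terms, and the k-anonymity guarantee and load balancing across trees are left out entirely. The `Publisher` and `OverlayNode` classes are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Publisher:
    """Leaf node: a data publisher and the terms it advertises."""
    name: str
    terms: set

    def summary(self):
        return set(self.terms)

    def route(self, query_terms):
        # A publisher receives the query only if it advertises every term.
        return [self.name] if query_terms <= self.terms else []


@dataclass
class OverlayNode:
    """Internal node: routes queries using a summary of its subtree."""
    children: list = field(default_factory=list)

    def summary(self):
        # Union of everything advertised by publishers below this node.
        merged = set()
        for child in self.children:
            merged |= child.summary()
        return merged

    def route(self, query_terms):
        # Forward the query only into subtrees whose summary covers it.
        reached = []
        for child in self.children:
            if query_terms <= child.summary():
                reached.extend(child.route(query_terms))
        return reached


tree = OverlayNode([
    OverlayNode([Publisher("p1", {"xml", "search"}), Publisher("p2", {"privacy"})]),
    OverlayNode([Publisher("p3", {"xml", "privacy"})]),
])
```

Because a summary is a union, an internal node may forward a query into a subtree where no single publisher matches; the leaves filter those cases out. Here `tree.route({"xml", "privacy"})` reaches only `p3`.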
-
WikiAnalytics: Ad-hoc Querying of Highly Heterogeneous Structured Data [Abstract]
In International Conference on Data Engineering.
ICDE 2010 Demonstration
Andrey Balmin
and
Emiran Curtmola
Publication Abstract
Searching and extracting meaningful information out of highly
heterogeneous datasets is a hot topic that has received a lot of
attention. However, the existing solutions are based either on
rigid, complex query languages (e.g., SQL, XQuery/XPath), which are
hard to use without full schema knowledge or an expert
user and which require up-front data integration, or, at the other
extreme, on keyword search queries over
relational databases and semistructured data,
which are too imprecise to specify exactly the user's
intent.
To address these limitations, we propose an alternative search
paradigm in order to derive tables of precise and complete results from a very
sparse set of heterogeneous records. Our approach allows users to
disambiguate search results by navigation along conceptual dimensions
that describe the records. Therefore, we cluster documents based on
fields and values that contain the query keywords. We build a universal
navigational lattice (UNL) over all such discovered clusters.
Conceptually, the UNL encodes all possible ways to group the
documents in the data corpus based on where the keywords hit.
We describe WikiAnalytics, a system that facilitates data extraction
from the Wikipedia infobox collection. WikiAnalytics provides a
dynamic and intuitive interface that lets the average user explore the
search results and construct homogeneous structured tables, which can
be further queried and mashed up (e.g., filtered and aggregated) using
the conventional tools.
-
Search Driven Analysis of Heterogeneous XML Data [Abstract]
In Conference on
Innovative Data Systems Research. CIDR 2009
Andrey Balmin,
Latha Colby,
Emiran Curtmola,
Quanzhong Li, and
Fatma Özcan
Publication Abstract
Analytical processing on XML repositories is usually enabled by
designing complex data transformations that shred the documents
into a common data warehousing schema. This can be very time consuming
and costly, especially if the underlying XML data has a
lot of variety in structure, and only a subset of attributes constitutes
meaningful dimensions and facts. Today, there is no tool to explore
an XML data set, discover interesting attributes, dimensions and
facts, and rapidly prototype an OLAP solution.
In this paper, we propose a system, called SEDA (Search,
Explore, Discover and Analyze), that enables
users to start with simple keyword-style querying, and interactively
refine the query based on result summaries. SEDA then maps query
results onto a set of known, or newly created, facts and dimensions,
and derives a star schema and its instantiation to be fed into an
off-the-shelf OLAP tool, for further analysis.
-
XTreeNet: Democratic Community Search [Abstract]
In International
Conference on Very Large Data Bases. VLDB 2008 Demonstration
Emiran Curtmola,
Alin Deutsch,
Dionysios Logothetis,
K.K. Ramakrishnan,
Divesh Srivastava, and
Kenneth Yocum
Publication Abstract
We describe XTreeNet, a distributed query dissemination
engine which facilitates democratization of publishing and
efficient data search among members of online communities
with powerful full-text queries. This demonstration shows
XTreeNet in full action. XTreeNet serves as a proof of
concept for democratic community search by proposing a
distributed novel infrastructure in which data resides only
with the publishers owning it. Expressive user queries are
disseminated to publishers. Given the virtual nature of the
global data collection (i.e., the union of all local data published in
the community), our infrastructure efficiently locates the
publishers that contain documents matching a specified
query, processes the complex full-text query at the
publisher, and returns all relevant documents to the querier.
-
SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data [Abstract]
In International Conference on Very Large Data Bases. VLDB 2008 Demonstration
Andrey Balmin,
Latha Colby,
Emiran Curtmola,
Quanzhong Li,
Fatma Özcan,
Sharath Srinivash, and
Zografoula Vagena
Publication Abstract
Keyword search in XML repositories is a powerful tool for interactive
data exploration. Much work has recently been done on making
XML search aware of relationship information embedded in
XML document structure, but without a clear winner in all data and
query scenarios. Furthermore, due to its imprecise nature, search
results cannot easily be analyzed and summarized to gain more insights
into the data. We address these shortcomings with SEDA: a
system for Search, Exploration, Discovery, and Analysis of XML
Data. SEDA is based on a paradigm of search and user interaction
to help users start with simple keyword-style querying and perform
rich analysis of XML data by leveraging both the content and structure
of the data. SEDA is an interactive system that allows the user
to refine her query iteratively to explore the XML data and discover
interesting relationships.
SEDA first employs a top-k algorithm to compute the most relevant
top-k answers fast, and returns tuples of nodes ranked by relevance.
SEDA provides several novel data structures and techniques
for efficient top-k computation over graph-structured XML data.
SEDA also computes all the contexts in which the query terms are
found and all the connection paths that connect the query terms in
the XML data. These two summaries enable the user to refine her
query by disambiguating the contexts and connections relevant to
her query. With the user feedback, the system has enough information
to compute all query results, not just the top-k. From the complete
results, SEDA automatically deduces a star schema, which
is then instantiated with the query results and augmented with additional
values required for a well-defined data cube. The tables
computed at this step are input into an OLAP engine for further
analysis.
-
A Platform for Search in the Big Web 2.0 [Abstract]
In SIGMOD 2007 PhD Workshop on Innovative Database Research.
IDAR 2007
Emiran Curtmola
Publication Abstract
The recent explosion of the amount of different types of information
being generated from so many different places
under different social types of interactions between users has
made search a hot topic for many research communities.
While traditional web search focused on simple keyword
search and on references between pages, nowadays getting
the right information at the right time is becoming ever
harder, posing a critical need for expressive, efficient, relevant,
and flexible search tools.
We study the search in large-scale social systems by capturing
logically the natural way people search and discover
information: the relevance of keywords relative to the document
structure, the importance of references between pages and
the associations generated by the online social context.
We argue that the key for successful search is to provide a
strong theoretical basis to enable the development of theory
and practical optimization algorithms. We are the first to
show how to transfer the well-established relational world
expertise into keyword search. The thesis of this research is
to build a prototype based on this formalism and to demonstrate how we
can leverage it to address these search challenges.
-
Flexible and Efficient XML Search with Complex Full-Text Predicates [Abstract]
In ACM SIGMOD International Conference on Management of Data.
SIGMOD 2006
Sihem Amer-Yahia,
Emiran Curtmola, and
Alin Deutsch
Publication Abstract
Recently, there has been extensive research that generated a wealth of
new XML full-text query languages, ranging from simple Boolean search
to combining sophisticated proximity and order predicates on
keywords.
While computing least common ancestors of query terms was
proposed for efficient evaluation of conjunctive keyword queries by
exploiting the document structure, no such solution was developed to
evaluate complex full-text queries. We present efficient evaluation
algorithms based on a formalization of full-text XML queries in terms of keyword
patterns and an algebra which manipulates pattern matches. Our algebra
captures most existing languages and their varying semantics and our
algorithms combine relational query evaluation techniques with the
exploitation of document structure to process queries with complex
full-text predicates.
We show how scoring can be incorporated into our framework without
compromising the algorithms' complexity. Our experiments show that
considering element nesting dramatically improves the performance of
queries with complex full-text predicates.
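To give a flavor of treating full-text predicates relationally, the sketch below represents pattern matches as tuples of keyword positions and expresses proximity and order predicates as selections over them. It is an illustration only, not the algebra or the algorithms from the paper: it works on a flat token list, ignores element nesting and scoring, and all function names are hypothetical.

```python
def positions(tokens, keyword):
    """Return the token offsets at which keyword occurs."""
    return [i for i, tok in enumerate(tokens) if tok == keyword]


def join_matches(tokens, kw_a, kw_b):
    """Cross product of the two keywords' positions: one tuple per match."""
    return [(a, b) for a in positions(tokens, kw_a) for b in positions(tokens, kw_b)]


def within_distance(matches, max_gap):
    """Selection: keep matches with at most max_gap tokens between keywords."""
    return [(a, b) for a, b in matches if abs(a - b) - 1 <= max_gap]


def ordered(matches):
    """Selection: keep matches where the first keyword precedes the second."""
    return [(a, b) for a, b in matches if a < b]


tokens = "xml search with full text search over xml data".split()
matches = join_matches(tokens, "xml", "search")
near = within_distance(matches, max_gap=1)
near_and_ordered = ordered(near)
```

Since each predicate is just a filter over the match relation, predicates compose in any order, which is the property that lets relational evaluation techniques apply.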
-
Rewriting Nested XML Queries Using Nested Views [Abstract]
In ACM SIGMOD International Conference on Management of Data.
SIGMOD 2006
Nicola Onose,
Alin Deutsch,
Yannis Papakonstantinou,
and Emiran Curtmola
Publication Abstract
We present and analyze an algorithm for equivalent rewriting of
XQuery queries using XQuery views, which is complete for a large
class of XQueries featuring nested FLWR blocks, XML construction
and join equalities by value and identity. These features pose
significant challenges which lead to a fundamental extension of prior
work on the problems of rewriting conjunctive and tree pattern
queries. Our solution exploits the Nested XML Tableaux (NEXT)
notation which enables a logical foundation for specifying XQuery
semantics. We present a tool which inputs XQuery queries and
views and outputs an XQuery rewriting, thus being usable on top
of any of the existing XQuery processing engines. Our experimental
evaluation shows that the tool scales well for large numbers of
views and complex queries.
-
GalaTex: A Conformant Implementation of the XQuery Full-Text Language [Abstract]
In International
Workshop on XQuery Implementation, Experience and Perspectives.
XIME-P 2005
Emiran Curtmola,
Sihem Amer-Yahia,
Philip Brown,
and
Mary Fernández
Publication Abstract
We describe GALATEX, the first complete implementation of
XQuery Full-Text, a W3C specification that extends XPath 2.0 and
XQuery 1.0 with full-text search capabilities. XQuery Full-Text
provides composable full-text search primitives such as simple keyword
search, Boolean queries, and keyword-distance predicates.
GALATEX is intended to serve as a reference implementation for
XQuery Full-Text and as a platform for addressing new research
problems such as scoring full-text query results, optimizing XML
queries over both structure and text, and evaluating top-k queries
on scored results. GALATEX is an all-XQuery implementation initially
focused on completeness and conformance rather than on efficiency.
We describe its implementation on top of Galax, a complete
XQuery implementation, and identify some performance challenges,
possible solutions, and their interactions with XQuery implementations.
Selected Posters
-
Querying XML Peers [Abstract]
Center for Networked
Systems. In CNS Research
Review 2008
Emiran Curtmola,
Alin Deutsch,
Yannis Papakonstantinou,
K.K. Ramakrishnan,
and
Divesh Srivastava
Publication Abstract
As the web evolves, it is becoming easier to form communities
based on shared interests, and to create and publish data on a wide
variety of topics. With this "democratization of information creation"
comes the natural desire to make one's data accessible for
querying within the community and also be able to query the global
collection that is the union of all local data collections of others
within the community.
In order to fully deliver on the promise of free data exchange,
any community-supporting infrastructure needs to enforce the key
requirement of being resistant to censorship by third parties, be they
of governmental, corporate, or of other special interest nature. Censorship
resistance precludes some obvious approaches that reuse
and build on existing centralized technologies, e.g., search engines,
hosted online communities, etc.
We propose a distributed censorship-resistant enabling infrastructure
in which data resides only with the publishers owning it. The
infrastructure disseminates user queries to publishers, who answer
them at their own discretion. The infrastructure prevents third
parties from pinpointing which publisher advertises what data (without
extensively colluding with or attacking community members).
Given the virtual nature of the global data collection, we study
the challenging problem of efficiently locating publishers in the
community that contain data items matching a specified query. We
propose a distributed index structure, UQDT, that is organized as
a union of Query Dissemination Trees (QDTs), and realized on an
overlay (i.e., logical) network infrastructure. Each QDT has data
publishers as its leaf nodes, and overlay network nodes as its internal
nodes; each internal node routes queries to publishers, based on
a summary of the data advertised by publishers in its subtree.
We experimentally evaluate design tradeoffs, and demonstrate
that UQDT can maximize throughput by preventing any overlay
network node from becoming a bottleneck.
-
GalaTex: A Conformant Implementation of the XQuery Full-Text Language [Abstract]
In International World
Wide Web Conference. WWW 2005
Emiran Curtmola,
Sihem Amer-Yahia,
Philip Brown,
and
Mary Fernández
Publication Abstract
We describe GalaTex, the first complete implementation of
XQuery Full-Text, a W3C specification that extends XPath
2.0 and XQuery 1.0 with full-text search. XQuery Full-Text
provides composable full-text search primitives such as keyword
search, Boolean queries, and keyword-distance predicates.
GalaTex is intended to serve as a reference implementation
for XQuery Full-Text and as a platform for
addressing new research problems such as scoring full-text
query results, optimizing XML queries over both structure
and text, and evaluating top-k queries on scored results.
GalaTex is an all-XQuery implementation initially focused
on completeness and conformance rather than on efficiency.
We describe its implementation on top of Galax, a complete
XQuery implementation.
-
Implementation and Open Research Issues in XML Full-Text Search [Abstract]
In New York Area DB/IR Day 2005
Emiran Curtmola,
Sihem Amer-Yahia,
and
Alin Deutsch
Publication Abstract
The recent increase in large XML repositories has created
the need to search both the structure and the text content of XML
documents. While the current XML query languages XPath 2.0 and XQuery 1.0,
which are the W3C recommended standards for querying XML documents, operate on
structured XML data, they are limited in expressing full-text queries. Recently,
the W3C has been working on XQuery Full-Text, a language that extends XPath and
XQuery with fully composable full-text search primitives such as phrase matching,
Boolean connectives, keyword-distance, stemming and thesauri.
In this poster, I will describe the data model and the query semantics as well as
different query evaluation strategies for XQuery Full-Text. I will also discuss
the architecture of GalaTex, the first conformant implementation of XQuery Full-Text,
which uses Galax as a complete XQuery processor. GalaTex is initially focused on
completeness and conformance rather than on efficiency. However, its main benefit
is to serve as a reference implementation for XQuery Full-Text and as a platform
for addressing new research ideas in XML full-text search. I will discuss ideas on
optimizing XML queries over both structure and text, providing a logical framework
for evaluating top-K answers based on score pruning, and full-text query equivalence.
A demonstration of GalaTex is provided at GALATEX and
will also be available along with this poster.
Project Demos
- XTreeNet - Democratic Community Search (work in progress)
- SEDA: A System for Search, Exploration, Discovery and Analysis of XML Data
- GalaTex - XQuery Full-Text extension of XPath and XQuery Languages
- REFORM - A System for Rewriting XML Nested Queries Using Nested Views