Efficient I/O-Intensive Computation in the Cloud
The public cloud has
been touted as a solution for effectively running arbitrary computation at
scale. In order to assess the viability of the public cloud's ability to run
efficient I/O-intensive computation, and also to assess our upgrades to Themis
at a larger-scale, we measured the performance of Themis on a wide variety of
public cloud VM configurations. We extensively measured the performance of
Themis on Amazon EC2 and Amazon EBS and explored its performance on Google
Compute Engine. As a result of this work, we set three new world records in
high speed sorting: 2014 Daytona GraySort, 2014 Indy CloudSort, and 2014
Daytona CloudSort. This work was published in
SoCC 2015.
Next Generation Cluster Technologies
TritonSort and Themis were
originally designed and evaluated on a 2009-era cluster consisting of disks and
10 Gb/s networking. While these technologies are still commonplace, newer
technologies with higher levels of performance are rapidly appearing in the
data center. Flash in particular appears to be eclipsing disk as a popular
storage media, as is evidenced by the storage offerings within the recent
Amazon EC2 instance types, as well as Google Compute Engine. Consequently, we
upgraded Themis to perform efficiently on newer cluster technologies. These
technologies include traditional SATA flash SSDs, high-performance PCI-Express
flash devices, 40 Gb/s Ethernet, and HPC supercomputers using high performance
interconnects such as Infiniband.
Themis
Themis is a MapReduce
implementation build
around
TritonSort. Themis is I/O-efficient in that achieves the minimum number of I/Os possible
(2) when the amount of data greatly exceeds the amount of physical
memory. Themis has been evaluated on a variety of common MapReduce jobs and
performs at roughly the same record-breaking speed as its predecessor,
TritonSort. Themis was published in
SoCC 2012.
TritonSort
TritonSort is the world's
fastest sorting system. TritonSort achieves record speeds by focusing on
per-disk and per-node efficiency. TritonSort aims to sort data at the speed of
the disks by keeping all disks constantly reading or writing data in large
contiguous chunks. TritonSort set world records in the 2010 and 2011
Sortbenchmark.org competitions. We set
a total of seven world records: 2010 Indy GraySort, 2010 Indy MinuteSort, 2011
Daytona GraySort, 2011 Indy GraySort, 2011 Indy MinuteSort, 2011 Daytona 100TB
JouleSort, 2011 Indy 100TB JouleSort. Two of these are current world records. TritonSort was published in
NSDI 2011.
Other Work
During summer of 2009, I investigated resource imbalance
within the MapReduce framework of
Hadoop.
Goals consisted of analyzing cluster resources during a MapReduce job,
identifying bottlenecks, and classifying various types of jobs according to
these bottlenecks with the hope of being able to utilize cluster resources more
efficiently.