Optimizing Machine Learning over Normalized Data
Machine Learning in R
Advanced analytics is a booming area in the data management industry and a hot research topic. Almost
all toolkits that implement machine learning (ML) algorithms assume that the input is a single table,
but most relational datasets are not stored as single tables due to normalization. Thus, analysts often
join tables to obtain a denormalized table. Also, analysts typically ignore any functional dependencies
among features because ML toolkits do not support them. In both cases, time is wasted in learning over
data with redundancy.
In order to mitigate the above issues, we are building Santoku, a prototype toolkit to help
analysts apply, and improve the performance of, ML directly over normalized data.
Santoku applies and extends the idea of factorized learning from our prior work on
Project Orion (in a nutshell, the idea is to "chop up" and
push ML computations down through joins to the base tables) to several ML models.
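To make the "push computations through joins" idea concrete, here is a minimal sketch (in Python, with hypothetical toy data; this is not Santoku's actual API) of a factorized aggregation: a SUM over the joined table is rewritten as a multiplicity-weighted sum over the much smaller base table, so the join never has to be materialized. Many ML computations (e.g., the inner products in linear model gradients) reduce to such sums.

```python
from collections import Counter

# Base tables: S holds entities with a foreign key into attribute table R.
S = [  # (entity_id, fk)
    (1, "a"), (2, "a"), (3, "b"), (4, "a"),
]
R = {  # fk -> feature value
    "a": 10.0,
    "b": 7.0,
}

# Naive approach: materialize the join, then aggregate.
# The feature value from R is repeated once per matching row of S.
joined = [R[fk] for (_, fk) in S]
naive_sum = sum(joined)

# Factorized approach: count each foreign key's multiplicity in S, then
# take a weighted sum over R -- the same result, without the join.
counts = Counter(fk for (_, fk) in S)
factorized_sum = sum(val * counts[fk] for fk, val in R.items())

assert naive_sum == factorized_sum  # 10 + 10 + 7 + 10 = 37.0
```

The savings grow with the redundancy in the join: when many rows of S share the same attribute row of R, the factorized plan reads each feature value once instead of once per joining tuple.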
It also introduces the factorized scoring paradigm to evaluate the accuracy of the learned
ML models directly over the normalized data.
Since the speedups of factorized learning/scoring depend on the data dimensions, Santoku uses a simple
cost model and a cost-based optimizer to automatically decide whether to use the single table for
learning/scoring (possibly after denormalization) or to apply factorized learning/scoring over the base
tables (possibly after normalization).
Santoku can also exploit database dependencies to provide automatic insights that could help analysts
with exploratory feature selection.
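The cost-based choice between the single-table and factorized plans can be illustrated with a toy heuristic (a hypothetical sketch, not Santoku's actual cost model): compare the amount of data touched per pass over the denormalized table against one pass over each base table.

```python
def choose_plan(n_s, d_s, n_r, d_r):
    """Toy plan chooser. n_s, d_s: rows/features of entity table S (which
    has a foreign key into R); n_r, d_r: rows/features of attribute table R."""
    # Denormalized: every row of S carries a redundant copy of R's features.
    denormalized_cost = n_s * (d_s + d_r)
    # Factorized: each base table is scanned once, with no redundancy.
    factorized_cost = n_s * d_s + n_r * d_r
    return "factorized" if factorized_cost < denormalized_cost else "denormalized"

# Many entities share few attribute rows -> factorization wins.
print(choose_plan(n_s=1_000_000, d_s=5, n_r=100, d_r=40))  # factorized

# R is as large as S -> little redundancy, the single table is fine.
print(choose_plan(n_s=1_000, d_s=5, n_r=1_000, d_r=2))     # denormalized
```

A real optimizer would also account for join costs, memory, and the number of passes the ML algorithm makes, but the core trade-off is this redundancy ratio between the joined and base representations.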
It is usable as a library in R, which is a popular environment for advanced analytics.
The benefits of Santoku in improving ML performance and helping with feature selection were
demonstrated at the VLDB 2015 conference.
Last Updated: Nov 2015