Advanced analytics is a booming area in the data management industry and a hot research topic. Almost all toolkits that implement machine learning (ML) algorithms assume that the input is a single table, but most relational datasets are not stored as single tables due to normalization. Thus, analysts often join tables to obtain a denormalized table. Also, analysts typically ignore any functional dependencies among features because ML toolkits do not support them. In both cases, computation is wasted on redundant data.

In order to mitigate the above issues, we are building Santoku, a prototype toolkit to help analysts apply, and improve the performance of, ML directly over normalized data. Santoku applies and extends the idea of factorized learning from our prior work on Project Orion (in a nutshell, the idea is to "chop up" and push ML computations down through joins to the base tables) to several ML models. It also introduces the factorized scoring paradigm to evaluate the accuracy of the learned ML models directly over the normalized data.
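To make the "push computations through the join" idea concrete, here is a minimal sketch of factorized learning for batch gradient descent on linear regression, written in plain NumPy. The schema, table sizes, and variable names are illustrative assumptions, not Santoku's actual code: an entity table S holds some features and a foreign key into an attribute table R, and the key step is computing the partial inner products over R once per R tuple instead of once per joined tuple.

```python
import numpy as np

# Toy schema (assumed for illustration): S(sid, fk, xS...) joins
# R(rid, xR...) on S.fk = R.rid. The target y lives with S.
rng = np.random.default_rng(0)
nS, nR, dS, dR = 1000, 50, 3, 4
XS = rng.normal(size=(nS, dS))       # features stored in S
XR = rng.normal(size=(nR, dR))       # features stored in R
fk = rng.integers(0, nR, size=nS)    # foreign key S.fk -> R.rid
y = rng.normal(size=nS)

wS, wR = np.zeros(dS), np.zeros(dR)
lr = 0.01
for _ in range(200):
    # Factorized step: XR @ wR is computed once per R tuple (nR rows),
    # then looked up per S tuple via the foreign key -- the join is
    # never materialized.
    pR = XR @ wR
    pred = XS @ wS + pR[fk]
    err = pred - y
    # Gradient for the S-side weights: one pass over S.
    gS = XS.T @ err / nS
    # Gradient for the R-side weights: group the errors by foreign
    # key first, then one small matmul over R.
    eR = np.bincount(fk, weights=err, minlength=nR)
    gR = XR.T @ eR / nS
    wS -= lr * gS
    wR -= lr * gR
```

The per-iteration work on the R-side features drops from O(|S| * dR) to O(|R| * dR + |S|), which is where the speedups come from when R's features would otherwise be replicated many times in the denormalized table.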

Since the speedups of factorized learning/scoring depend on the data dimensions, Santoku uses a simple cost model and cost-based optimizer to automatically decide whether to use the single table for learning/scoring (possibly after denormalization) or apply factorized learning/scoring (possibly after normalization). Santoku can also exploit database dependencies to provide automatic insights that could help analysts with exploratory feature selection. It is usable as a library in R, which is a popular environment for advanced analytics.
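The trade-off the optimizer faces can be sketched with a simple linear cost model. The formulas below are a hypothetical stand-in for Santoku's actual cost model, counting per-iteration feature-vector work for each plan:

```python
def choose_plan(nS, nR, dS, dR):
    """Pick a learning plan with a toy linear cost model (illustrative
    only, not Santoku's actual formulas). nS/nR are table sizes, dS/dR
    are the feature counts stored in each base table."""
    # Materialized plan: every joined tuple carries all dS + dR features.
    cost_materialized = nS * (dS + dR)
    # Factorized plan: R-side work happens once per R tuple, plus one
    # foreign-key lookup per S tuple.
    cost_factorized = nS * dS + nR * dR + nS
    return "factorized" if cost_factorized < cost_materialized else "materialized"
```

For example, with a large S joining a small, wide R (say nS=10**6, nR=100, dS=5, dR=50), the model picks the factorized plan; when the tables are similar in size, materializing the join can win, which is why a cost-based decision is needed at all.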

The benefits of Santoku in improving ML performance and helping with feature selection were demonstrated at the VLDB 2015 conference.