Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins and then learn on the join output. This strategy of learning after joins reintroduces the redundancy that normalization avoids, which can lead to poorer end-to-end performance and to maintenance overheads due to data duplication.

In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also extend all of our approaches to multi-table joins as well as to Hive.
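
To make the idea of pushing computation through a join concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes logistic regression with 0/1 labels, an entity table `S` of `(label, foreign_key, features)` tuples, and an attribute table `R` mapping each key to its feature list. All names (`grad_materialized`, `grad_factorized`, `S_DEMO`, `R_DEMO`) are hypothetical. The factorized version computes one partial inner product per `R` row and one summed error per key, so the features of `R` are not re-read for every matching row of `S`.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_materialized(S, R, w_s, w_r):
    """Logistic-loss gradient computed row by row over the joined data."""
    g_s, g_r = [0.0] * len(w_s), [0.0] * len(w_r)
    for y, fk, x_s in S:
        x_r = R[fk]  # join lookup: R's features are repeated per S row
        err = _sigmoid(_dot(w_s, x_s) + _dot(w_r, x_r)) - y
        for j, v in enumerate(x_s):
            g_s[j] += err * v
        for j, v in enumerate(x_r):
            g_r[j] += err * v
    return g_s, g_r

def grad_factorized(S, R, w_s, w_r):
    """Same gradient, without touching R's features once per S row."""
    # Pass 1 over R: one partial inner product per R row.
    partial = {rid: _dot(w_r, x_r) for rid, x_r in R.items()}
    g_s = [0.0] * len(w_s)
    err_sum = {rid: 0.0 for rid in R}
    # Pass over S: reuse the partial products, group errors by foreign key.
    for y, fk, x_s in S:
        err = _sigmoid(_dot(w_s, x_s) + partial[fk]) - y
        for j, v in enumerate(x_s):
            g_s[j] += err * v
        err_sum[fk] += err
    # Pass 2 over R: scale each R row by its summed error.
    g_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, v in enumerate(x_r):
            g_r[j] += err_sum[rid] * v
    return g_s, g_r

# Tiny illustrative data set (made up for this sketch).
S_DEMO = [(1, "a", [1.0, 2.0]), (0, "a", [0.5, -1.0]), (1, "b", [0.0, 1.0])]
R_DEMO = {"a": [1.0, 0.0, 2.0], "b": [0.5, 1.0, -1.0]}
```

Both functions return the same gradient up to floating-point rounding. Whether the factorized form is actually faster depends on the data dimensions, which is why the text above argues for a cost-based choice.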


Santoku is an R-based toolkit that applies and extends the idea of factorized learning to ML models over categorical features, such as Naive Bayes and Tree-Augmented Naive Bayes (TAN). It also introduces the factorized scoring paradigm, which evaluates the accuracy of the learned ML models directly over the normalized data. Since the speedups of factorized learning and scoring depend on the data dimensions, Santoku uses a simple cost model and a cost-based optimizer to decide automatically whether to learn and score over a single table (possibly after denormalization) or to apply factorized learning and scoring (possibly after normalization). Santoku can also exploit database dependencies to provide automatic insights that could help analysts with exploratory feature selection. It is usable as a library in R, which is a popular environment for advanced analytics. Santoku was demonstrated at VLDB in 2015.
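
As a rough illustration of how factorized learning can apply to Naive Bayes, the sketch below collects the per-class feature-value counts that Naive Bayes training needs, once over the joined rows and once over the normalized tables. It is written in Python for illustration only and does not reflect Santoku's R interface. The table layout and all names (`nb_counts_joined`, `nb_counts_factorized`, `S_CAT`, `R_CAT`) are assumptions made for this example.

```python
from collections import Counter

def nb_counts_joined(S, R):
    """Count (table, feature index, value, class) over the joined rows."""
    counts = Counter()
    for y, fk, x_s in S:
        for j, v in enumerate(x_s):
            counts[("S", j, v, y)] += 1
        for j, v in enumerate(R[fk]):  # R's features are re-read per S row
            counts[("R", j, v, y)] += 1
    return counts

def nb_counts_factorized(S, R):
    """Same counts, reading each R row once per (key, class) pair."""
    counts = Counter()
    key_class = Counter()
    for y, fk, x_s in S:
        for j, v in enumerate(x_s):
            counts[("S", j, v, y)] += 1
        key_class[(fk, y)] += 1  # how often each key co-occurs with each class
    for (fk, y), n in key_class.items():
        for j, v in enumerate(R[fk]):
            counts[("R", j, v, y)] += n  # add the grouped count in one step
    return counts

# Tiny illustrative tables with categorical features (made up for this sketch).
S_CAT = [("yes", 1, ["red"]), ("no", 1, ["blue"]), ("yes", 2, ["red"]), ("yes", 1, ["blue"])]
R_CAT = {1: ["small", "eu"], 2: ["large", "us"]}
```

The two functions produce identical counts. The factorized variant does less work when many rows of `S` share the same foreign key, which is the kind of data-dependent tradeoff a cost-based optimizer would weigh.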


  • Paper in ACM SIGMOD 2015
  • Orion on GitHub (source code for data synthesizer, Orion on PostgreSQL, and Orion on Hive)