Learning Generalized Linear Models Over Normalized Data
Large-scale Machine Learning
Enterprise data analytics is a booming area in the data management industry.
Many companies are racing to develop toolkits that closely integrate statistical
and machine learning techniques with data management systems.
Almost all such toolkits assume that the input to a learning algorithm
is a single table.
However, most relational datasets are not stored as single tables due to normalization.
Thus, analysts often perform key-foreign key joins before learning on the join output.
This strategy of learning after joins reintroduces the redundancy that
normalization avoids, which can lead to poorer end-to-end performance and
to maintenance overheads due to data duplication.
In this work, we take a step towards enabling and optimizing
learning over joins in an RDBMS setting for
a common class of machine learning techniques: generalized
linear models solved using gradient descent algorithms.
We present alternative approaches to learn over a join that are
easy to implement over existing RDBMSs.
We introduce a new approach named factorized
learning that pushes ML computations through joins and avoids
redundancy in both I/O and computations.
We study the tradeoff space for all our approaches both analytically and empirically.
Our results show that factorized learning is often substantially faster
than the alternatives,
but is not always the fastest, necessitating a cost-based approach to choosing among them.
We also extend all of our approaches to multi-table joins
as well as to Hive.
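
To illustrate the idea of pushing computation through a join, here is a minimal in-memory sketch of one gradient computation for a generalized linear model over a two-table key-foreign key join. It is not the implementation described in this work: the logistic loss, the dictionary-based tables, and all names (S, R, scalar_grad, gradient_materialized, gradient_factorized) are assumptions made for illustration. The sketch contrasts learning on the materialized join output with a factorized variant that computes each referenced tuple's partial inner product once and reuses it.

```python
import math

# Hypothetical layout (an assumption for this sketch):
#   S: list of (y, x_s, fk) rows, where y is a label in {-1, +1},
#      x_s is a feature list, and fk is a foreign key into R.
#   R: dict mapping a primary key to that tuple's feature list x_r.

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def scalar_grad(inner, y):
    # Derivative of the logistic loss log(1 + exp(-y * inner)) w.r.t. inner.
    return -y / (1.0 + math.exp(y * inner))

def gradient_materialized(S, R, w_s, w_r):
    # Learn-after-join: build each joined feature vector explicitly,
    # so every R tuple's features are repeated once per matching S row.
    w = w_s + w_r
    grad = [0.0] * len(w)
    for y, x_s, fk in S:
        x = x_s + R[fk]
        g = scalar_grad(dot(w, x), y)
        for j, v in enumerate(x):
            grad[j] += g * v
    return grad

def gradient_factorized(S, R, w_s, w_r):
    # Step 1: one partial inner product per R tuple (not per S row).
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}
    # Step 2: scan S, reuse the partial products, and accumulate the
    # scalar gradient factors grouped by foreign key.
    grad_s = [0.0] * len(w_s)
    g_sum = {rid: 0.0 for rid in R}
    for y, x_s, fk in S:
        g = scalar_grad(dot(w_s, x_s) + partial[fk], y)
        for j, v in enumerate(x_s):
            grad_s[j] += g * v
        g_sum[fk] += g
    # Step 3: one pass over R to finish the R-side gradient.
    grad_r = [0.0] * len(w_r)
    for rid, x_r in R.items():
        for j, v in enumerate(x_r):
            grad_r[j] += g_sum[rid] * v
    return grad_s + grad_r

# Small made-up example: five S rows referencing two R tuples.
R = {1: [1.0, 2.0], 2: [0.5, -1.0]}
S = [(1, [0.2], 1), (-1, [1.5], 1), (1, [-0.3], 2),
     (-1, [0.7], 2), (1, [1.0], 1)]
print(gradient_materialized(S, R, [0.1], [0.2, -0.3]))
print(gradient_factorized(S, R, [0.1], [0.2, -0.3]))
```

Both functions return the same gradient up to floating-point rounding. The factorized variant touches each R tuple's features a constant number of times rather than once per matching S row, which is where the savings would come from when R is wide and many S rows share a foreign key. When few S rows share a key, the extra passes and bookkeeping may not pay off, consistent with the observation above that no single approach is always fastest.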
Last Updated: June 2015