To Join or Not to Join?
Thinking Twice about Joins before Feature Selection
Feature Selection / Machine Learning
Key-Foreign Key Joins / Database Dependencies
Statistical Learning Theory
Closer integration of machine learning (ML) with data processing is a
booming area in both the data management industry and academia.
Almost all ML toolkits assume that the input is a single table, but many
datasets are not stored as single tables due to normalization.
Thus, analysts often perform key-foreign key joins to obtain features
from all base tables and apply a feature selection method, either
explicitly or implicitly, with the aim of improving accuracy.
In this work, we show that the features brought in by such
joins can often be ignored without affecting ML accuracy significantly,
i.e., we can "avoid joins safely."
We identify the core technical issue that could cause accuracy to decrease
in some cases and analyze this issue by applying statistical learning theory.
Using simulations, we validate our analysis and measure the effects of
various properties of normalized data on accuracy.
We apply our analysis to design easy-to-understand decision rules that
predict when it is safe to avoid joins, helping analysts exploit the
tradeoff between runtime and accuracy.
Experiments with multiple real normalized datasets show that our rules
accurately predict when joins can be avoided safely, and that avoiding
them in these cases can significantly reduce the runtime of popular
feature selection methods.
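To make the setup concrete, the following is a minimal sketch on synthetic normalized data (in Python with pandas and scikit-learn; it is not the code from this work, and all table, column, and variable names are hypothetical). It contrasts training with the dimension-table features brought in by a key-foreign key join against training without the join, using the foreign key itself as a categorical feature.

    # Minimal sketch (not the paper's code): "with join" vs. "avoid join"
    # on synthetic normalized data. All names here are hypothetical.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Dimension table R: each foreign-key value maps to a feature vector X_R.
    n_fk, d_r = 50, 4
    dim = pd.DataFrame(rng.normal(size=(n_fk, d_r)),
                       columns=[f"xr{j}" for j in range(d_r)])
    dim["fk"] = np.arange(n_fk)

    # Fact table S: its own features X_S plus a foreign key into R.
    n, d_s = 5000, 3
    fact = pd.DataFrame(rng.normal(size=(n, d_s)),
                        columns=[f"xs{j}" for j in range(d_s)])
    fact["fk"] = rng.integers(0, n_fk, size=n)

    # The target depends on X_S and, through the foreign key, on X_R.
    joined = fact.merge(dim, on="fk", how="left")
    w_s = rng.normal(size=d_s)
    w_r = rng.normal(size=d_r)
    logits = (joined[[f"xs{j}" for j in range(d_s)]] @ w_s
              + joined[[f"xr{j}" for j in range(d_r)]] @ w_r)
    y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

    # Option 1 (perform the join): use X_S plus the joined-in features X_R.
    X_join = joined[[f"xs{j}" for j in range(d_s)]
                    + [f"xr{j}" for j in range(d_r)]]

    # Option 2 (avoid the join): use X_S plus the foreign key itself,
    # one-hot encoded. Since the FK functionally determines X_R, the FK
    # carries all the information that X_R could provide.
    X_nojoin = pd.get_dummies(fact, columns=["fk"])

    for name, X in [("with join", X_join), ("avoid join", X_nojoin)]:
        Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
        acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
        print(f"{name}: test accuracy = {acc:.3f}")

Because the foreign key functionally determines the dimension table's features, the no-join model can capture the same signal, which is the intuition behind avoiding joins safely; the decision rules above aim to predict when doing so will not hurt accuracy.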
Blog post on Hamlet
Last Updated: Apr 2016