To Join or Not to Join?
Thinking Twice about Joins before Feature Selection
Feature Selection / Machine Learning
Key-Foreign Key Joins / Database Dependencies
Statistical Learning Theory
Closer integration of machine learning (ML) with data processing is a
booming area in both the data management industry and academia.
Almost all ML toolkits assume that the input is a single table, but many
datasets are not stored as single tables due to normalization.
Thus, analysts often perform key-foreign key joins to obtain features
from all base tables and apply a feature selection method, either
explicitly or implicitly, with the aim of improving accuracy.
In this work, we show that the features brought in by such
joins can often be ignored without affecting ML accuracy significantly,
i.e., we can "avoid joins safely."
We identify the core technical issue that could cause accuracy to decrease
in some cases and analyze this issue by applying statistical learning theory.
Using simulations, we validate our analysis and measure how various
properties of normalized data affect accuracy.
We apply our analysis to design easy-to-understand decision rules
that predict when it is safe to avoid joins for linear models such
as Naive Bayes and logistic regression, helping data scientists
exploit this runtime-accuracy tradeoff.
Experiments with multiple real normalized datasets show that our rules
accurately predict when joins can be avoided safely; in some cases,
doing so reduced the runtimes of popular feature selection methods
by over 180x.
Blog post on Hamlet
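To make the tradeoff concrete, here is a minimal illustrative sketch (synthetic data, not the paper's code or datasets): it compares a logistic regression trained on the features a key-foreign key join would bring in against one trained on only the foreign-key column itself, one-hot encoded. All table and column names (`R`, `S`, `fk`, `xr1`, `xr2`) are invented for illustration.

```python
# Sketch: learning over a key-foreign-key join vs. "avoiding the join"
# by using the foreign-key column itself as a (one-hot) feature.
# Synthetic data only; assumes pandas and scikit-learn are installed.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Attribute table R: one row per foreign-key value, holding the
# features a join would bring in.
n_fk, n_rows = 20, 2000
R = pd.DataFrame({
    "fk": np.arange(n_fk),
    "xr1": rng.normal(size=n_fk),
    "xr2": rng.normal(size=n_fk),
})

# Entity table S references R via the foreign key; the join
# denormalizes R's features onto S's rows.
S = pd.DataFrame({"fk": rng.integers(0, n_fk, size=n_rows)})
joined = S.merge(R, on="fk")

# Target depends on R's features only through the foreign key.
logits = 2.0 * joined["xr1"] - 1.5 * joined["xr2"]
y = (logits + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)

# Feature set 1: the features obtained via the join.
X_join = joined[["xr1", "xr2"]].to_numpy()
# Feature set 2: join avoided -- one-hot encode the foreign key.
# The FK functionally determines xr1 and xr2, so it subsumes them.
X_fk = pd.get_dummies(joined["fk"], dtype=float).to_numpy()

def acc(X):
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

acc_join, acc_fk = acc(X_join), acc(X_fk)
print(f"accuracy with join:    {acc_join:.3f}")
print(f"accuracy without join: {acc_fk:.3f}")
```

The foreign key functionally determines the joined features, so the FK-only representation loses no information; the risk the paper analyzes is that its larger dimensionality can add extra capacity and hence extra overfitting.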
Linear models are nice, but data scientists also love complex non-linear models with
high capacity, e.g., decision trees and RBF-SVMs, due to their potentially higher accuracy.
The Hamlet paper established a dichotomy in the safety of avoiding joins for linear
models: using foreign key features as representatives of the features brought in by
joins is sometimes safe, but in other cases it causes extra overfitting.
In this follow-up work, we ask: what happens to this dichotomy with such high capacity models?
One might expect that these complex ML models with their infinite VC dimensions
would face even worse overfitting. Surprisingly, our results
show the exact opposite! Using an extensive empirical and simulation-based
analysis, we dissect the behavior of these models over joins. We also take a step
towards formally explaining their behavior and identify new questions for ML theoretical
research. Oh, and as for deep learning, neural networks also exhibit the same behavior!
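The high-capacity setting can be probed with the same kind of synthetic sketch (again illustrative, not the paper's experiments): a fully grown decision tree trained on only the one-hot foreign key versus on the features a join would supply. All names and the data-generating process are invented for illustration.

```python
# Sketch: a high-capacity model (decision tree) with the join's
# features vs. with only the one-hot foreign key. Synthetic data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n_fk, n_rows = 20, 2000

# Per-foreign-key feature vectors (what joining the attribute
# table would attach to each row).
xr = rng.normal(size=(n_fk, 2))
fk = rng.integers(0, n_fk, size=n_rows)
y = ((2.0 * xr[fk, 0] - 1.5 * xr[fk, 1]
      + rng.normal(scale=0.5, size=n_rows)) > 0).astype(int)

X_join = xr[fk]           # features obtained via the join
X_fk = np.eye(n_fk)[fk]   # join avoided: one-hot foreign key

def acc(X):
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.5, random_state=1)
    return DecisionTreeClassifier(random_state=1).fit(Xtr, ytr).score(Xte, yte)

acc_join, acc_fk = acc(X_join), acc(X_fk)
print(f"tree with join:    {acc_join:.3f}")
print(f"tree without join: {acc_fk:.3f}")
```

Because a fully grown tree can isolate each foreign-key value in its own leaf and predict the majority label there, the FK-only representation tends to hold up well for such models on this kind of data, in the spirit of the follow-up paper's finding.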