Closer integration of machine learning (ML) with data processing is a booming area in both the data management industry and academia. Almost all ML toolkits assume that the input is a single table, but many datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins to obtain features from all base tables and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this work, we show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can "avoid joins safely."

We identify the core technical issue that could cause accuracy to decrease in some cases and analyze this issue by applying statistical learning theory. Using simulations, we validate our analysis and measure the effects of various properties of normalized data on the accuracy. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins when learning linear models such as Naive Bayes and logistic regression in order to help data scientists exploit this runtime-accuracy tradeoff. Experiments with multiple real normalized datasets show that our rules are able to accurately predict when joins can be avoided safely, and in some cases, this led to significant reductions (over 180x!) in the runtimes of some popular feature selection methods.

Blog post on Hamlet

Hamlet++ (Hamlet for High-Capacity Classifiers)

Linear models are nice but data scientists also love complex non-linear models with high capacity, e.g., decision trees and RBF-SVM, due to their potentially higher accuracy. The Hamlet paper established a dichotomy in the safety of avoiding joins for linear models: using foreign key features as representatives of the features brought in by joins could cause extra overfitting. In this follow-up work, we ask: what happens to this dichotomy with such high capacity models? One might expect that these complex ML models with their infinite VC dimensions would face even worse overfitting. Surprisingly, our results show the exact opposite! Using an extensive empirical and simulation-based analysis, we dissect the behavior of these models over joins. We also take a step towards formally explaining their behavior and identify new questions for ML theoretical research. Oh, and as for deep learning, neural networks also exhibit the same behavior!


Hamlet++ (Hamlet for High-Capacity Classifiers) Hamlet