Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and an affiliate member of the AI Group and CNS. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, an area that is increasingly called advanced analytics or data science. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. He is a recipient of the Best Paper Award at ACM SIGMOD 2014, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.

Curriculum Vitae     |     Research Blog     |     On GitHub     |     On Twitter

Recent News

New! A short paper on SpeakQL, our new interface and system to bring SQL into the speech-first era of computing, has been accepted to the HILDA Workshop at SIGMOD'17! Just speak SQL!
New! Complex non-linear classifiers overfit more than linear models, right? Think again, as the Hamlet drama intensifies! Preprint of our Hamlet++ paper is up on arXiv; code and data available here.
New! A short paper on enabling fine-grained pricing for ML in cloud data markets has been accepted to the DEEM Workshop at SIGMOD'17! More DEEM news: I am also giving a keynote talk at this interesting and timely workshop.
New! The webpages for my new courses, CSE 190 on DBMS internals and CSE 290 on data science, are up!
• A paper on bolt-on differential privacy for SGD has been accepted to SIGMOD'17! Differential privacy considered harmful no more for efficiency! More SIGMOD news: I am co-presenting a tutorial on data management challenges in ML.
• Linear algebra meets relational algebra as factorized ML comes of age with Morpheus! Red pill or blue pill?


My current research focuses on the foundations of advanced data analytics systems, especially devising data management-inspired abstractions, systems, frameworks, and algorithms to make the end-to-end process of building and using machine learning algorithms for data analytics easier (improving the productivity of data scientists and software engineers) and faster (improving runtime performance and introducing accuracy trade-offs). Thus, the key themes of my research are usability, developability, performance, and scalability. I enjoy working on problems that are motivated by real applications and are formally grounded. I also enjoy insightful conversations with practitioners on the frontlines of data analytics.


Building an ML model is seldom a one-shot slam dunk; it is usually an iterative process. To make this process of "model selection" easier and faster, we repurpose classical database ideas and envision a new class of advanced analytics systems that we call Model Selection Management Systems (MSMS).
To join or not to join? That is the question. In this project, we connect statistical learning theory and relational joins to show why, and how, we can often avoid entire input tables when learning over normalized data, improving runtime performance without reducing accuracy significantly.
In this project, we extend our paradigm of factorized learning to several ML models in the popular R environment and also introduce factorized scoring. We devise a cost-based optimizer to pick the fastest approach and also help analysts compare features from multiple tables.
In this project, we make it easier to apply a class of machine learning models over normalized datasets, which require joins during feature engineering. We devise novel techniques to push machine learning computations down through joins without sacrificing accuracy and study the trade-offs involved in improving performance.
In this project, we formulate a framework of declarative domain-specific language operations in the R environment for the black art of exploratory feature selection in advanced analytics, based on our conversations with analysts in many enterprise settings. We design a novel cost-based optimizer that improves runtime performance.
In this project, we build a unified system architecture that implements several machine learning techniques by integrating incremental gradient descent into an RDBMS. This work has been incorporated into products from Oracle, EMC, and Cloudera. We also contributed code to the open-source library MADlib.
In this project, we integrate the management of uncertain content, specifically Optical Character Recognition (OCR) data, with structured data management in an RDBMS. We use a probabilistic graphical model and devise a novel approximation framework to trade off between accuracy and query runtime performance.

Recent Publications   (Full List   |   Google Scholar)

  • Stop That Join! Discarding Dimension Tables when Learning High Capacity Classifiers
    Vraj Shah, Arun Kumar, and Xiaojin Zhu
    Under submission [PDF on arXiv]
  • SpeakQL: Towards Speech-driven Multi-modal Querying
    Dharmil Chandarana, Vraj Shah, Arun Kumar, and Lawrence Saul
    ACM SIGMOD 2017 HILDA Workshop (To Appear) [PDF]
  • Model-based Pricing: Do Not Pay for More than What You Learn!
    Lingjiao Chen, Paraschos Koutris, and Arun Kumar
    ACM SIGMOD 2017 DEEM Workshop (To Appear) [PDF]
  • Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics
    Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton
    ACM SIGMOD 2017 (To Appear) [PDF on arXiv]
  • Towards Linear Algebra over Normalized Data
    Lingjiao Chen, Arun Kumar, Jeffrey Naughton, and Jignesh Patel
    Under submission [PDF on arXiv]
  • Learning Over Joins
    PhD Dissertation. UW-Madison 2016 [PDF] [Talk at UCSD]
  • To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 [Paper] [Tech Report] [Code and Data]
  • Model Selection Management Systems: The Next Frontier of Advanced Analytics
    Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD Record Dec 2015 (Vision Track) [Paper] [Survey]
  • Learning Generalized Linear Models Over Normalized Data
    Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD 2015 [Paper] [Code]


CSE 190: Topics in Database System Implementation (Spring 2017)
CSE 290: Seminar on Advanced Data Science (Spring 2017)
CSE 291: Topics in Advanced Analytics (Winter 2017)
CS 564: Database Management Systems: Design and Implementation (Fall 2015 at UW-Madison)


Lingjiao Chen (PhD, UW-Madison; co-advised by Paris Koutris)
Vraj Shah (MS, UCSD)
Anthony Thomas (MS, UCSD)
Mingyang Wang (MS, UCSD)

Fengan Li (MS, UW-Madison, 2016; First employment: Google)
Zhiwei Fan (BS, UW-Madison, 2016; Onward to MS, UW-Madison)
Fujie Zhan (BS, UW-Madison, 2016; First employment: Epic Systems)
Mona Jalal (MS, UW-Madison, 2015)
Boqun Yan (BS, UW-Madison, 2015; First employment: Google)


Program Committee:
VLDB 2018
ACM SIGMOD 2017 (Research Track, Demonstrations, and Student Research Competition)
ACM SIGMOD 2017 Workshop on Data Management for End-to-End Machine Learning (DEEM)
USENIX HotCloud 2016
ACM SIGMOD 2016 Undergraduate Research Poster Competition

ACM Transactions on Database Systems (TODS) 2017, 2015
IEEE Transactions on Knowledge and Data Engineering (TKDE) 2014