Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and an affiliate member of the AI Group. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, an area that is increasingly called advanced analytics or data science. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. A paper he co-authored was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS.

Curriculum Vitae     |     Research Blog     |     On GitHub     |     On Twitter

Recent News

• New! Excited and honored to be a recipient of a Google Faculty Research Award! Thank you Google!
• New! A paper on our framework to generalize factorized ML is up on arXiv. Codenamed Morpheus, it integrates that good-old workhorse of relational algebra, joins, with linear algebra to accelerate ML over multi-table data. Red pill or blue pill?
• A new post on my research blog on the interesting three-years-and-counting story of factorized ML. From Orion to Morpheus: the problem, the ideas, the papers, the people, and the lessons.
• Gave a talk at the CIDR 2017 Gong Show on Cerebro, a new system to dramatically simplify the process of building and using deep neural networks for relational data analytics.


My current research focuses on the foundations of advanced data analytics systems, especially devising data management-inspired abstractions, systems, frameworks, and algorithms to make the end-to-end process of building and using machine learning algorithms for data analytics easier (improving the productivity of data scientists and developers) and faster (improving runtime performance and introducing accuracy trade-offs). I enjoy working on problems that are motivated by real applications and are formally grounded.


Building an ML model is seldom a one-shot slam dunk; it is usually an iterative process. To make this process of "model selection" easier and faster, we repurpose classical database ideas and envision a new class of advanced analytics systems that we call Model Selection Management Systems (MSMS).
To join or not to join? That is the question. In this project, we connect statistical learning theory and relational joins to show why, and how, we can often avoid entire input tables when learning over normalized data, improving runtime performance without reducing accuracy significantly.
In this project, we extend our paradigm of factorized learning to several ML models in the popular R environment and also introduce factorized scoring. We devise a cost-based optimizer to pick the fastest approach and also help analysts with comparing features from multiple tables.
In this project, we make it easier to apply a class of machine learning models over normalized datasets, which require joins during feature engineering. We devise novel techniques to push machine learning computations down through joins without sacrificing accuracy and study the trade-offs involved in improving performance.
In this project, we formulate a declarative domain-specific language framework in the R environment for the black art of exploratory feature selection in advanced analytics, based on our conversations with analysts in many enterprise settings. We design a novel cost-based optimizer that improves runtime performance.
In this project, we build a unified system architecture that implements several machine learning techniques by integrating incremental gradient descent into an RDBMS. This work has been incorporated into products from Oracle, EMC, and Cloudera. We also contributed code to the open-source library MADlib.
In this project, we integrate the management of uncertain content, specifically Optical Character Recognition (OCR) data, with structured data management in an RDBMS. We use a probabilistic graphical model and devise a novel approximation framework to trade off between accuracy and query runtime performance.
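
As a rough illustration of the idea of pushing ML computations down through joins, the sketch below computes the inner products w · x needed by a linear model in two ways: over the materialized join, and in a factorized way that computes each dimension-table partial product once and reuses it. This is a minimal sketch and not the code from any of the projects above; the table layout, the names S, R, w_s, and w_r, and the toy data are all hypothetical.

```python
# Illustrative sketch only: table contents and all names are hypothetical.

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def scores_materialized(S, R, w_s, w_r):
    """Join first: build each joined feature vector, then compute w . x."""
    w = list(w_s) + list(w_r)
    out = []
    for x_s, fk in S:
        joined = list(x_s) + list(R[fk])  # materialized joined row
        out.append(dot(w, joined))
    return out

def scores_factorized(S, R, w_s, w_r):
    """Push the computation through the join: one partial product per R tuple."""
    partial = {rid: dot(w_r, x_r) for rid, x_r in R.items()}
    return [dot(w_s, x_s) + partial[fk] for x_s, fk in S]

# S holds (features, foreign key into R); R maps key -> features.
S = [([1.0, 2.0], "a"), ([0.5, 1.0], "b"), ([3.0, 0.0], "a")]
R = {"a": [1.0, 0.0, 2.0], "b": [0.0, 1.0, 1.0]}
w_s = [0.5, 1.0]
w_r = [1.0, 2.0, 0.5]

# Both approaches give the same scores; the factorized one avoids
# recomputing the R-side product for repeated foreign keys.
assert scores_factorized(S, R, w_s, w_r) == scores_materialized(S, R, w_s, w_r)
```

When many rows of S share the same foreign key, the factorized version does the R-side work once per R tuple instead of once per joined row, which is where the runtime savings in this toy setting come from.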

Selected Publications   (Full List   |   Google Scholar)

  • Learning Over Joins
    PhD Dissertation [PDF]
  • To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 [Paper] [Tech Report] [Code and Data]
  • Model Selection Management Systems: The Next Frontier of Advanced Analytics
    Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD Record Dec 2015 (Vision Track) [Paper] [Survey]
  • Learning Generalized Linear Models Over Normalized Data
    Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD 2015 [Paper] [Code]
  • Materialization Optimizations for Feature Selection Workloads
    Ce Zhang, Arun Kumar, and Christopher Ré
    ACM SIGMOD 2014 [Paper]
  • Towards a Unified Architecture for in-RDBMS Analytics
    Xixuan Feng*, Arun Kumar*, Benjamin Recht, and Christopher Ré
    ACM SIGMOD 2012 [Paper] [Tech Report] [Code and Data]
  • Probabilistic Management of OCR Data using an RDBMS
    Arun Kumar and Christopher Ré
    VLDB 2012 [Paper] [Tech Report] [Code and Data]


CSE 190: Topics in Database System Implementation (Spring 2017)
CSE 290: Seminar on Advanced Data Science (Spring 2017)
CSE 291: Topics in Advanced Analytics (Winter 2017)
CS 564: Database Management Systems: Design and Implementation (Fall 2015 at UW-Madison)


Lingjiao Chen (PhD, UW-Madison)
Vraj Shah (MS, UCSD)

Fengan Li (MS, UW-Madison, 2016; First employment: Google)
Zhiwei Fan (BS, UW-Madison, 2016; Onward to MS, UW-Madison)
Fujie Zhan (BS, UW-Madison, 2016; First employment: Epic Systems)
Mona Jalal (MS, UW-Madison, 2015)
Boqun Yan (BS, UW-Madison, 2015; First employment: Google)


Program Committee:
VLDB 2018
ACM SIGMOD 2017 (Research Track, Demonstrations, and Student Research Competition)
ACM SIGMOD 2017 Workshop on Data Management for End-to-End Machine Learning (DEEM)
USENIX HotCloud 2016
ACM SIGMOD 2016 Undergraduate Research Poster Competition

ACM Transactions on Database Systems (TODS) 2017, 2015
IEEE Transactions on Knowledge and Data Engineering (TKDE) 2014