Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He is a member of the Database Lab and an affiliate member of the AI Group and CNS. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, an area that is increasingly called advanced analytics or data science. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. He is a recipient of the Best Paper Award at ACM SIGMOD 2014, the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS, and a 2016 Google Faculty Research Award.

Curriculum Vitae     |     Research Blog     |     On GitHub     |     On Twitter

Recent News

• New! Visited and gave a talk at UMichigan (thanks, Barzan!); excited to see all the work on data systems + ML + HCI!
• Excited and honored to be a recipient of an inaugural Faculty Research Award by Opera Solutions!
• Visited and gave talks at Google Irvine and Amazon Berlin; attended the MSR Faculty Summit on the Edge of AI.
• The talk video and slide deck of our SIGMOD tutorial have been posted online!


My current research focuses on the foundations of advanced data analytics systems, especially devising data management-inspired abstractions, systems, frameworks, and algorithms to make the end-to-end process of building and using machine learning algorithms for data analytics easier (improving the productivity of data scientists and software engineers) and faster (improving runtime performance and introducing accuracy trade-offs). Thus, the key themes of my research are usability, developability, performance, and scalability. I enjoy working on problems that are motivated by real applications and are formally grounded. I also enjoy insightful conversations with practitioners on the frontlines of data analytics.


In this project, we generalize and automate the idea of factorized machine learning to any ML algorithm expressible in linear algebra. This represents a major step towards unifying relational and linear algebras for integrating feature engineering and machine learning over structured data.
Building an ML model is seldom a one-shot slam dunk; it is usually an iterative process. To make this process of "model selection" easier and faster, we repurpose classical database ideas and envision a new class of advanced analytics systems that we call Model Selection Management Systems (MSMS).
To join or not to join? That is the question. In this project, we connect statistical learning theory and relational joins to show why, and how, we can often avoid entire input tables when learning over normalized data, improving runtime performance without significantly reducing accuracy.
In this project, we make it easier to apply a class of machine learning models over normalized datasets, which require joins during feature engineering. We devise novel techniques to push ML computations down through joins without sacrificing accuracy and study the trade-offs involved in improving performance.

Past Projects


Recent Publications   (Full List   |   Google Scholar)

  • Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?
    Vraj Shah, Arun Kumar, and Xiaojin Zhu
    Under submission [PDF]
  • Towards Linear Algebra over Normalized Data
    Lingjiao Chen, Arun Kumar, Jeffrey Naughton, and Jignesh Patel
    VLDB 2017 (To appear) [Paper] [TechReport] [Code and Data]
  • Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics
    Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton
    ACM SIGMOD 2017 [Paper] [TechReport]
  • Data Management in Machine Learning: Challenges, Techniques, and Systems
    Arun Kumar, Matthias Boehm, and Jun Yang
    ACM SIGMOD 2017 Tutorial [Paper] [Talk at SIGMOD] [Slides]
  • SpeakQL: Towards Speech-driven Multi-modal Querying
    Dharmil Chandarana, Vraj Shah, Arun Kumar, and Lawrence Saul
    ACM SIGMOD 2017 HILDA Workshop [Paper]
  • Model-based Pricing: Do Not Pay for More than What You Learn!
    Lingjiao Chen, Paraschos Koutris, and Arun Kumar
    ACM SIGMOD 2017 DEEM Workshop [Paper]
  • Learning Over Joins
    PhD Dissertation. UW-Madison 2016 [PDF] [Talk at UCSD]
  • To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 [Paper] [TechReport] [Code and Data]
  • Model Selection Management Systems: The Next Frontier of Advanced Analytics
    Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD Record Dec 2015 (Vision Track) [Paper] [Survey]
  • Learning Generalized Linear Models Over Normalized Data
    Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD 2015 [Paper] [Code]


CSE 290: Seminar on Advanced Data Science (Fall 2017)
CSE 190: Topics in Database System Implementation (Spring 2017)
CSE 290: Seminar on Advanced Data Science (Spring 2017)
CSE 291: Topics in Advanced Analytics (Winter 2017)
CS 564: Database Management Systems: Design and Implementation (Fall 2015 at UW-Madison)


Lingjiao Chen (PhD, UW-Madison; co-advised by Paris Koutris)
Supun Nakandala (PhD, UCSD)
Vraj Shah (MS, UCSD)
Anthony Thomas (MS, UCSD)

Mingyang Wang (MS, UCSD, 2017)
Fengan Li (MS, UW-Madison, 2016; First employment: Google)
Zhiwei Fan (BS, UW-Madison, 2016; Onward to MS, UW-Madison)
Fujie Zhan (BS, UW-Madison, 2016; First employment: Epic Systems)
Mona Jalal (MS, UW-Madison, 2015)
Boqun Yan (BS, UW-Madison, 2015; First employment: Google)


Program Committee:
VLDB 2018
ACM SIGMOD 2017 (Research Track, Demonstrations, and Student Research Competition)
ACM SIGMOD 2017 Workshop on Data Management for End-to-End Machine Learning (DEEM)
USENIX HotCloud 2016
ACM SIGMOD 2016 Undergraduate Research Poster Competition

ACM Transactions on Database Systems (TODS) 2017, 2015
IEEE Transactions on Knowledge and Data Engineering (TKDE) 2014