Abhishek Kumar
I am a graduate student at UC San Diego, pursuing a Master's degree in Computer Science. My main research interests are in machine learning, data mining and analytics.

I have worked with Prof. Charles Elkan on novel multilabel learning approaches (probabilistic classifier chains, beam search algorithms, neural network models). I have also worked with Prof. Wendy Chapman at the Division of Biomedical Informatics (DBMI) on developing TextVect, a tool for processing clinical text documents. With the NLP group at DBMI, I helped analyze text classification methods for the database of Genotypes and Phenotypes (dbGaP) and developed a novel lexical taxonomy ontology for recreational drugs by crawling the Internet to identify new terms. For more information, see my resume and selected publications below.

During the summer of 2012, I worked with the Spam and Abuse team at Google Inc., Mountain View, CA, developing algorithms and systems for detecting misuse of Google’s services. After I graduate in March 2013, I plan to move to Mountain View, CA to pursue interesting applications of machine learning algorithms at Google.
Selected Publications and Conference Proceedings
Feature Engineering for Classification of Clinical Text.
Abhishek Kumar
Technical report, UC San Diego pdf
Submitted in support of candidature for the Master of Science degree in Computer Science.
Assigning labels to clinical text documents is challenging because it requires sophisticated feature engineering, and practically important because of the wide adoption of electronic health record (EHR) systems. In this work, we introduce TextVect, a high-throughput, modularized tool that facilitates feature engineering of unstructured clinical text. Empirical evaluation of the various feature representation choices on benchmark clinical text datasets suggests that the term-frequency (tf) and binary encoding methods are best suited for document-level classification. Reducing the number of features through feature selection is also helpful, and the BestFirst method outperforms other popular techniques. The most helpful features are those belonging to controlled vocabularies, the UMLS Metathesaurus in particular. For datasets with multiple labels, we empirically evaluate the performance of the state-of-the-art probabilistic classifier chains (PCC) method. Results indicate that the PCC method outperforms the binary relevance method, where a separate classifier is trained for each label. We discuss these results and demonstrate that TextVect is a valuable tool for feature engineering when applied to classification of unstructured clinical text.
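To illustrate the two encodings named in the abstract above, here is a minimal Python sketch of term-frequency and binary document vectors. The whitespace tokenizer, the toy vocabulary, and all function names are assumptions made for this example; they are not TextVect's actual interface.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    # A real clinical-text pipeline would do considerably more.
    return text.lower().split()


def tf_vector(tokens, vocabulary):
    # Term-frequency encoding: how many times each vocabulary term occurs.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def binary_vector(tokens, vocabulary):
    # Binary encoding: 1 if the term occurs at all, 0 otherwise.
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


# Toy document and vocabulary, invented for illustration only.
doc = "chest pain and no chest trauma"
vocabulary = ["chest", "pain", "fever", "trauma"]

tokens = tokenize(doc)
tf = tf_vector(tokens, vocabulary)
binary = binary_vector(tokens, vocabulary)
```

The two vectors differ only where a term repeats in the document ("chest" here), which is the distinction the evaluation compares.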
Neural Network Models for Multilabel Learning.
Abhishek Kumar, Aditya Menon, Charles Elkan
Under Review, ECML/PKDD 2013
Multilabel learning is an extension of standard binary classification where the goal is to predict a set of labels (we call an individual label a tag) for each input example. The recent probabilistic classifier chain (PCC) method learns a series of probabilistic models that capture tag correlations. In this paper, we show how the PCC model may be viewed as a neural network with connections between output nodes. We then explore the benefits of using a shared hidden layer in the neural network, instead of connections between output nodes. This brings advantages that include tractable test-time inference and removing the need to select a fixed tag ordering.
Learning and Inference in Probabilistic Classifier Chains with Beam Search.
Abhishek Kumar*, Shankar Vembu*, Aditya Menon, Charles Elkan
In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 7523, Springer Berlin Heidelberg, pp 665-680. pdf
In this paper, we show how to use the classical technique of beam search for multilabel learning (MLL). A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time exponential in the number of tags. Also, PCC accuracy is sensitive to the ordering of the tags while training. In this work, we show how to use beam search to make inference tractable, and how to integrate beam search with training to determine a suitable tag ordering.
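The inference idea in the abstract above can be sketched as follows. A classifier chain factorizes the joint label probability as a product of conditionals P(y_j | x, y_1..y_{j-1}), so exact inference would enumerate all 2^K label vectors; beam search keeps only the `beam_width` most probable partial vectors at each step. This is a minimal sketch, not the paper's implementation: `cond_prob` is a hypothetical callable standing in for the trained per-tag models, and the toy probabilities are made up.

```python
import math


def beam_search(cond_prob, num_tags, beam_width):
    """Approximately maximize prod_j P(y_j | x, y_1..y_{j-1}).

    cond_prob(j, prefix) is a hypothetical callable that returns
    P(y_j = 1 | x, prefix) for the example x being labeled.
    Returns (label tuple, log-probability) of the best vector found.
    """
    beam = [((), 0.0)]  # (partial label tuple, log-probability)
    for j in range(num_tags):
        candidates = []
        for prefix, logp in beam:
            p_on = cond_prob(j, prefix)
            # Extend each partial vector with y_j = 1 and y_j = 0.
            for value, p in ((1, p_on), (0, 1.0 - p_on)):
                if p > 0.0:
                    candidates.append((prefix + (value,), logp + math.log(p)))
        # Keep only the beam_width most probable partial vectors.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


def toy_cond_prob(j, prefix):
    # Made-up chain: tag 0 is on with probability 0.6, and each later
    # tag tends to copy the value of the tag before it.
    if j == 0:
        return 0.6
    return 0.9 if prefix[-1] == 1 else 0.2


labels, logp = beam_search(toy_cond_prob, num_tags=3, beam_width=2)
```

With `beam_width=1` this reduces to greedy decoding along the chain, and a beam of width 2^K recovers exact inference, so the width trades accuracy against time.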
TxtVect: A Tool for Extracting Features from Clinical Documents.
Abhishek Kumar, Wendy Chapman
Poster presentation at the American Medical Informatics Association (AMIA) Annual Symposium 2012. pdf (Poster)
We present TxtVect, a tool for extracting features from clinical documents. It allows for segmentation of documents into paragraphs, sentences, entities, or tokens; and extraction of lexical, syntactic, and semantic features for each of these segments. These features are useful for various machine-learning tasks such as text classification, assertion classification, and relation identification. TxtVect enables users to access these features without installation of the many necessary text processing and NLP tools.
Text Categorization of Heart, Lung and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing N-grams and Metadata Features.
Mindy K. Ross, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway
Under Review, Biomedical Informatics Insights Journal
The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contributions in genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, meaningful use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques and feature representation to improve study retrieval in the context of the dbGaP database. This study demonstrated that a combination of n-gram features, metadata features, and χ² feature selection applied to dbGaP studies increased classification accuracy and F-measure when compared to unigram-based feature representation. We demonstrated that PubMed studies can be used effectively as a surrogate identifier for related dbGaP studies.
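As an illustration of χ² feature selection as mentioned above, the sketch below scores each feature with the standard 2x2 contingency-table statistic and keeps the top k. It assumes binary labels and documents represented as sets of features (which could be n-grams or metadata fields); the toy documents and all names are invented for this example and do not come from the dbGaP study.

```python
def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 feature/class table.

    a: in-class documents containing the feature
    b: out-of-class documents containing the feature
    c: in-class documents without the feature
    d: out-of-class documents without the feature
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # feature or class is constant, so it carries no signal
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Return the k features with the highest chi-squared score."""
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
        b = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)
        c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == 1)
        d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == 0)
        scores[term] = chi_squared(a, b, c, d)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(vocabulary, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Toy corpus: label 1 = lung-related study, label 0 = heart-related study.
docs = [
    {"asthma", "lung"},
    {"lung", "cohort"},
    {"heart", "cohort"},
    {"heart", "failure"},
]
labels = [1, 1, 0, 0]

top_features, scores = select_features(docs, labels, k=2)
```

In this toy corpus "cohort" appears equally in both classes and scores zero, while "lung" and "heart" separate the classes perfectly and are the two features kept.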
Recreational Drug Slang: Identification of New Terms and Populating a Lexical Taxonomy Ontology.
Mindy K. Ross, Abhishek Kumar, Myoung Lah, Rick Calvo, Mike Conway
Under Review, AMIA Annual Symposium 2013 (Poster Presentation)
In this work, we present a data-driven approach to identifying recreational drug slang. We crawled recreational drug-related Internet forums to build a text corpus and used statistical keyword analysis to identify drug terms. The novelty of this work lies in the application of corpus linguistic techniques to the recreational drug slang domain. We demonstrate the viability of using online sources to mine relevant content and corpus linguistics methodology as a means to discover drug slang terms. In the future, we plan to develop a semi-automated mechanism for discovering drug-related slang terms (known and new terms) and structure our taxonomy to include further classes of common slang terms, such as compound drugs and drug paraphernalia.
Immediate Mode Scheduling Methods for Open Online Heterogeneous Systems.
Abhishek Kumar, Navneet Chaubey, Sireesha Yakkali
Student Paper at the 16th Intl. Conf. on High Performance Computing (HiPC) pdf
Grid infrastructures and grid-based applications are becoming common approaches for solving large-scale science and engineering problems. The efficient scheduling of independent computational jobs in a heterogeneous computing (HC) environment is an important problem in domains such as grid computing. In this work, we consider an online scheduling problem in immediate mode, where jobs arrive over time and are allocated to machines as soon as they arrive. All jobs’ characteristics are unknown before their arrival times. We implemented several scheduling algorithms and measured three metrics for comparison: response time, bounded slowdown and system utilization. Our simulation allowed us to identify which of the considered methods perform better for response time, bounded slowdown and utilization at different system loads. We also evaluate the usefulness of the methods if certain grid characteristics such as heterogeneity of jobs and resources are known in advance.
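The immediate-mode setting and the three metrics above can be sketched with a small simulation. The abstract does not name the algorithms compared, so the minimum-completion-time rule below is a commonly used immediate-mode heuristic chosen purely for illustration. The bounded-slowdown threshold of 10 time units, the utilization window (time 0 to the last finish), and the toy jobs and machine speeds are likewise assumptions.

```python
def schedule_mct(jobs, speeds):
    """Assign each job on arrival to the machine that finishes it earliest.

    jobs: list of (arrival_time, work), sorted by arrival time.
    speeds: work units processed per time unit, one entry per machine.
    Returns one (machine, start, finish) tuple per job.
    """
    free_at = [0.0] * len(speeds)  # time at which each machine becomes idle
    placements = []
    for arrival, work in jobs:
        best = None
        for machine, speed in enumerate(speeds):
            start = max(arrival, free_at[machine])
            finish = start + work / speed
            if best is None or finish < best[2]:
                best = (machine, start, finish)
        free_at[best[0]] = best[2]
        placements.append(best)
    return placements


def bounded_slowdown(arrival, start, finish, threshold=10.0):
    # Response time divided by run time, with very short runs clamped to
    # `threshold` so they do not dominate the average; never below 1.
    response = finish - arrival
    run = finish - start
    return max(1.0, response / max(run, threshold))


def utilization(placements, num_machines):
    # Fraction of machine time spent busy between time 0 and the last finish.
    busy = sum(finish - start for _, start, finish in placements)
    makespan = max(finish for _, _, finish in placements)
    return busy / (makespan * num_machines)


# Toy workload: three jobs, one slow machine and one fast machine.
jobs = [(0.0, 20.0), (1.0, 20.0), (2.0, 10.0)]
speeds = [1.0, 2.0]

placements = schedule_mct(jobs, speeds)
response_times = [finish - arrival
                  for (arrival, _), (_, _, finish) in zip(jobs, placements)]
slowdowns = [bounded_slowdown(arrival, start, finish)
             for (arrival, _), (_, start, finish) in zip(jobs, placements)]
system_utilization = utilization(placements, len(speeds))
```

The scheduler only ever looks at the job that has just arrived and the current machine state, which is what distinguishes immediate mode from batch-mode scheduling.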
Artificially Intelligent Grid Assistant.
Roshan Sumbaly, Abhishek Kumar, Shubham Malhotra, Gaurav Paruthi
Student Paper at the 14th Intl. Conf. on High Performance Computing (HiPC) pdf
We present a Grid-based application which works in collaboration with natural language processing (NLP) to act as a virtual assistant. The application can answer queries in a conversational manner and is capable of being deployed in various scenarios. Given the prevalence of large data sources in natural language engineering and the need for raw computational power in analysis of such data, the Grid Computing paradigm provides efficiency and scalability otherwise unavailable to researchers. In our work we explore the integration of Grid with NLP to mine relevant answers from these distributed resources. Our system receives queries from various interfaces and then uses NLP to understand the domain of the question. The Grid then routes these queries to the correct knowledge farm, depending on the domain found. Knowledge farms are distributed components which have large annotated domain-specific datasets. We propose a novel method which involves the working of the Grid and NLP in concert to mine relevant information quickly.
Recent Selected Talks
March 2013
UC San Diego, M.S. Oral Exam
Feature Engineering for Classification of Clinical Text (slides)
December 2012
UC San Diego course presentation, MED267
Developing a lexical taxonomy for recreational drug slang terms by crawling the Internet. (slides)
June 2012
Data Mining Cup Competition, Berlin, Germany (link)
Predicting product sales from historical data.
Links

Links to a few more things that I've worked on or experimented with.
Source code of projects that I've worked on.
Most are available under an Apache-like license.
Hobbies and travelogues
Recent Events
2013/03/20 — Gave a talk on TextVect and feature engineering for clinical text classification. Presented to the CSE faculty committee.
2012/06/17 — Talk on developing a lexical ontology terminology for recreational drugs by discovering new terms via the Internet. Based on course project for MED267.
2012/09/21 — Completed my summer internship at Google!
2012/06/27 — Gave a talk on predicting product sales from historical data, as part of the data mining cup competition organized by Prudsys.
2012/06/17 — Road trip from San Diego to San Francisco. Here are some pics.
Contact
Room 209E,
San Diego Supercomputer Center (SDSC)
La Jolla, CA 92093-0505


3983 Miramar Street,
La Jolla, CA 92037