Abhishek Kumar
I am a graduate student at UC San Diego, pursuing a Master's degree in Computer Science. My main research interests are machine learning, data mining, and analytics. I have worked with Prof. Charles Elkan on novel multilabel learning approaches (PCCs, beam search algorithms, neural network models). I have also worked with Prof. Wendy Chapman at the Division of Biomedical Informatics (DBMI) on developing TextVect, a tool for processing clinical text documents. With the NLP group at DBMI, I have helped analyze text classification methods for the database of Genotypes and Phenotypes (dbGaP) and developed a novel lexical taxonomy ontology of recreational drug terms, identifying new terms by crawling the Internet. For more information, see my resume and selected publications below.
During the summer of 2012, I worked with the Spam and Abuse team at Google Inc., Mountain View, CA, developing algorithms and systems for detecting misuse of Google’s services. After I graduate in March 2013, I plan to move to Mountain View, CA to pursue interesting applications of machine learning algorithms at Google.
Selected Publications and Conference Proceedings
Abhishek Kumar*, Shankar Vembu*, Aditya Menon, Charles Elkan
Under Review, Machine Learning Journal, 2013
This is an extended version of the earlier conference publication titled "Learning and Inference in Probabilistic Classifier Chains with Beam Search" (listed below).
Feature Engineering for Classification of Clinical Text.
Abhishek Kumar
Technical report, UC San Diego
pdf
Submitted in support of candidature for the Master of Science degree in Computer Science.
Assigning labels to clinical text documents is challenging because it requires sophisticated
feature engineering, and practically important because of the wide
adoption of electronic health record (EHR) systems. In this work, we introduce
TextVect, a high-throughput, modular tool that facilitates feature
engineering of unstructured clinical text. Empirical evaluation of the various feature
representation choices on benchmark clinical text datasets suggests that the
term-frequency (tf) and binary encoding methods are best suited for document-level
classification. Reducing the number of features through feature selection is
also helpful, and the BestFirst method outperforms other popular techniques. The
most helpful features are those belonging to controlled vocabularies, the UMLS
metathesaurus in particular. For datasets with multiple labels, we empirically evaluate
the performance of the state-of-the-art probabilistic classifier chains (PCC)
method. Results indicate that the PCC method outperforms
the binary relevance method, in which a separate classifier is trained for each label.
We discuss these results and demonstrate that TextVect is a valuable tool for
feature engineering when applied to classification of unstructured clinical text.
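As a rough illustration of the two feature representations named in the abstract, the sketch below encodes documents as term-frequency (tf) and binary vectors. It is not TextVect code: the whitespace tokenizer, the function names, and the two example sentences are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real clinical NLP)."""
    return text.lower().split()

def encode(doc_tokens, vocabulary, binary=False):
    """Return one feature vector over `vocabulary` for a single document.

    binary=False: raw term frequency (tf) of each vocabulary term.
    binary=True:  1 if the term occurs at least once, otherwise 0.
    """
    counts = Counter(doc_tokens)
    if binary:
        return [1 if counts[term] > 0 else 0 for term in vocabulary]
    return [counts[term] for term in vocabulary]

# Hypothetical toy documents, not taken from any benchmark dataset.
docs = ["chest pain and chest tightness", "no chest pain reported"]
tokenized = [tokenize(d) for d in docs]
vocabulary = sorted({t for toks in tokenized for t in toks})

tf_vectors = [encode(toks, vocabulary) for toks in tokenized]
binary_vectors = [encode(toks, vocabulary, binary=True) for toks in tokenized]
```

The two encodings differ only where a term repeats within a document, as "chest" does in the first example.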
Neural Network Models for Multilabel Learning.
Abhishek Kumar, Aditya Menon, Charles Elkan
Under Review, ECML/PKDD 2013
Multilabel learning is an extension of standard binary classification where the goal is to predict a set of labels (we call an individual label a tag) for each input example. The recent probabilistic classifier chain (PCC) method learns a series of probabilistic models that capture tag correlations. In this paper, we show how the PCC model may be viewed as a neural network with connections between output nodes. We then explore the benefits of using a shared hidden layer in the neural network, instead of connections between output nodes. This brings advantages that include tractable test-time inference and removing the need to select a fixed tag ordering.
Learning and Inference in Probabilistic Classifier Chains with Beam Search.
Abhishek Kumar*, Shankar Vembu*, Aditya Menon, Charles Elkan
In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol 7523, Springer Berlin Heidelberg, pp 665-680.
pdf
In this paper, we show how to use the classical technique of beam search for multilabel learning (MLL). A recent method for multilabel learning called probabilistic classifier chains (PCCs) has several appealing properties. However, PCCs suffer from the computational issue that inference (i.e., predicting the label of an example) requires time exponential in the number of tags. Also, PCC accuracy is sensitive to the ordering of the tags while training. In this work, we show how to use beam search to make inference tractable, and how to integrate beam search with training to determine a suitable tag ordering.
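The inference idea can be sketched as follows, assuming a chain that factorizes P(y | x) into per-tag conditionals P(y_i | x, y_1..y_{i-1}). The function `toy_cond_prob` and its probabilities are hypothetical stand-ins for a trained chain, and the sketch covers inference only, not the integration of beam search with training.

```python
import math
from itertools import product

def safe_log(p):
    """Log that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")

def beam_search(cond_prob, x, num_tags, beam_width):
    """Approximate the most probable tag vector under a classifier chain.

    cond_prob(x, prefix) returns P(y_i = 1 | x, y_1..y_{i-1}), i = len(prefix).
    Returns (tag tuple, log-probability) of the best assignment found.
    """
    beam = [((), 0.0)]
    for _ in range(num_tags):
        candidates = []
        for prefix, logp in beam:
            p1 = cond_prob(x, prefix)
            candidates.append((prefix + (0,), logp + safe_log(1.0 - p1)))
            candidates.append((prefix + (1,), logp + safe_log(p1)))
        # Keep only the beam_width most probable partial assignments.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

def exhaustive(cond_prob, x, num_tags):
    """Exact inference: score all 2**num_tags assignments."""
    best = None
    for y in product((0, 1), repeat=num_tags):
        logp = 0.0
        for i, bit in enumerate(y):
            p1 = cond_prob(x, y[:i])
            logp += safe_log(p1 if bit else 1.0 - p1)
        if best is None or logp > best[1]:
            best = (y, logp)
    return best

def toy_cond_prob(x, prefix):
    """Hypothetical two-tag chain; ignores x."""
    if len(prefix) == 0:
        return 0.4                      # P(y1 = 1)
    return 0.95 if prefix[0] == 1 else 0.5  # P(y2 = 1 | y1)

greedy = beam_search(toy_cond_prob, None, num_tags=2, beam_width=1)
wide = beam_search(toy_cond_prob, None, num_tags=2, beam_width=2)
exact = exhaustive(toy_cond_prob, None, num_tags=2)
```

In this toy chain, a beam of width 1 (greedy decoding) commits to y1 = 0 and misses the joint mode (1, 1), while a beam of width 2 recovers it. Beam search scores on the order of beam_width × num_tags candidates rather than 2^num_tags.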
TxtVect: A Tool for Extracting Features from Clinical Documents.
Abhishek Kumar, Wendy Chapman
Poster presentation at the American Medical Informatics Association (AMIA) Annual Symposium 2012.
pdf (Poster)
We present TxtVect, a tool for extracting features from clinical documents. It allows for segmentation of documents
into paragraphs, sentences, entities, or tokens; and extraction of lexical, syntactic, and semantic features for each of
these segments. These features are useful for various machine-learning tasks such as text classification, assertion
classification, and relation identification. TxtVect enables users to access these features without installation of the
many necessary text processing and NLP tools.
Text Categorization of Heart, Lung and Blood Studies in the Database of Genotypes and Phenotypes (dbGaP) Utilizing N-grams and Metadata Features.
Mindy K. Ross, Ko-Wei Lin, Karen Truong, Abhishek Kumar, Mike Conway
Under Review, Biomedical Informatics Insights Journal
The database of Genotypes and Phenotypes (dbGaP) allows researchers to understand phenotypic contributions in genetic conditions, generate new hypotheses, confirm previous study results, and identify control populations. However, meaningful use of the database is hindered by suboptimal study retrieval. Our objective is to evaluate text classification techniques and feature representation to improve study retrieval in the context of the dbGaP database. This study demonstrated that a combination of n-gram features, metadata features, and χ² feature selection applied to dbGaP studies increased classification accuracy and F-measure compared to a unigram-based feature representation. We demonstrated that PubMed studies can be used effectively as a surrogate identifier for related dbGaP studies.
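To make the n-gram and χ² feature selection steps concrete, here is a minimal sketch that extracts unigram and bigram features and ranks them with the standard χ² statistic for a 2×2 feature/class contingency table. The study's actual pipeline is not described here, so the function names, the binary-presence features, and the four toy documents are assumptions for illustration only.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """Contiguous n-grams of a token list, as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text, max_n=2):
    """Set of all 1..max_n-grams present in a document (binary presence)."""
    tokens = text.lower().split()
    return {g for n in range(1, max_n + 1) for g in ngrams(tokens, n)}

def chi2_score(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: in-class docs with the feature     b: out-of-class docs with it
    c: in-class docs without the feature  d: out-of-class docs without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def chi2_scores(docs, labels):
    """Score every extracted feature against a binary label (1 = in class)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    with_pos, with_neg = defaultdict(int), defaultdict(int)
    for doc, label in zip(docs, labels):
        for feature in extract_features(doc):
            (with_pos if label == 1 else with_neg)[feature] += 1
    scores = {}
    for feature in set(with_pos) | set(with_neg):
        a, b = with_pos[feature], with_neg[feature]
        scores[feature] = chi2_score(a, b, n_pos - a, n_neg - b)
    return scores

# Hypothetical toy study titles; label 1 marks the "heart" class.
docs = ["heart failure cohort", "heart disease cohort",
        "lung function study", "lung cancer study"]
labels = [1, 1, 0, 0]
scores = chi2_scores(docs, labels)
# Keep the k highest-scoring features, breaking ties alphabetically.
top3 = sorted(scores, key=lambda f: (-scores[f], f))[:3]
```

Features that appear in only one class score highest; a bigram seen in a single document, such as "heart failure", scores lower than the unigram "heart" that covers the whole class.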
Recreational Drug Slang: Identification of New Terms and Populating a Lexical Taxonomy Ontology.
Mindy K. Ross, Abhishek Kumar, Myoung Lah, Rick Calvo, Mike Conway
Under Review, AMIA Annual Symposium 2013 (Poster Presentation)
In this work, we present a data-driven approach to identifying recreational drug slang. We crawled recreational drug-related Internet forums to build a text corpus and used statistical keyword analysis to identify drug terms. The novelty of this work lies in the application of corpus linguistic techniques to the recreational drug slang domain. We demonstrate the viability of using online sources to mine relevant
content and of using corpus linguistics methodology to discover drug slang terms. In the future, we plan to develop a
semi-automated mechanism for discovering drug-related slang terms (both known and new) and to structure our taxonomy
to include further classes of common slang terms, such as compound drugs and drug paraphernalia.
Immediate Mode Scheduling Methods for Open Online Heterogeneous Systems.
Abhishek Kumar, Navneet Chaubey, Sireesha Yakkali
Student Paper at the 16th Intl. Conf. on High Performance Computing (HiPC)
pdf
Grid infrastructures and grid-based applications
are becoming common approaches for solving large
scale science and engineering problems. The efficient
scheduling of independent computational jobs in a
heterogeneous computing (HC) environment is an
important problem in domains such as grid computing.
In this work, we consider an online scheduling problem
in immediate mode, where jobs arrive over time and
are allocated to machines as soon as they arrive. All
jobs’ characteristics are unknown before their arrival
times. We implemented several scheduling algorithms
and measured three metrics for comparison: response
time, bounded slowdown and system utilization. Our
simulation allowed us to identify which of the
considered methods perform better for response time,
bounded slowdown and utilization at different system
loads. We also evaluate the usefulness of the methods if
certain grid characteristics such as heterogeneity of
jobs and resources are known in advance.
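The immediate-mode setting and the three metrics can be sketched as below. The abstract does not name the algorithms that were compared, so this example uses minimum completion time (MCT), a common immediate-mode heuristic, purely as an assumption. The job list, the machine speeds, the slowdown threshold `tau`, and the choice to measure utilization from time 0 to the last finish are likewise illustrative.

```python
def schedule_mct(jobs, speeds):
    """Assign each job, on arrival, to the machine that finishes it earliest.

    jobs:   list of (arrival_time, work) pairs
    speeds: work units each machine completes per time unit
    Machines run one job at a time, in arrival order.
    """
    free_at = [0.0] * len(speeds)  # time at which each machine becomes idle
    records = []
    for arrival, work in sorted(jobs):
        finish_times = [max(arrival, free_at[m]) + work / speed
                        for m, speed in enumerate(speeds)]
        machine = min(range(len(speeds)), key=finish_times.__getitem__)
        free_at[machine] = finish_times[machine]
        records.append({"arrival": arrival, "machine": machine,
                        "finish": finish_times[machine],
                        "run_time": work / speeds[machine]})
    return records

def metrics(records, num_machines, tau=1.0):
    """Mean response time, mean bounded slowdown, and system utilization.

    Bounded slowdown is max(response / max(run_time, tau), 1), where tau
    keeps very short jobs from dominating the average.
    """
    responses = [r["finish"] - r["arrival"] for r in records]
    slowdowns = [max(resp / max(r["run_time"], tau), 1.0)
                 for resp, r in zip(responses, records)]
    makespan = max(r["finish"] for r in records)
    busy = sum(r["run_time"] for r in records)
    return {"mean_response": sum(responses) / len(responses),
            "mean_bounded_slowdown": sum(slowdowns) / len(slowdowns),
            "utilization": busy / (num_machines * makespan)}

# Hypothetical workload: two machines, the second twice as fast.
records = schedule_mct(jobs=[(0, 4), (0, 4), (1, 2)], speeds=[1.0, 2.0])
summary = metrics(records, num_machines=2)
```

Each placement uses only jobs that have already arrived, which is what separates immediate mode from batch scheduling.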
Artificially Intelligent Grid Assistant.
Roshan Sumbaly, Abhishek Kumar, Shubham Malhotra, Gaurav Paruthi
Student Paper at the 14th Intl. Conf. on High Performance Computing (HiPC)
pdf
We present a Grid-based application that works in collaboration with natural language processing
(NLP) to act as a virtual assistant. The application can answer queries in a conversational manner and is capable of being deployed in various scenarios. Given the prevalence of large data sources in natural
language engineering and the need for raw computational power in analysis of such data, the Grid Computing paradigm provides efficiency and scalability otherwise unavailable to researchers. In our work we explore the integration of Grid with NLP, to mine
relevant answers from these distributed resources. Our system receives queries from various interfaces and then
uses NLP to understand the domain of the question. The Grid then routes these queries to the correct knowledge
farm, depending on the domain found. Knowledge farms are distributed components that hold large annotated
domain-specific datasets. We propose a novel method in which the Grid and NLP work in
concert to mine relevant information quickly.
March 2013
UC San Diego, M.S. Oral Exam
Feature Engineering for Classification of Clinical Text (slides)
December 2012
UC San Diego course presentation, MED267
Developing a lexical taxonomy for recreational drug slang terms by crawling the Internet. (slides)
June 2012
Data Mining Cup Competition, Berlin, Germany (link)
Predicting product sales from historical data.
Links to a few more things that I've worked on or experimented with.
Source code of projects that I've worked on.
Most are available under an Apache-like license.