Research

Broadly defined, my research interest and strength lie in developing algorithms for computational problems that arise from the unique features of biological data.

My research projects have ranged from genome rearrangements, repeat identification, and comparative genomic architecture to the study of plant duplications.

My present research focuses on computational epigenomics and genetic variation. I am interested in using both combinatorial and machine learning methods to study the combined effect of genomic and epigenomic variation on transcriptional activity, and to elucidate how cell fate selections are specified in early development and in other self-renewing cell populations.

Computational Epigenomics

The objective of epigenetics research is to elucidate how genetic information encoded in the DNA sequence and non-genetic factors jointly control gene expression. In particular, it aims to understand the dynamics of the human genome, the capacity of stem cells to self-renew, and the epigenetic factors that contribute to the development of tumors and disease. DNA methylation is one essential type of epigenetic modification. The genome-wide DNA methylation patterns obtained by sequencing often come from a heterogeneous population of cells, for instance a mixture of normal and cancerous cells, or, in the case of a single cell, a mixture of the two alleles, each of which may present a distinct pattern. I have been developing a statistical model for individual methylome inference and applying it to identify allele-specific methylation regions in the absence of single nucleotide polymorphisms (SNPs). These regions are highly likely to affect differential gene expression and/or to be associated with genetic imprinting. The results will help elucidate the nature of the regions that are partially methylated in differentiated cells but fully methylated in stem cells. (Supervised by Prof. Joe Ecker)

Multiple Assembly in Re-sequencing Analysis (NG High-Throughput Sequencing)

Motivated by certain analyses in genome re-sequencing, we formulated a novel family of computational problems. These are assembly problems in which multiple distinct sequences must be assembled, but in which the relative orientation and order of the reads are known (usually as a result of mapping the reads to a reference or related genome). We developed efficient algorithms for certain variants of the problem and showed that other variants are computationally intractable (NP-hard). These problems have a wide range of applications, including haplotype inference in polyploid organisms, population studies in metagenomics, and individual epigenome inference from DNA methylation data of a heterogeneous population of cells. (In collaboration with Prof. Andrew Smith)
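
To make the problem setting concrete, here is a minimal sketch of the simplest case one could imagine: error-free reads, each with a known start position, that must be split into exactly two consistent sequences. It is an illustration only and not the algorithms developed in this project. The read representation (start, string) and the function names conflicts and split_into_two are assumptions made for this example. Two reads conflict if they overlap and disagree at a shared position; the reads can be split into two consistent groups exactly when this conflict graph can be two-colored.

```python
from collections import deque


def conflicts(a, b):
    """Return True if two mapped reads overlap and disagree at a shared position.

    Each read is a hypothetical (start, sequence) pair on a common coordinate system.
    """
    (start_a, seq_a), (start_b, seq_b) = a, b
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    return any(seq_a[p - start_a] != seq_b[p - start_b] for p in range(lo, hi))


def split_into_two(reads):
    """Assign each read to group 0 or 1 so that no two reads in a group conflict.

    Returns a list of labels, or None if no such split into two groups exists.
    Reads that conflict with nothing are placed in group 0 arbitrarily.
    """
    n = len(reads)
    # Build the conflict graph: an edge joins every pair of disagreeing reads.
    adj = [
        [j for j in range(n) if j != i and conflicts(reads[i], reads[j])]
        for i in range(n)
    ]
    label = [None] * n
    for source in range(n):
        if label[source] is not None:
            continue
        label[source] = 0
        queue = deque([source])
        # Breadth-first two-coloring of one connected component.
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if label[v] is None:
                    label[v] = 1 - label[u]
                    queue.append(v)
                elif label[v] == label[u]:
                    return None  # odd cycle: two sequences cannot explain the reads
    return label


reads = [(0, "ACGT"), (2, "GTAA"), (0, "ACCT"), (2, "CTAA")]
print(split_into_two(reads))
```

In this toy input the first two reads agree with each other, the last two agree with each other, and every cross pair disagrees at position 2, so the reads fall into two groups. Real data adds sequencing errors, unknown numbers of sequences, and ambiguous components, which this sketch does not attempt to handle.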

Large Scale Duplications and Synteny Blocks (Comparative Genomics)

Genome rearrangement studies had focused on mammals and chickens. After surveying the literature on plant genomes and learning that plants present a bigger challenge because of their extensive large duplications, I decided to extend these studies to the more difficult case of plant genomes. We developed a new algorithm based on the A-Bruijn graph framework that overcomes the difficulties existing synteny block reconstruction algorithms have with large-scale duplications and with scalability. We applied the algorithm to derive large duplication blocks in the model plant genome A. thaliana, which is important for understanding the plant duplication history that provided an evolutionary advantage. We further generalized this approach to synteny block generation for multiple genomes, which provides a foundation for further comparative genomics studies. (Supervised by Profs. Pavel Pevzner and Joe Ecker)

Repeat Identification and Classification

I constructed the repeat library for the model plant Arabidopsis using a de novo repeat classification approach, and identified more repeat sequences than the known repeat libraries contain. The results were incorporated into the SIGnAL Arabidopsis methylome mapping tool at the Ecker lab. Repeat identification is very important in the sequence analysis pipeline because repeats need to be masked for subsequent analysis. Repeat regions may also become the focus of investigation when studying biologically correlated features. (Supervised by Profs. Pavel Pevzner and Joe Ecker)

Fragile Breakage Model of Chromosome Evolution

For many years, studies of chromosome evolution were dominated by the random breakage theory. Since 2003, there has been heated debate between the random breakage model and a newly proposed alternative, the fragile breakage model of chromosome evolution. This debate is directly relevant to the intriguing question of whether rearrangement hotspots exist in chromosome evolution (and how they correlate with cancer breakpoints). Through studies of synteny block generation algorithms and rearrangement simulations on human, mouse, rat, and simulated genomes, we showed that the arguments in support of the random breakage model were flawed, thereby settling the long debate between the two models. I further studied the links between rearrangements, regulatory regions, repeats, and gene distribution, which may partially explain the breakpoint re-use phenomenon, the basis of the fragile breakage model. (Supervised by Prof. Pavel Pevzner)