The information-rich data produced by next-generation sequencing (NGS) have the potential to answer many important biological questions, ranging from understanding how complex animal life evolved through changes in DNA (Genome 10K project) to uncovering causal mutations in many complex diseases (The Cancer Genome Atlas). As NGS becomes pervasive and affordable, it is important that the development of computational methods for interpreting these data keeps pace. Here I outline the major directions that I will pursue:


Using single cell sequencing to characterize different cell types and somatic mosaicism in the human brain

One important problem in neuroscience is to determine the number of different cell types in the human brain. Cell types reflect stable cellular properties, and these cells have generally arisen by differentiation through multiple developmental pathways. A second problem is to characterize the role of somatic DNA mosaicism among the neurons within a brain. Retrotransposition is particularly common in neurons during early stages of development and contributes to genomic differences between neurons, which may result in the functional diversity of neurons and their associated circuits. In collaboration with Roger Lasken’s and Fred Gage’s groups, we will continue to develop bioinformatics pipelines that use single cell sequencing of individual brain cells to answer these questions.

Reference-assisted assembly for mammalian genomes and ancestral genomes reconstruction

Despite the rapid development of sequencing technologies, assembling mammalian genomes into complete chromosomes remains one of the most challenging problems in bioinformatics. To help address this difficulty, we will continue to develop Ragout2, a reference-assisted assembly tool for mammalian genomes. Ragout2 builds on Ragout (ISMB 2014), a reference-assisted assembly tool for bacterial genomes. Taking as input a target assembly (generated by an NGS assembler) and one or more related reference genomes, Ragout2 infers their evolutionary relationships and builds the final assembly of the target genome using a genome rearrangement approach. While Ragout2 and Ragout both use multiple references to infer the final assembly, Ragout2 will address many new algorithmic challenges that do not arise in bacterial genome assembly. The Ragout2 algorithm is in active development in collaboration with UCSD (Mikhail Kolmogorov in Pavel Pevzner's group), the European Bioinformatics Institute (David Thybert), and the UCSC Center for Biomolecular Science & Engineering (Benedict Paten in David Haussler's group).
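To illustrate the general idea of using several references to order target contigs, the following is a toy sketch in Python. It is not the Ragout or Ragout2 algorithm: it ignores contig orientation, phylogeny, and rearrangements, and simply lets each reference "vote" for pairs of contigs that appear next to each other. The function names (`vote_adjacencies`, `chain_contigs`), the `min_support` parameter, and the greedy joining strategy are all assumptions made for illustration.

```python
from collections import Counter


def vote_adjacencies(reference_orders):
    """Count how many references place each pair of contigs side by side."""
    votes = Counter()
    for order in reference_orders:
        for left, right in zip(order, order[1:]):
            votes[frozenset((left, right))] += 1
    return votes


def chain_contigs(contigs, reference_orders, min_support=2):
    """Greedily join contigs along the best-supported adjacencies.

    Hypothetical simplification: orientation and phylogeny are ignored.
    """
    votes = vote_adjacencies(reference_orders)
    neighbors = {c: [] for c in contigs}
    parent = {c: c for c in contigs}  # union-find, used to avoid cycles

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Highest support first; ties broken by contig names for determinism.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    for pair, support in ranked:
        if support < min_support or len(pair) != 2:
            continue
        a, b = sorted(pair)
        if a not in neighbors or b not in neighbors:
            continue  # adjacency involves a contig we were not asked about
        if len(neighbors[a]) >= 2 or len(neighbors[b]) >= 2:
            continue  # a contig can have at most two neighbors in a chain
        if find(a) == find(b):
            continue  # joining would close a cycle
        neighbors[a].append(b)
        neighbors[b].append(a)
        parent[find(a)] = find(b)

    # Walk each chain starting from one of its endpoints.
    scaffolds, seen = [], set()
    for start in sorted(contigs):
        if start in seen or len(neighbors[start]) == 2:
            continue
        chain, prev, cur = [], None, start
        while cur is not None:
            chain.append(cur)
            seen.add(cur)
            nxt = [n for n in neighbors[cur] if n != prev]
            prev, cur = cur, (nxt[0] if nxt else None)
        scaffolds.append(chain)
    return scaffolds


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    references = [
        ["c1", "c2", "c3", "c4"],
        ["c1", "c2", "c4", "c3"],
        ["c2", "c1", "c3", "c4"],
    ]
    print(chain_contigs(contigs, references, min_support=2))
```

In this example the references agree on the c1/c2 and c3/c4 adjacencies but disagree on how the two pairs connect, so with `min_support=2` the sketch reports two separate scaffolds rather than guessing. A rearrangement-based method would instead try to resolve such conflicts using the evolutionary relationships among the genomes.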

Structural variations, transposable elements detection, and their roles in neurological diseases

Genomic structural variation (SV), including copy number variation (CNV), gene fusions, and transposable element (TE) insertions, is a significant contributor to many devastating diseases, including cancer and neurological diseases. Most current approaches for SV detection use either read depth or discordant read pairs from high-throughput short-read sequencing data. Though these methods continue to be refined, structural variation detection has reached a bottleneck on two fronts. First, current algorithms perform poorly in repetitive regions. Second, these algorithms have not been designed to detect complex structural variation events that involve more than two breakpoints. Building on recent advances in long-read mapping tools, we combine a rearrangement algorithm with the mapping information for variant calling. This rearrangement analysis step, together with the longer read length, will help to detect complex structural variation events in repetitive regions. We will apply the resulting tool to study the role of structural variations and retrotransposon elements in human aging and neurological diseases.
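For readers unfamiliar with the discordant read-pair signal mentioned above, here is a minimal Python sketch of how a short-read pair might be classified. It assumes a standard forward/reverse paired-end library and uses commonly cited signatures (unexpected span, orientation, or chromosome). The `ReadPair` type, the `classify_pair` function, and the `expected_insert` and `tolerance` parameters are hypothetical names chosen for illustration, and the labels are candidate event types only, not calls made by any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two mates of one read pair (hypothetical type)."""
    name: str
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str  # "+" or "-"


def classify_pair(pair, expected_insert, tolerance):
    """Label a read pair as concordant or as a candidate SV signature.

    Assumes a forward/reverse library: the leftmost mate is expected on "+"
    and the rightmost mate on "-", separated by roughly `expected_insert`.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"

    # Order the two mates by mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )

    if left_strand == right_strand:
        return "inversion"  # both mates map to the same strand
    if (left_strand, right_strand) == ("-", "+"):
        return "tandem_duplication"  # mates point away from each other

    span = right_pos - left_pos
    if span > expected_insert + tolerance:
        return "deletion"  # mates map farther apart than expected
    if span < expected_insert - tolerance:
        return "insertion"  # mates map closer together than expected
    return "concordant"


if __name__ == "__main__":
    pairs = [
        ReadPair("a", "chr1", 100, "+", "chr1", 500, "-"),
        ReadPair("b", "chr1", 100, "+", "chr1", 5000, "-"),
        ReadPair("c", "chr1", 100, "+", "chr2", 900, "-"),
    ]
    for p in pairs:
        print(p.name, classify_pair(p, expected_insert=400, tolerance=100))
```

A per-pair classifier of this kind handles only simple two-breakpoint events. It says nothing about events with more than two breakpoints or about pairs that map ambiguously within repeats, which are the two limitations named above.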

Long read error correction and multiploid genome assembly

Recently, long-read sequencing technologies have emerged (e.g., Pacific Biosciences, Nanopore) that can now produce reads tens of kilobases in length. However, these long reads come with very high error rates (a 16% error rate for Pacific Biosciences), and it remains a challenge to assemble genomes from these long reads alone. To attack this problem, we aim first to develop an error correction tool for long-read data, using a k-mer index profile and a local assembly approach. Reads whose k-mer profiles are highly similar are grouped together and then subjected to a local assembly step. Using each resulting consensus contig, an error correction step is then performed that aims to minimize the total number of changes made to the reads in each cluster.

When a diploid genome has few structural variations between homologous chromosomes, assembly approaches designed for haploid genomes may work well. However, for diploid genomes with many structural variations, current assemblers generate fragmented contigs. To tackle this problem, one can use the assembly graph to mask polymorphisms in contigs and represent the genome as a comprehensive "repeat" graph, in which multiple homologous regions are collapsed into a single path. Long reads will then be used to distinguish the multiple copies of homologous chromosomes.
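The read-grouping step of the error correction approach described above can be illustrated with a small Python sketch. The choice of Jaccard similarity over k-mer sets, the `min_similarity` threshold, the all-pairs comparison, and the function names are assumptions made for illustration; the statement above does not specify how k-mer profiles are scored, and a practical tool would need an index rather than a quadratic scan.

```python
def kmer_set(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_reads(reads, k=4, min_similarity=0.5):
    """Group reads whose k-mer sets are similar, using single linkage.

    Returns a sorted list of groups, each a list of read indices.
    The threshold and the similarity measure are illustrative assumptions.
    """
    profiles = [kmer_set(r, k) for r in reads]
    parent = list(range(len(reads)))  # union-find over read indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of reads; fine for a toy example only.
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if jaccard(profiles[i], profiles[j]) >= min_similarity:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(reads)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    reads = ["ACGTACGTGGA", "ACGTACGTGGC", "TTTTCCCCAAAG"]
    print(group_reads(reads, k=4, min_similarity=0.5))
```

In this toy input the first two reads differ only in their final base and share six of eight distinct 4-mers, so they fall into one group, while the third read shares none and stays alone. Each group would then be passed to the local assembly step. With real long-read error rates, exact k-mer matches become much rarer, so the choice of k and of the similarity threshold would need careful tuning.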