A New Approach to Fragment Assembly in DNA Sequencing

Pavel A. Pevzner

Department of Computer Science and Engineering
University of California, San Diego, San Diego, CA

 

Haixu Tang and Michael S. Waterman
Department of Mathematics,

University of Southern California, Los Angeles, CA


For the last twenty years, fragment assembly in DNA sequencing followed the “overlap-layout-consensus” paradigm that is used in all currently available algorithms. Although this approach culminated in some excellent fragment assembly tools (Phrap, CAP3, TIGR, and  Celera assemblers are among them), it faces difficulties in genomic shotgun assembly: the existing algorithms make assembly errors and are often unable to resolve repeats even in prokaryotic genomes.  All the programs we tested misassembled 10-20% contigs while assembling shotgun reads from Campylobacter jejuni, Neisseria meningitidis, and Lactococcus lactis sequencing projects. Biologists are well-aware of potential assembly errors and are forced to carry additional experiments to verify the assembled contigs.

We abandon the classical  “overlap-layout-consensus” paradigm in favor of a new approach that reduces the fragment assembly to a variation of the classical Eulerian path problem, and, for the first time, resolves the problem of repeats in fragment assembly. Our new approach never even tries to find pairwise overlaps between reads and uses a new idea of equivalent transformations. Instead of attempting to assemble reads from a sequencing project, we instead transform the collection of reads into a new one.  For every such transformation we prove that there is a one-to-one correspondence between the solutions of the new and the original fragment assembly problem, i.e., we prove that the original and new fragment assembly problems are equivalent. After millions of such transformations we reduce the original problem to a new problem that can be solved with our new Eulerian Superpath approach.  Using this solution and applying millions of reverse equivalent transformations, we obtain the fragment assembly for the original problem.

Our new EULER algorithm, based on equivalent transformations, resolves all repeats except long 100% perfect repeats that are theoretically impossible to resolve without additional experiments. This method of equivalent transformations allows one to generate provably optimal error-free fragment assemblies of bacterial genomes.

Comparative tests demonstrate that EULER generates better assemblies than the existing  “overlap-layout-consensus” algorithms. For example, Phrap, CAP3 and TIGR assemblers make 17, 14, and 9 assembly errors correspondingly, while assembling real reads from the N. meningitidis genome. All these algorithms still make errors while assembling the error-free reads from the N. meningitidis genome (the number of errors reduces to 5, 4, and 2 correspondingly). Although the TIGR assembler makes less errors than other programs, this accuracy does not come for free, since this program produces twice more contigs than the other programs. EULER made no assembly errors and produced fewer contigs with real sequencing data than programs produced with error-free data. We further describe an approach to scaling up EULER for assembly of eukaryotic genomes.

See here for detail.