A New Approach to Fragment
Assembly in DNA Sequencing
Pavel A. Pevzner
Department of Computer
Science and Engineering
University of California, San Diego, San Diego,
CA
Haixu Tang and Michael S.
Waterman
Department of
Mathematics,
University of Southern
California, Los Angeles, CA
For the last twenty
years, fragment assembly in DNA sequencing followed
the “overlap-layout-consensus” paradigm that is used in all currently
available algorithms. Although this approach culminated in some excellent
fragment assembly tools (Phrap, CAP3, TIGR, and Celera assemblers are among them), it
faces difficulties in genomic shotgun assembly: the existing algorithms
make assembly errors and are often unable to resolve repeats even in prokaryotic
genomes. All the programs we tested
misassembled 10-20% contigs while assembling shotgun reads from Campylobacter
jejuni, Neisseria meningitidis, and Lactococcus lactis
sequencing projects. Biologists are well-aware of potential assembly errors and
are forced to carry additional experiments to verify the assembled
contigs.
We abandon the
classical “overlap-layout-consensus” paradigm in favor of a
new approach that reduces the fragment assembly to a variation
of the classical Eulerian path problem, and, for the first time,
resolves the problem of repeats in fragment assembly. Our new approach
never even tries to find pairwise overlaps between reads and uses a new idea of
equivalent transformations. Instead of attempting to assemble reads from a
sequencing project, we instead transform the collection of reads into a new
one. For every such transformation
we prove that there is a one-to-one correspondence between the solutions of the
new and the original fragment assembly problem, i.e., we prove that the original
and new fragment assembly problems are equivalent. After millions of such
transformations we reduce the original problem to a new problem that can be
solved with our new Eulerian Superpath approach. Using this solution and applying
millions of reverse equivalent transformations, we obtain the fragment assembly
for the original problem.
Our
new EULER algorithm, based on equivalent transformations, resolves all
repeats except long 100% perfect repeats that are theoretically impossible
to resolve without additional experiments. This method of equivalent
transformations allows one to generate provably optimal
error-free fragment assemblies of bacterial genomes.
Comparative tests demonstrate that EULER generates better assemblies than the existing “overlap-layout-consensus” algorithms. For example, Phrap, CAP3 and TIGR assemblers make 17, 14, and 9 assembly errors correspondingly, while assembling real reads from the N. meningitidis genome. All these algorithms still make errors while assembling the error-free reads from the N. meningitidis genome (the number of errors reduces to 5, 4, and 2 correspondingly). Although the TIGR assembler makes less errors than other programs, this accuracy does not come for free, since this program produces twice more contigs than the other programs. EULER made no assembly errors and produced fewer contigs with real sequencing data than programs produced with error-free data. We further describe an approach to scaling up EULER for assembly of eukaryotic genomes.
See here for detail.