Fragment Assembly in DNA Sequencing
with Stefano Bonissone and Boyko Kakaradov

Imagine several copies of a book cut by scissors into 10 million small pieces. Each copy is cut in an individual way so that a piece from one copy may overlap a piece from another copy. Assuming that 1 million pieces are lost and the remaining 9 million are splashed with ink, try to recover the original text. After doing this you'll get a feeling of what a DNA sequencing problem is like. Classical sequencing technology allows a biologist to read short (300- to 500-letter) fragments per experiment (each of these fragments corresponds to one of the 10 million pieces). Next generation sequencing technologies produce shorter (30-100 nt) fragments. Computational biologists have to assemble the entire genome from these short fragments, a task not unlike assembling the book from millions of slips of paper. The problem is complicated by unavoidable experimental errors (ink splashes).

For the last twenty years, fragment assembly in DNA sequencing followed the "overlap - layout - consensus" paradigm. Although this approach proved to be useful, it does not work for next-generation technologies with huge volumes of data.

We abandon the classical "overlap - layout - consensus" approach in favor of a new Eulerian approach. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem. This reduction opens new possibilities for repeat resolution and allows one to generate solutions of the fragment assembly problems. We are currently exploring applications of the Eulerian approach to the short read technologies developed by 454 Life Sciences, Illumina and ABI.