Software and Web Tool

MGRA -- Multiple Genome Rearrangements and Ancestors

Recently completed whole genome sequencing projects marked the transition from gene-based phylogenetic studies to phylogenomics analysis of entire genomes. We developed an algorithm MGRA for reconstruction of phylogenetic trees as well as ancestral genomes and applied it to study the rearrangement history of seven mammalian genomes: human, chimpanzee, macaque, mouse, rat, dog, and opossum. MGRA relies on the new notion of the multiple breakpoint graphs to overcome some limitations of the existing approaches to ancestral genome reconstructions. In particular, we applied \MGRA to analyze the primate--rodent--carnivore controversy in mammalian phylogeny, i.e., the alternative between the primate--rodent and primate--carnivore clades. MGRA provided the rearrangement-based evidence, albeit limited, for the primate--carnivore clade as opposed to the currently favored primate--rodent clade.

MGRA sources ver. 1.0 (2009/02/15)

EULER - Genome Assembler

Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman. An Eulerian path approach to DNA fragment assembly. PNAS , August 14, 2001 , vol. 98 , no. 17 , 9748-9753

Pevzner PA, Tang H. Fragment assembly with double-barreled data. Bioinformatics. 2001;17 Suppl 1:S225-33.

Pevzner PA, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Res. 2004 Sep;14(9):1786-96.

Euler abandons the classical "overlap - layout - consensus" paradigm that is used in all currently available assembly tools, in favor of a new Eulerian approach. EULER, in contrast to other assemblers, does not mask repeats but uses them instead as a powerful fragment assembly tool. This EULER server provides an online tool to assemble DNA sequences.

Description | Instructions


Son K. Pham and Pavel A. Pevzner. DRIMM-Synteny: Decomposing Genomes into Evolutionary Conserved Segments. Bioinformatics, 2010:1367-4803

DRIMM algorithm extends the GRIMM-Synteny (Genome Rearrangements in Men and Mice) algorithm towards the problem of identifying the synteny blocks in highly duplicated genomes. The algorithm devises the first A-Bruijn graph approach that does not require a threading step and substituting it with an alternative genome transformation step implemented in the DRIMM-Synteny.


GRIMM - Genome Rearrangements in Man and Mouse (and other pairwise genome rearrangements)

Hannenhalli S., Pevzner P.A. Transforming cabbage into turnip (polynomial algorithm for sorting signed permutations by reversals). Journal of the ACM (JACM), Volume 46, Issue 1 (January 1999)

Tesler, G., GRIMM. Genome rearrangements web server. Bioinformatics, 18(3), March 2002, 492-493.

Tesler, G., Efficient algorithms for multichromosomal genome rearrangements. Journal of Computer and System Sciences, 65 no. 3 (2002), 587-609.

Two genomes may have many genes in common, but the genes may be arranged in a different sequence or be moved between chromosomes. The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome to another via reversals, translocation, fission, and fusion. We provide a web server combining rearrangement algorithms for unichromosomal and multichromosomal genomes, with either signed or unsigned gene data. In each case, it computes the minimum possible number of rearrangement steps, and determines a possible scenario taking this number of steps.

Description | Run it | Instructions

MGR - Multiple Genome Rearrangements

Bourque G, Pevzner PA. Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Res. 2002 Jan;12(1):26-36.

Several genomes derived from a common ancestor may share many genes, but they may be arranged in a different sequence and be located on different chromosomes. The multiple genome rearrangement problem is to determine an optimal phylogenetic tree showing how a given collection of genomes may have evolved from a common ancestor.

Description | Run it | Instructions

InvChecker: Micro-inversion Finder

Chaisson MJ, Raphael BJ, Pevzner PA. Microinversions in mammalian evolution. Proc Natl Acad Sci U S A. 2006 Dec 26;103(52):19824-9. Epub 2006 Dec 22.

In the past, the problem of phylogenetic tree reconstruction was mainly addressed by studies of highly homoplasic point mutations, thus making it difficult to resolve short branches of phylogenetic trees. We show that micro-inversions provide a new rich source of nearly perfect (homeoplasy-free) evolutionary characters that can be used as ``certificates" to verify different branches in a phylogenetic tree. Moreover, switching from point mutations to micro-inversions turns the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that there exist hundreds of thousands of micro-inversions in genomes of mammals from the NISC Comparative Sequencing Program, an untapped source of new phylogenetic characters.

InvChecker finds which reverse strand local alignments from a whole genome comparison correspond to true inversions.

Download source | Instructions

MULTIPROFILER - subtle motif finder

Keich U, Pevzner PA. Subtle motifs: defining the limits of motif finding algorithms. Bioinformatics. 2002 Oct;18(10):1382-90.

Keich U, Pevzner PA. Finding motifs in the twilight zone. Bioinformatics. 2002 Oct;18(10):1374-81.

In the past, the problem of phylogenetic tree reconstruction was mainly addressed by studies of highly homoplasic point mutations, thus making it difficult to resolve short branches of phylogenetic trees. We show that micro-inversions provide a new rich source of nearly perfect (homeoplasy-free) evolutionary characters that can be used as ``certificates" to verify different branches in a phylogenetic tree. Moreover, switching from point mutations to micro-inversions turns the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that there exist hundreds of thousands of micro-inversions in genomes of mammals from the NISC Comparative Sequencing Program, an untapped source of new phylogenetic characters.

InvChecker finds which reverse strand local alignments from a whole genome comparison correspond to true inversions.

Download source

Splicing graphs - a new EST assembly approach

Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA. Splicing graphs and EST assembly problem. Bioinformatics. 2002;18 Suppl 1:S181-8.

Alternative splicing has many implications for physiology, development and the genesis of diseases. One important approach to investigate alternative splicing is to assemble expressed sequence tags (ESTs) into a list of consensus sequences and to analyze them subsequently. Unfortunately, the EST assembly problem usually has multiple solutions. Instead of exhaustively listing these solutions, the splicing graph integrates the ESTs of a gene and represents all potential assemblies simultaneously as paths in the graph. This yields a compact, unambiguous, and biologically meaningful visualization of the huge EST data, showing all potential splicing variants and their relationships.

Download source

PatternBranching/ProfileBranching - Finding subtle motifs by branching from sample strings

Price A., Ramabhadran S., Pevzner P.A. 2003. Finding subtle motifs by branching from sample strings. Bioinformatics. 2003 Oct;19 Suppl 2:ii149-55.

Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.

Download source

EULER: short reads assembler

Chaisson M, Pevzner P, Tang H. Fragment assembly with short reads. Bioinformatics. 2004 Sep 1;20(13):2067-74. Epub 2004 Apr 1.

Chaisson M, Pevzner P. Short read fragment assembly of bacterial genomes. Genom Res. 2008 Epub 2007 Dec 14.

The EULER-SR assembly package contains a suite of programs for correcting errors in short reads and assembling them. Our assembler may take as input classical Sanger reads, 454 sequences, and Illumina reads.

Download source

MS/MS spectra assembly

Bandeira N, Tang H, Bafna V, Pevzner P. Shotgun protein sequencing by tandem mass spectra assembly. Anal Chem. 2004 Dec 15;76(24):7221-33.

Bandeira N, Clauser KR, Pevzner PA. Shotgun protein sequencing: assembly of peptide tandem mass spectra from mixtures of modified proteins. Mol Cell Proteomics. 2007 Jul;6(7):1123-34. Epub 2007 Apr 19.

The analysis of mass spectrometry data is still largely based on identification of single MS/MS spectra and does not attempt to make use of the extra information available in multiple MS/MS spectra from partially or completely overlapping peptides. Analysis of MS/MS spectra from multiple overlapping peptides opens up the possibility of assembling MS/MS spectra into entire proteins, similarly to the assembly of overlapping DNA reads into entire genomes. This online software tool recovers all or parts of the protein sequence through clustering, pairwise alignment, assembly and de-novo interpretation of the input MS/MS spectra.

Read more

MS Dictionary

Sangtae Kim, Nitin Gupta and Pavel Pevzner. Spectral Dictionaries: Integrating De Novo Peptide Sequencing with Database Search of Tandem Mass Spectra. Submitted.

MS-Dictionary is a software to generate all plausible de novo interpretations of a tandem mass spectrum(spectral dictionary) and matches them against a protein database quickly. It enables proteogenomic searches in six-frame translation of genomic sequences that may be prohibitively time-consuming for existing database search approaches.

Read more

MS GeneratingFunction

Sangtae Kim, Nitin Gupta and Pavel Pevzner. The Generating Function of Tandem Mass Spectra: a Database Independent Approach to Evaluating Statistical Significance of Peptide Identifications. Submitted.

A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (like spectral energy) represent new features of tandem mass spectra that significantly improve peptide identifications. We further describe how to efficiently compute the generating function of a tandem mass spectrum and to use it for rigorous computing of error rates in MS/MS searches. This improves the sensitivity-specificity trade-off of existing MS/MS search tools, addresses the notoriously difficult problem of 'one-hit-wonders' in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in typical MS/MS searches.

Read more | Download

Spectral Networks for Identification of Proteins and Post-Translational Modifications

Bandeira N, Tsur D, Frank A, Pevzner PA. Protein identification by spectral networks analysis. Proc Natl Acad Sci U S A. 2007 Apr 10;104(15):6140-5. Epub 2007 Apr 2.

Spectral networks are based on a new idea that allows one to perform MS/MS database search . . . without ever comparing a spectrum against a database. Spectral networks capitalize on spectral pairs - pairs of spectra obtained from overlapping (often non-tryptic) peptides or from unmodified and modified versions of the same peptide. While seemingly redundant, spectral pairs open up computational avenues that were never explored before. Having a spectrum of a modified peptide paired with a spectrum of an unmodified peptide, allows one to separate the prefix and suffix ladders, to greatly reduce the number of noise peaks, and to generate a small number of peptide reconstructions that are likely to contain the correct one. The MS/MS database search is thus reduced to extremely fast pattern matching (rather than time-consuming matching of spectra against databases). In addition to speed, this approach provides a new paradigm for identifying post-translational modifications and higly modified peptides.

Download: Open Source | Windows

PepNovo - de novo peptide sequencing tool

Frank A, Tanner S, Bafna V, Pevzner P. Peptide sequence tags for fast database search in mass-spectrometry. J Proteome Res. 2005 Jul-Aug;4(4):1287-95.

Frank A, Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Anal Chem. 2005 Feb 15;77(4):964-73.

PepNovo is a software for de novo sequencing of peptides from mass spectra. PepNovo uses a probabilistic network to model the peptide fragmentation events in a mass spectrometer. In addition, it uses a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under the fragmentation model, than under a probabilistic model that treats the appearance of peaks as random events.

Run it | Download Win32 Executable | Source

PepNovoDB - A de novo peptide sequencing and identification tool for precision mass spectrometry.

Frank AM, Savitski MM, Nielsen ML, Zubarev RA, Pevzner PA. De novo peptide sequencing and identification with precision mass spectrometry. J Proteome Res. 2007 Jan;6(1):114-23.

The recent proliferation of novel mass spectrometers such as Fourier transform, QTOF, and OrbiTrap marks a transition into the era of precision mass spectrometry, providing a 2 orders of magnitude boost to the mass resolution, as compared to low-precision ion-trap detectors. We investigate peptide de novo sequencing by precision mass spectrometry and explore some of the differences when compared to analysis of low-precision data. We demonstrate how the dramatically improved performance of de novo sequencing with precision mass spectrometry paves the way for novel approaches to peptide identification that are based on direct sequence lookups, rather than comparisons of spectra to a database. With the direct sequence lookup, it is not only possible to search a database very efficiently, but also to use the database in novel ways, such as searching for products of alternative splicing or products of fusion proteins in cancer.

Download source

Uniform Projection Motif Finder

Raphael B, Liu LT, Varghese G. A uniform projection method for motif discovery in DNA sequences. IEEE/ACM Trans Comput Biol Bioinform. 2004 Apr-Jun;1(2):91-4.

Buhler and Tompa (2002) introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.

Download source

Comparative genomics motif finding

Jones NC, Pevzner PA. Comparative genomics reveals unusually long motifs in mammalian genomes. Bioinformatics. 2006 Jul 15;22(14):e236-42.

The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike short interfering RNA (siRNA) and micro-RNA (miRNA), effective computational and experimental screening methods are not currently known for this species of RNA molecule, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element). Existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This approach, while extremely valuable, is not suitable for finding motifs with highly diverged "non-alignable" flanking regions. In this website, we demonstrate that several unusually long and well conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions.

Go to supporting website

Alu Repeat Subfamily Classification

Price AL, Eskin E, Pevzner PA. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 2004 Nov;14(11):2245-52.

Alu repeats are the most abundant family of repeats in the human genome, with over 1 million copies comprising 10% of the genome. They have been implicated in human genetic disease and in the enrichment of gene-rich segmental duplications in the human genome, and form a rich fossil record of primate and human history. Alu repeat elements are believed to have arisen from the replication of a small number of source elements, whose evolution over time gives rise to the 31 Alu subfamilies currently reported in Repbase Update. We apply a novel method to identify and statistically validate 213 Alu subfamilies. We build an evolutionary tree of these subfamilies, and conclude that the history of Alu evolution is more complex than previous studies had indicated.

Download source

RepeatScout - de novo identification of repeat families via extension of consensus seeds

Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics. 2005 Jun;21 Suppl 1:i351-8.

De novo repeat family identification is a challenging problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis. Our RepeatScout algorithm is more sensitive and orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that roughly 2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence.

Download source

ABA: A program for multiple sequence alignment

Raphael B, Zhi D, Tang H, Pevzner P. A novel method for multiple alignment of sequences with repeated and shuffled elements. Genome Res. 2004 Nov;14(11):2336-46.

A-Bruijn Aligner (ABA) is a new method for multiple sequence alignment that represents an alignment as a directed graph, possibly contains cycles. This representation permits alignments that contain repeated and/or shuffled elements.


EULER-AIR: checking base-assignment errors in repeat regions of fragment assembly

Zhi D, Keich U, Pevzner P, Heber S, Tang H. Correcting base-assignment errors in repeat regions of shotgun assembly. IEEE/ACM Trans Comput Biol Bioinform. 2007 Jan-Mar;4(1):54-64.

Read assignment in repeat regions are often ambiguous, causing errors in base-assignment in the final assembly. EULER_AIR improves the read-assignment using an EM-procedure. Taking an assembled genomic sequence and the associated trace data, EULER_AIR can (i) discover and correct base-assignment errors; (ii) provide accurate read assignments; (iii) utilize finishing reads for accurate base assignment; and (iv) provide guidance for designing finishing experiments.

Download source