About us
People
Seminars & Meetings
Online Software
EULER
GRIMM
MGR
InvChecker
MULTIPROFILER
Splicing graphs
PatternBranching
EULER short reads assembler
MS/MS spectra assembly
Spectral Networks
PepNovo
PepNovoDB
Uniform Projection Motif Finder
Comparative genomics motif finding
Alu Repeat Subfamily Classification
RepeatScout
ABA
EULER-AIR
Subgroups
Press Coverage
Publications
Internal: Group members only

Software and Web Tools

EULER - Genome Assembler
GRIMM - Pairwise Genome Rearrangement Analysis
MGR - Multiple Genome Rearrangement Analysis
InvChecker: Micro-inversion Finder
MULTIPROFILER - subtle motif finder (download)
Splicing graphs - a new EST assembly approach
PatternBranching/ProfileBranching - Finding subtle motifs by branching from sample strings
EULER - Short reads assembler
MS/MS spectra assembly
Spectral Networks
PepNovo - de novo peptide sequencing tool
PepNovoDB
Uniform Projection Motif Finder
Comparative genomics motif finding
Alu Repeat Subfamily Classification
RepeatScout - de novo identification of repeat families via extension of consensus seeds
ABA
EULER-AIR

EULER - Genome Assembler

Euler is a new approach to fragment assembly that abandons the classical "overlap - layout - consensus" paradigm that is used in all currently available assembly tools, in favor of a new Eulerian Superpath approach that, for the first time, resolves the problem of repeats in fragment assembly. EULER, in contrast to the Celera assembler, does not mask repeats but uses them instead as a powerful fragment assembly tool. This EULER server provides an online tool to assemble DNA sequences.

Description | Run it | Instructions

GRIMM - Genome Rearrangements in Man and Mouse (and other pairwise genome rearrangements)

Two genomes may have many genes in common, but the genes may be arranged in a different sequence or be moved between chromosomes. The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome to another via reversals, translocation, fission, and fusion. We provide a web server combining rearrangement algorithms for unichromosomal and multichromosomal genomes, with either signed or unsigned gene data. In each case, it computes the minimum possible number of rearrangement steps, and determines a possible scenario taking this number of steps.

Description | Run it | Instructions

MGR - Multiple Genome Rearrangements

Several genomes derived from a common ancestor may share many genes, but they may be arranged in a different sequence and be located on different chromosomes. The multiple genome rearrangement problem is to determine an optimal phylogenetic tree showing how a given collection of genomes may have evolved from a common ancestor.

Description | Run it | Instructions |

InvChecker: Micro-inversion Finder

In the past, the problem of phylogenetic tree reconstruction was mainly addressed by studies of highly homoplasic point mutations, thus making it difficult to resolve short branches of phylogenetic trees. We show that micro-inversions provide a new rich source of nearly perfect (homeoplasy-free) evolutionary characters that can be used as ``certificates" to verify different branches in a phylogenetic tree. Moreover, switching from point mutations to micro-inversions turns the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that there exist hundreds of thousands of micro-inversions in genomes of mammals from the NISC Comparative Sequencing Program, an untapped source of new phylogenetic characters.

InvChecker finds which reverse strand local alignments from a whole genome comparison correspond to true inversions.

(Availability pending acceptance of submitted manuscript)

MULTIPROFILER - subtle motif finder

Gene activity is often affected by binding transcription factors to short fragments in DNA sequences called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem. MULTIPROFILER utilizes multiprofiles which generalize the notion of a profile to detect subtle patterns that might escape detection by the standard profiles. A detailed description of the algorithm will be published in Bioinformatics soon. The source code is available for download as is. Unfortunately, due to resources constraint no support can be offered.

Download source

Splicing graphs - a new EST assembly approach

Alternative splicing has many implications for physiology, development and the genesis of diseases. One important approach to investigate alternative splicing is to assemble expressed sequence tags (ESTs) into a list of consensus sequences and to analyze them subsequently. Unfortunately, the EST assembly problem usually has multiple solutions. Instead of exhaustively listing these solutions, the splicing graph integrates the ESTs of a gene and represents all potential assemblies simultaneously as paths in the graph. This yields a compact, unambiguous, and biologically meaningful visualization of the huge EST data, showing all potential splicing variants and their relationships.

PatternBranching/ProfileBranching - Finding subtle motifs by branching from sample strings

Finding subtle motifs by branching from sample strings. Price A., Ramabhadran S., Pevzner P.A. 2003. To appear in Bioinformatics supplementary edition, Proceedings of the Second European Conference on Computational Biology (ECCB-2003). Paris, France.

Abstract:
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.

Download source

EULER: short reads assembler

The EULER Short assembly package contains two pre-processing programs, one for error correction, 'ecindel', and one for constructing the de-Bruijn graph of a set of reads, 'eqtrans_bal'. The assembly is then performed by euler_et using the equivalent transformation algorithm, described in Pevzner PA, Tang H, Waterman MS, An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA 2001 Aug 14; 98(17):9748-53.

Download source

MS/MS spectra assembly

The analysis of mass spectrometry data is still largely based on identification of single MS/MS spectra and does not attempt to make use of the extra information available in multiple MS/MS spectra from partially or completely overlapping peptides. Analysis of MS/MS spectra from multiple overlapping peptides opens up the possibility of assembling MS/MS spectra into entire proteins, similarly to the assembly of overlapping DNA reads into entire genomes. This online software tool recovers all or parts of the protein sequence through clustering, pairwise alignment, assembly and de-novo interpretation of the input MS/MS spectra.

Spectral Networks for Identification of Proteins and Post-Translational Modifications

Spectral networks are based on a new idea that allows one to perform MS/MS database search . . . without ever comparing a spectrum against a database. Spectral networks capitalize on spectral pairs - pairs of spectra obtained from overlapping (often non-tryptic) peptides or from unmodified and modified versions of the same peptide. While seemingly redundant, spectral pairs open up computational avenues that were never explored before. Having a spectrum of a modified peptide paired with a spectrum of an unmodified peptide, allows one to separate the prefix and suffix ladders, to greatly reduce the number of noise peaks, and to generate a small number of peptide reconstructions that are likely to contain the correct one. The MS/MS database search is thus reduced to extremely fast pattern matching (rather than time-consuming matching of spectra against databases). In addition to speed, this approach provides a new paradigm for identifying post-translational modifications and higly modified peptides.

Software will be available soon - Click here to register and receive an email as soon as the Spectral Networks package is available.

PepNovo - de novo peptide sequencing tool

Frank, A. and Pevzner, P. "PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling", Analytical Chemistry 77:964-973, 2005.

PepNovo is a software for de novo sequencing of peptides from mass spectra. PepNovo uses a probabilistic network to model the peptide fragmentation events in a mass spectrometer. In addition, it uses a likelihood ratio hypothesis test to determine if the peaks observed in the mass spectrum are more likely to have been produced under the fragmentation model, than under a probabilistic model that treats the appearance of peaks as random events.

Run it | Download Win32 Executable | Source

PepNovoDB - A peptide identification tool for data from precision mass spectrometry.

(Availability pending acceptance of submitted manuscript)

Download source

Uniform Projection Motif Finder

Buhler and Tompa (2002) introduced the random projection algorithm for the motif discovery problem and demonstrated that this algorithm performs well on both simulated and biological samples. We describe a modification of the random projection algorithm, called the uniform projection algorithm, which utilizes a different choice of projections. We replace the random selection of projections by a greedy heuristic that approximately equalizes the coverage of the projections. We show that this change in selection of projections leads to improved performance on motif discovery problems. Furthermore, the uniform projection algorithm is directly applicable to other problems where the random projection algorithm has been used, including comparison of protein sequence databases.

Download source

Comparative genomics motif finding

The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike short interfering RNA (siRNA) and micro-RNA (miRNA), effective computational and experimental screening methods are not currently known for this species of RNA molecule, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element). Existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This approach, while extremely valuable, is not suitable for finding motifs with highly diverged "non-alignable" flanking regions. In this website, we demonstrate that several unusually long and well conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions.

Go to supporting website

Alu Repeat Subfamily Classification

Alu repeats are the most abundant family of repeats in the human genome, with over 1 million copies comprising 10% of the genome. They have been implicated in human genetic disease and in the enrichment of gene-rich segmental duplications in the human genome, and form a rich fossil record of primate and human history. Alu repeat elements are believed to have arisen from the replication of a small number of source elements, whose evolution over time gives rise to the 31 Alu subfamilies currently reported in Repbase Update. We apply a novel method to identify and statistically validate 213 Alu subfamilies. We build an evolutionary tree of these subfamilies, and conclude that the history of Alu evolution is more complex than previous studies had indicated.

Download source

RepeatScout - de novo identification of repeat families via extension of consensus seeds

De novo repeat family identification is a challenging problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis. Our RepeatScout algorithm is more sensitive and orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that roughly 2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence.

Download source code, repeat family libraries, and ISMB slides

ABA: A program for multiple sequence alignment

A-Bruijn Aligner (ABA) is a new method for multiple sequence alignment that represents an alignment as a directed graph, possibly contains cycles. This representation permits alignments that contain repeated and/or shuffled elements.

Download

EULER-AIR: checking base-assignment errors in repeat regions of fragment assembly

Read assignment in repeat regions are often ambiguous, causing errors in base-assignment in the final assembly. EULER_AIR improves the read-assignment using an EM-procedure. Taking an assembled genomic sequence and the associated trace data, EULER_AIR can (i) discover and correct base-assignment errors; (ii) provide accurate read assignments; (iii) utilize finishing reads for accurate base assignment; and (iv) provide guidance for designing finishing experiments.