Software and Web Tools
EULER - Genome Assembler
Euler is a new approach to fragment assembly that abandons the classical
"overlap - layout - consensus" paradigm that is used in all currently available
assembly tools, in favor of a new Eulerian Superpath approach that, for the first time,
resolves the problem of repeats in fragment assembly. EULER, in contrast to the Celera
assembler, does not mask repeats but uses them instead as a powerful
fragment assembly tool. This EULER server provides an online tool
to assemble DNA sequences.
Description |
Run it |
Instructions
GRIMM - Genome Rearrangements in Man and Mouse (and other pairwise genome rearrangements)
Two genomes may have many genes in common, but the genes may be
arranged in a different sequence or be moved between chromosomes.
The pairwise genome rearrangement problem is to find an
optimal scenario transforming one genome to another via
reversals, translocation, fission, and fusion.
We provide a web server combining rearrangement
algorithms for unichromosomal and multichromosomal genomes,
with either signed or unsigned gene data.
In each case, it computes the minimum possible number
of rearrangement steps, and determines a possible scenario
taking this number of steps.
Description |
Run it |
Instructions
MGR - Multiple Genome Rearrangements
Several genomes derived from a common ancestor may share many genes,
but they may be arranged in a different sequence and be located on different
chromosomes.
The multiple genome rearrangement problem is to
determine an optimal phylogenetic tree showing how a given
collection of genomes may have evolved from a common ancestor.
Description |
Run it |
Instructions |
InvChecker: Micro-inversion Finder
In the past, the problem of phylogenetic tree reconstruction was mainly addressed by studies of highly homoplasic point mutations, thus making it difficult to resolve short branches of phylogenetic trees. We show that micro-inversions provide a new rich source of nearly perfect (homeoplasy-free) evolutionary characters that can be used as ``certificates" to verify different branches in a phylogenetic tree. Moreover, switching from point mutations to micro-inversions turns the challenging problem of phylogeny reconstruction into a relatively simple algorithmic problem. We estimate that there exist hundreds of thousands of micro-inversions in genomes of mammals from the NISC Comparative Sequencing Program, an untapped source of new phylogenetic characters.
InvChecker finds which reverse strand local alignments from a whole genome comparison correspond to true inversions.
(Availability pending acceptance of submitted manuscript)
MULTIPROFILER - subtle motif finder
Gene activity is often affected by binding transcription factors to short fragments
in DNA sequences called motifs. Identification of subtle regulatory motifs in a DNA
sequence is a difficult pattern recognition problem. MULTIPROFILER utilizes multiprofiles which
generalize the notion of a profile to detect
subtle patterns that might escape detection by the standard profiles. A detailed description
of the algorithm will be published in Bioinformatics soon.
The source code is available for download as is. Unfortunately, due to resources constraint no
support can be offered.
Download source
Splicing graphs - a new EST assembly approach
Alternative splicing has many implications for physiology, development and the genesis of
diseases. One important approach to investigate alternative splicing is to assemble expressed sequence tags (ESTs)
into a list of consensus sequences and to analyze them subsequently. Unfortunately, the EST assembly problem
usually has multiple solutions. Instead of exhaustively listing these solutions, the splicing graph integrates the
ESTs of a gene and represents all potential assemblies simultaneously as paths in the graph. This yields a compact,
unambiguous, and biologically meaningful visualization of the huge EST data, showing all potential splicing variants and their
relationships.
Read more
PatternBranching/ProfileBranching - Finding subtle motifs by branching from sample strings
Finding subtle motifs by branching from sample strings.
Price A., Ramabhadran S., Pevzner P.A. 2003. To appear in
Bioinformatics supplementary edition, Proceedings of the
Second European Conference on Computational Biology (ECCB-2003).
Paris, France.
Abstract:
Many motif finding algorithms apply local search techniques to a set of
seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs
sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the
EM algorithm to sample strings, i.e. substrings of the sample.
In the case of subtle motifs, recent benchmarking efforts show that
both random seeds and selected sample strings may never get close
to the globally optimal motif. We propose a new approach which searches
motif space by branching from sample strings, and implement this idea in
both pattern-based and profile-based settings. Our PatternBranching and
ProfileBranching algorithms achieve favorable results relative to other
motif finding algorithms.
Download source
EULER: short reads assembler
The EULER Short assembly package contains two pre-processing programs, one
for error correction, 'ecindel', and one for constructing the de-Bruijn
graph of a set of reads, 'eqtrans_bal'. The assembly is then performed by
euler_et using the equivalent transformation algorithm,
described in Pevzner PA, Tang H, Waterman MS,
An Eulerian path approach to DNA fragment assembly.
Proc. Natl. Acad. Sci. USA 2001 Aug 14; 98(17):9748-53.
Download source
MS/MS spectra assembly
The analysis of mass spectrometry data is still largely
based on identification of single MS/MS spectra and does
not attempt to make use of the extra information available in
multiple MS/MS spectra from partially or completely
overlapping peptides. Analysis of MS/MS spectra from multiple
overlapping peptides opens up the possibility of assembling MS/MS
spectra into entire proteins, similarly to the assembly of
overlapping DNA reads into entire genomes. This online software
tool recovers all or parts of the protein sequence through clustering,
pairwise alignment, assembly and de-novo interpretation of the input
MS/MS spectra.
Read more
Spectral Networks for Identification of Proteins and Post-Translational Modifications
Spectral networks are based on a new idea that allows one to perform MS/MS database search . . .
without ever comparing a spectrum against a database. Spectral networks capitalize on spectral pairs -
pairs of spectra obtained from overlapping (often non-tryptic) peptides or from unmodified and modified
versions of the same peptide. While seemingly redundant, spectral pairs open up computational avenues
that were never explored before. Having a spectrum of a modified peptide paired with a spectrum of an
unmodified peptide, allows one to separate the prefix and suffix ladders, to greatly reduce the number of
noise peaks, and to generate a small number of peptide reconstructions that are likely to contain the
correct one. The
MS/MS database search is thus reduced to extremely fast pattern matching (rather than time-consuming
matching of spectra against databases). In addition to speed, this approach provides a new paradigm for
identifying post-translational modifications and higly modified peptides.
Software will be available soon - Click here to register and receive an email as soon as the Spectral Networks package is available.
PepNovo - de novo peptide sequencing tool
Frank, A. and Pevzner, P. "PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling", Analytical Chemistry 77:964-973, 2005.
PepNovo is a software for de novo sequencing of peptides from mass spectra.
PepNovo uses a probabilistic network to model the peptide fragmentation
events in a mass spectrometer. In addition, it uses a likelihood ratio
hypothesis test to determine if the peaks observed in the mass spectrum are
more likely to have been produced under the fragmentation model, than under
a probabilistic model that treats the appearance of peaks as random events.
Run it | Download Win32 Executable | Source
PepNovoDB - A peptide identification tool for data from precision mass spectrometry.
(Availability pending acceptance of submitted manuscript)
Download source
Uniform Projection Motif Finder
Buhler and Tompa (2002) introduced the random projection algorithm for
the motif discovery problem and demonstrated that this
algorithm performs well on both simulated and biological samples.
We describe a modification
of the random projection algorithm, called the uniform projection
algorithm, which utilizes a different choice of projections. We
replace the random selection of projections by a greedy heuristic
that approximately equalizes the coverage of the projections. We
show that this change in selection of projections leads to improved
performance on motif discovery problems. Furthermore, the uniform
projection algorithm is directly applicable to other problems where
the random projection algorithm has been used,
including comparison of protein sequence
databases.
Download source
Comparative genomics motif finding
The recent discovery of the first small modulatory RNA (smRNA) presents the challenge of finding other molecules of similar length and conservation level. Unlike short interfering RNA (siRNA) and micro-RNA (miRNA), effective computational and experimental screening methods are not currently known for this species of RNA molecule, and the discovery of the one known example was partly fortuitous because it happened to be complementary to a well-studied DNA binding motif (the Neuron Restrictive Silencer Element). Existing comparative genomics approaches (e.g., phylogenetic footprinting) rely on alignments of orthologous regions across multiple genomes. This approach, while extremely valuable, is not suitable for finding motifs with highly diverged "non-alignable" flanking regions. In this website, we demonstrate that several unusually long and well conserved motifs can be discovered de novo through a comparative genomics approach that does not require an alignment of orthologous upstream regions.
Go to supporting website
Alu Repeat Subfamily Classification
Alu repeats are the most abundant family of repeats in the human genome,
with over 1 million copies comprising 10% of the genome.
They have been implicated in human genetic disease and in the
enrichment of gene-rich segmental duplications in the human genome,
and form a rich fossil record of primate and human history.
Alu repeat elements are believed to have arisen from the replication
of a small number of source elements, whose evolution over time
gives rise to the 31 Alu subfamilies currently reported in Repbase Update.
We apply a novel method to identify
and statistically validate 213 Alu subfamilies. We build an
evolutionary tree of these subfamilies,
and conclude that the history of Alu evolution is more complex
than previous studies had indicated.
Download source
RepeatScout - de novo identification
of repeat families via extension of consensus seeds
De novo repeat family identification is a challenging problem of
great practical importance. As the number of genome sequencing projects
increases, there is a pressing need to identify the repeat families present
in large, newly sequenced genomes. We develop a new method for de novo
identification of repeat families via extension of consensus seeds; our
method enables a rigorous definition of repeat boundaries, a key issue
in repeat analysis. Our RepeatScout algorithm is more sensitive and
orders of magnitude faster than RECON, the dominant tool for de novo
repeat family identification in newly sequenced genomes. Using
RepeatScout, we estimate that roughly 2% of the human genome and
4% of mouse and rat genomes consist of previously unannotated
repetitive sequence.
Download source code, repeat family libraries, and ISMB slides
ABA: A program for multiple sequence alignment
A-Bruijn Aligner (ABA) is a new method for multiple sequence alignment
that represents an alignment as a directed graph, possibly contains
cycles. This representation permits alignments that contain repeated
and/or shuffled elements.
Download
EULER-AIR: checking base-assignment errors in repeat regions of fragment assembly
Read assignment in repeat regions are often ambiguous, causing errors in
base-assignment in the final assembly. EULER_AIR improves the read-assignment
using an EM-procedure. Taking an assembled genomic sequence and the
associated trace data, EULER_AIR can (i) discover and correct base-assignment
errors; (ii) provide accurate read assignments; (iii) utilize finishing reads
for accurate base assignment; and (iv) provide guidance for designing
finishing experiments.
Read More
|