 |
|
 |
|
Recent Papers
|
Genome Res. 2003 Jan;13(1):37-45.
|
|
Genome rearrangements in mammalian evolution: lessons from human and mouse genomes.
|
Pevzner P, Tesler G.
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093-0114, USA. ppevzner@cs.ucsd.edu
Although analysis of genome rearrangements was pioneered by Dobzhansky and Sturtevant 65 years ago, we still know very little about the rearrangement events that produced the existing varieties of genomic architectures. The genomic sequences of human and mouse provide evidence for a larger number of rearrangements than previously thought and shed some light on previously unknown features of mammalian evolution. In particular, they reveal that a large number of microrearrangements is required to explain the differences in draft human and mouse sequences. Here we describe a new algorithm for constructing synteny blocks, study arrangements of synteny blocks in human and mouse, derive a most parsimonious human-mouse rearrangement scenario, and provide evidence that intrachromosomal rearrangements are more frequent than interchromosomal rearrangements. Our analysis is based on the human-mouse breakpoint graph, which reveals related breakpoints and allows one to find a most parsimonious scenario. Because these graphs provide important insights into rearrangement scenarios, we introduce a new visualization tool that allows one to view breakpoint graphs superimposed with genomic dot-plots.
|
Pac Symp Biocomput. 2003;:29-40.
|
|
Genome-wide analysis of bacterial promoter regions.
|
Eskin E, Keich U, Gelfand MS, Pevzner PA.
Department of Computer Science, Columbia University, New York, NY 10027, USA. eeskin@cs.columbia.edu
Identifying prokaryotic promoter sequences is notoriously difficult and for most sequenced bacterial genomes the promoter sequences are still unknown. Since experimental analysis trails behind sequencing, genome-wide computational promoter discovery is often the only realistic way to discover these sequences in newly sequenced bacterial genomes. However, genome-wide samples for promoter discovery may be very large and corrupted complicating promoter discovery. We discuss three aspects of genome-wide promoter discovery: sample generation, signal finding algorithms, and scoring signals. We applied our new MITRA algorithm to analyze samples of divergent and convergent genes in 20 bacterial genomes and found strong putative dyad signals in 17 out of the 20 genomes. Moreover, in 12 out of 20 genomes the found signals are identical or similar to the known regulatory patterns (Pribnow-Gilbert boxes and CRP binding sites). Since many of putative signals correspond to previously known elements of bacterial transcriptional regulation, the remaining discovered signals are good candidates for unknown regulatory elements.
|
Nature 2002 Dec 5;420(6915):520-62
|
|
Initial sequencing and comparative analysis of the mouse genome.
|
Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Overton-Larty E, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Stange-Thomann N, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Ureta-Vidal A, Vinson JP, Von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES.
[1] Genome Sequencing Center, Washington University School of Medicine, Campus Box 8501, 4444 Forest Park Avenue, St Louis, Missouri 63108, USA; [2] Members of the Mouse Genome Analysis Group.
The sequence of the mouse genome is a key informational tool for understanding the contents of the human genome and a key experimental tool for biomedical research. Here, we report the results of an international collaboration to produce a high-quality draft sequence of the mouse genome. We also present an initial comparative analysis of the mouse and human genomes, describing some of the insights that can be gleaned from the two sequences. We discuss topics including the analysis of the evolutionary forces shaping the size, structure and sequence of the genomes; the conservation of large-scale synteny across most of the genomes; the much lower extent of sequence orthology covering less than half of the genomes; the proportions of the genomes under selection; the number of protein-coding genes; the expansion of gene families related to reproduction and immunity; the evolution of proteins; and the identification of intraspecies polymorphism.
|
Science 2002 Nov 29;298(5599):1747-52
|
|
Corepressor-dependent silencing of chromosomal regions encoding neuronal genes.
|
Lunyak VV, Burgess R, Prefontaine GG, Nelson C, Sze SH, Chenoweth J, Schwartz P, Pevzner PA, Glass C, Mandel G, Rosenfeld MG.
Howard Hughes Medical Institute (HHMI), Department of Computer Science and Engineering, School of Medicine, University of California, San Diego, 9500 Gilman Drive, Room 345, La Jolla, CA 92093-0648, USA.
The molecular mechanisms by which central nervous system-specific genes are expressed only in the nervous system and repressed in other tissues remain a central issue in developmental and regulatory biology. Here, we report that the zinc-finger gene-specific repressor element RE-1 silencing transcription factor/neuronal restricted silencing factor (REST/NRSF) can mediate extraneuronal restriction by imposing either active repression via histone deacetylase recruitment or long-term gene silencing using a distinct functional complex. Silencing of neuronal-specific genes requires the recruitment of an associated corepressor, CoREST, that serves as a functional molecular beacon for the recruitment of molecular machinery that imposes silencing across a chromosomal interval, including transcriptional units that do not themselves contain REST/NRSF response elements.
|
Bioinformatics 2002 Oct;18(10):1382-90
|
|
Subtle motifs: defining the limits of motif finding algorithms.
|
Keich U, Pevzner PA.
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, USA. keich@cs.ucsd.edu
MOTIVATION: What constitutes a subtle motif? Intuitively, it is a motif that is almost indistinguishable, in the statistical sense, from random motifs. This question has important practical consequences: consider, for example, a biologist that is generating a sample of upstream regulatory sequences with the goal of finding a regulatory pattern that is shared by these sequences. If the sequences are too short then one risks losing some of the regulatory patterns that are located further upstream. Conversely, if the sequences are too long, the motif becomes too subtle and one is then likely to encounter random motifs which are at least as significant statistically as the regulatory pattern itself. In practical terms one would like to recognize the sequence length threshold, or the twilight zone, beyond which the motifs are in some sense too subtle. RESULTS: The paper defines the motif twilight zone where every motif finding algorithm would be exposed to random motifs which are as significant as the one which is sought. We also propose an objective tool for evaluating the performance of subtle motif finding algorithms. Finally we apply these tools to evaluate the success of our MULTIPROFILER algorithm to detect subtle motifs.
|
Bioinformatics 2002 Oct;18(10):1374-81
|
|
Finding motifs in the twilight zone.
|
Keich U, Pevzner PA.
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA 92093, USA. keich@cs.ucsd.edu
MOTIVATION: Gene activity is often affected by binding transcription factors to short fragments in DNA sequences called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem. In this paper we design a new motif finding algorithm that can detect very subtle motifs. RESULTS: We introduce the notion of a multiprofile and use it for finding subtle motifs in DNA sequences. Multiprofiles generalize the notion of a profile and allow one to detect subtle patterns that escape detection by the standard profiles. Our MULTIPROFILER algorithm outperforms other leading motif finding algorithms in a number of synthetic models. Moreover, it can be shown that in some previously studied motif models, MULTIPROFILER is capable of pushing the performance envelope to its theoretical limits. AVAILABILITY: http://www-cse.ucsd.edu/groups/bioinformatics/software.html
|
Adv Biochem Eng Biotechnol 2002;77:1-19
|
|
Combinatorial algorithms for design of DNA arrays.
|
Hannenhalli S, Hubell E, Lipshutz R, Pevzner PA.
Department of Mathematics, University of Southern California, Los Angeles 90089-1113, USA.
Optimal design of DNA arrays requires the development of algorithms with two-fold goals: reducing the effects caused by unintended illumination (border length minimization problem) and reducing the complexity of masks (mask decomposition problem). We describe algorithms that reduce the number of rectangles in mask decomposition by 20-30% as compared to a standard array design under the assumption that the arrangement of oligonucleotides on the array is fixed. This algorithm produces provably optimal solution for all studied real instances of array design. We also address the difficult problem of finding an arrangement which minimizes the border length and come up with a new idea of threading that significantly reduces the border length as compared to standard designs.
|
Bioinformatics 2002 Jul;18 Suppl 1:S354-63
|
|
Finding composite regulatory patterns in DNA sequences.
|
Eskin E, Pevzner PA.
Department of Computer Science, Columbia University, New York, 10027 NY Department of Computer Science and Engineering, University of California at San Diego, La Jolla 92093-0114, CA.
Pattern discovery in unaligned DNA sequences is a fundamental problem in computational biology with important applications in finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad patterns that occur near each other. A difficulty in discovering composite patterns is that one or both of the component monad patterns in the group may be 'too weak'. Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. In this paper, we present a MITRA (MIsmatch TRee Algorithm) approach for discovering composite signals. We demonstrate that MITRA performs well for both monad and composite patterns by presenting experiments over biological and synthetic data. Availability: MITRA is available at http://www.cs.columbia.edu/compbio/mitra/ Contact: eeskin@cs.columbia.edu Keywords: regulatory motif finding; pattern finding; dyad motifs.
|
Bioinformatics 2002 Jul;18 Suppl 1:S181-8
|
|
Splicing graphs and EST assembly problem.
|
Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA.
Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, 92093-0114, USA.
Motivation: The traditional approach to annotate alternative splicing is to investigate every splicing variant of the gene in a case-by-case fashion. This approach, while useful, has some serious shortcomings. Recent studies indicate that alternative splicing is more frequent than previously thought and some genes may produce tens of thousands of different transcripts. A list of alternatively spliced variants for such genes would be difficult to build and hard to analyse. Moreover, such a list does not show the relationships between different transcripts and does not show the overall structure of all transcripts. A better approach would be to represent all splicing variants for a given gene in a way that captures the relationships between different splicing variants. Results: We introduce the notion of the splicing graph that is a natural and convenient representation of all splicing variants. The key difference with the existing approaches is that we abandon the linear (sequence) representation of each transcript and replace it with a graph representation where each transcript corresponds to a path in the graph. We further design an algorithm to assemble EST reads into the splicing graph rather than assembling them into each splicing variant in a case-by-case fashion. Availability: http://www-cse.ucsd.edu/groups/bioinformatics/software.html Contact: sheber@ucsd.edu Keywords: EST assembly; splicing graph; alternative splicing
|
Pac Symp Biocomput 2002;:235-46
|
|
Finding weak motifs in DNA sequences.
|
Sze SH, Gelfand MS, Pevzner PA.
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093-0114, USA.
Recognition of regulatory sites in unaligned DNA sequences is an old and well-studied problem in computational molecular biology. Recently, large-scale expression studies and comparative genomics brought this problem into a spotlight by generating a large number of samples with unknown regulatory signals. Here we develop algorithms for recognition of signals in corrupted samples (where only a fraction of sequences contain sites) with biased nucleotide composition. We further benchmark these and other algorithms on several bacterial and archaeal sites in a setting specifically designed to imitate the situations arising in comparative genomics studies.
|
Pac Symp Biocomput 2002;:199-210
|
|
EULER-PCR: finishing experiments for repeat resolution.
|
Mulyukov Z, Pevzner PA.
Department of Computer Science and Engineering, University of California, San Diego, CA 92093, USA.
Genomic sequencing typically generates a large collection of unordered contigs or scaffolds. Contig ordering (also known as gap closure) is a non-trivial algorithmic and experimental problem since even relatively simple-to-assemble bacterial genomes typically result in large set of contigs. Neighboring contigs maybe separated either by gaps in read coverage or by repeats. In the later case we say that the contigs are separated by pseudogaps, and we emphasize the important difference between gap closure and pseudogap closure. The existing gap closure approaches do not distinguish between gaps and pseudogaps and treat them in the same way. We describe a new fast strategy for closing pseudogaps (repeat resolution). Since in highly repetitive genomes, the number of pseudogaps may exceed the number of gaps by an order of magnitude, this approach provides a significant advantage over the existing gap closure methods.
|
Genome Res 2002 Jan;12(1):26-36
|
|
Genome-scale evolution: reconstructing gene orders in the ancestral species.
|
Bourque G, Pevzner PA.
Department of Mathematics, University of Southern California, California 90089, USA. gbourque@usc.edu
Recent progress in genome-scale sequencing and comparative mapping raises new challenges in studies of genome rearrangements. Although the pairwise genome rearrangement problem is well-studied, algorithms for reconstructing rearrangement scenarios for multiple species are in great need. The previous approaches to multiple genome rearrangement problem were largely based on the breakpoint distance rather than on a more biologically accurate rearrangement (reversal) distance. Another shortcoming of the existing software tools is their inability to analyze rearrangements (inversions, translocations, fusions, and fissions) of multichromosomal genomes. This paper proposes a new multiple genome rearrangement algorithm that is based on the rearrangement (rather than breakpoint) distance and that is applicable to both unichromosomal and multichromosomal genomes. We further apply this algorithm for genome-scale phylogenetic tree reconstruction and deriving ancestral gene orders. In particular, our analysis suggests a new improved rearrangement scenario for a very difficult Campanulaceae cpDNA dataset and a putative rearrangement scenario for human, mouse and cat genomes.
|
Genome Res 2001 Sep;11(9):1461-2
|
|
Assembling puzzles from preassembled blocks.
|
Pevzner PA.
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, California 92093-0114, USA. ppevzner@cs.ucsd.edu
|
Proc Natl Acad Sci U S A 2001 Aug 14;98(17):9748-53
|
|
An Eulerian path approach to DNA fragment assembly.
|
An Eulerian path approach to DNA fragment assembly.
Pevzner PA, Tang H, Waterman MS.
Department of Computer Science and Engineering, University of California, San Diego, La Jolla, USA.
For the last 20 years, fragment assembly in DNA sequencing followed the "overlap-layout-consensus" paradigm that is used in all currently available assembly tools. Although this approach proved useful in assembling clones, it faces difficulties in genomic shotgun assembly. We abandon the classical "overlap-layout-consensus" approach in favor of a new euler algorithm that, for the first time, resolves the 20-year-old "repeat problem" in fragment assembly. Our main result is the reduction of the fragment assembly to a variation of the classical Eulerian path problem that allows one to generate accurate solutions of large-scale sequencing problems. euler, in contrast to the celera assembler, does not mask such repeats but uses them instead as a powerful fragment assembly tool.
|
Bioinformatics 2001;17 Suppl 1:S225-33
|
|
Fragment assembly with double-barreled data.
|
Pevzner PA, Tang H.
Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA. ppevzner@ucsd.edu
For the last twenty years fragment assembly was dominated by the "overlap - layout - consensus" algorithms that are used in all currently available assembly tools. However, the limits of these algorithms are being tested in the era of genomic sequencing and it is not clear whether they are the best choice for large-scale assemblies. Although the "overlap - layout - consensus" approach proved to be useful in assembling clones, it faces difficulties in genomic assemblies: the existing algorithms make assembly errors even in bacterial genomes. We abandoned the "overlap - layout - consensus" approach in favour of a new Eulerian Superpath approach that outperforms the existing algorithms for genomic fragment assembly (Pevzner et al. 2001 InProceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB-01), 256-26). In this paper we describe our new EULER-DB algorithm that, similarly to the Celera assembler takes advantage of clone-end sequencing by using the double-barreled data. However, in contrast to the Celera assembler, EULER-DB does not mask repeats but uses them instead as a powerful tool for contig ordering. We also describe a new approach for the Copy Number Problem: "How many times a given repeat is present in the genome?". For long nearly-perfect repeats this question is notoriously difficult and some copies of such repeats may be "lost" in genomic assemblies. We describe our EULER-CN algorithm for the Copy Number Problem that proved to be successful in difficult sequencing projects.
|
J Comput Biol 2000;7(6):777-87
|
|
Mutation-tolerant protein identification by mass spectrometry.
|
Pevzner PA, Dancik V, Tang CL.
Departments of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093, USA.
Database search in tandem mass spectrometry is a powerful tool for protein identification. High-throughput spectral acquisition raises the problem of dealing with genetic variation and peptide modifications within a population of related proteins. A method that cross-correlates and clusters related spectra in large collections of uncharacterized spectra (i.e., from normal and diseased individuals) would be very valuable in functional proteomics. This problem is far from being simple since very similar peptides may have very different spectra. We introduce a new notion of spectral similarity that allows one to identify related spectra even if the corresponding peptides have multiple modifications/mutations. Based on this notion, we developed a new algorithm for mutation-tolerant database search as well as a method for cross-correlating related uncharacterized spectra.
|
Bioinformatics 2001 Apr;17(4):327-37
|
|
A new approach to sequence comparison: normalized sequence alignment.
|
Arslan AN, Egecioglu O, Pevzner PA.
Department of Computer Science, University of California, Santa Barbara, Santa Barbara, CA 93106, USA Department of Computer Science and Engineering, University of California, San Diego, San Diego, CA 92093, USA.
The Smith-Waterman algorithm for local sequence alignment is one of the most important techniques in computational molecular biology. This ingenious dynamic programming approach was designed to reveal the highly conserved fragments by discarding poorly conserved initial and terminal segments. However, the existing notion of local similarity has a serious flaw: it does not discard poorly conserved intermediate segments. The Smith-Waterman algorithm finds the local alignment with maximal score but it is unable to find local alignment with maximum degree of similarity (e.g. maximal percent of matches). Moreover, there is still no efficient algorithm that answers the following natural question: do two sequences share a (sufficiently long) fragment with more than 70% of similarity? As a result, the local alignment sometimes produces a mosaic of well-conserved fragments artificially connected by poorly-conserved or even unrelated fragments. This may lead to problems in comparison of long genomic sequences and comparative gene prediction as recently pointed out by Zhang et al. (Bioinformatics, 15, 1012-1019, 1999). In this paper we propose a new sequence comparison algorithm (normalized local alignment ) that reports the regions with maximum degree of similarity. The algorithm is based on fractional programming and its running time is O(n2log n). In practice, normalized local alignment is only 3-5 times slower than the standard Smith-Waterman algorithm.
|
Genome Res 2001 Feb;11(2):290-9
|
|
Efficiency of database search for identification of mutated and modified proteins via mass spectrometry.
|
Pevzner PA, Mulyukov Z, Dancik V, Tang CL.
Department of Computer Science and Engineering, University of California, San Diego, California 92093, USA. ppevzner@hto.usc.edu
Although protein identification by matching tandem mass spectra (MS/MS) against protein databases is a widespread tool in mass spectrometry, the question about reliability of such searches remains open. Absence of rigorous significance scores in MS/MS database search makes it difficult to discard random database hits and may lead to erroneous protein identification, particularly in the case of mutated or post-translationally modified peptides. This problem is especially important for high-throughput MS/MS projects when the possibility of expert analysis is limited. Thus, algorithms that sort out reliable database hits from unreliable ones and identify mutated and modified peptides are sought. Most MS/MS database search algorithms rely on variations of the Shared Peaks Count approach that scores pairs of spectra by the peaks (masses) they have in common. Although this approach proved to be useful, it has a high error rate in identification of mutated and modified peptides. We describe new MS/MS database search tools, MS-CONVOLUTION and MS-ALIGNMENT, which implement the spectral convolution and spectral alignment approaches to peptide identification. We further analyze these approaches to identification of modified peptides and demonstrate their advantages over the Shared Peaks Count. We also use the spectral alignment approach as a filter in a new database search algorithm that reliably identifies peptides differing by up to two mutations/modifications from a peptide in a database.
|
Proc Int Conf Intell Syst Mol Biol 2000;8:269-78
|
|
Combinatorial approaches to finding subtle signals in DNA sequences.
|
Pevzner PA, Sze SH.
Department of Mathematics, University of Southern California, Los Angeles 90089-1113, USA. ppevzner@hto.usc.edu
Signal finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology with important applications in locating regulatory sites and drug target identification. Despite many studies, this problem is far from being resolved: most signals in DNA sequences are so complicated that we don't yet have good models or reliable algorithms for their recognition. We complement existing statistical and machine learning approaches to this problem by a combinatorial approach that proved to be successful in identifying very subtle signals.
|
|
|
|
 |