# Past Seminars

## Linking Gene Expression and Regulation by Correspondence Analysis

Speaker: Martin Vingron, Max Planck Institute, Berlin
Date: Thursday, August 7, 2003
Time: 4:00pm-5:00pm
Place: APM 4301
Host: Pavel Pevzner

ABSTRACT
This talk will introduce correspondence analysis as a tool for visualizing associations between genes and conditions in DNA-micro array data. The same technique will also be applied for establishing associations between gene expression data and transcription factor binding sites. While for yeast this can be done based on published transcription factor binding data, for human data we draw on a comparative analysis with mouse data in search for binding sites.

Short Bio: Martin Vingron studied mathematics in Vienna and did his PhD at the EMBL Biocomputing Programme. After postdoctoral work first at the University of Southern California and then at the German National Research Center for Computer Science he became Head of the Theoretical Bioinformatics Division at the German Cancer Research Center in Heidelberg. In 2000 he moved to Berlin to take the position of Director at the Max Planck Institute for Molcular Genetics, heading the Computational Molecular Biology Departement there. He holds a Honorary Professorship at Berlin Free University.
His main research interests lie in sequence analysis and comparison, in the analysis of gene expression data, and in the study of patterns involved in transcriptional regulation.

## Algorithmic Tools for Breaking the Cis-Regulation Code

Speaker: Richard Karp
Date: Monday, May 12, 2003
Time: 4:00pm-5:00pm
Place: APM 4301
Host: Pavel Pevzner

ABSTRACT
Although much computational effort has been devoted to identifying the genes within a genomic sequence, there has been less effort devoted to identifying the regulatory sequence elements that are recognized and bound by transcription factors that control gene transcription. These sequence motifs make up an immensely rich codebook of the gene regulation program, known as the cis-regulatory system. Several recently developed algorithms for finding such motifs use statistical models that take into account specific structural features of transcription factors and their binding sites. We describe one such algorithm, due to Xing, Jordan, Wu and the speaker. Recent findings suggest a modular organization in which sets of trancription factors, binding to sites in a contiguous region of the genome, cooperate in regulating gene expression. We describe algorithms by Sharan,Ovcharenko, Ben-Hur and the speaker for identifying such regulatory modules.

Speaker:  Dr. William Murphy
Date:  January 13, 2003
Time: 4:00pm - 5:00pm
Place: APM Building 4301
Host:  Professor Pavel Pevzner

ABSTRACT
Comparative genomic analysis of mammals offers the potential to interpret the evolutionary dynamics of gene organization within chromosomes, to determine the mechanisms of chromosomal rearrangements, and to reveal the forces governing the conservation of synteny. Major improvements have been made towards understanding the evolutionary relationships and timescale of placental mammal diversification, analyzing large concatenations of nuclear gene segments with maximum likelihood and Bayesian phylogenetic methods. These results emphasize the need for increased large scale genomic analysis from many mammalian lineages to better understand and annotate the human genome. Examining available whole genome maps from representatives of different orders of placental mammals I will discuss evolutionary properties on a chromosomal and whole genome scale, including results based upon multiple genome rearrangement algorithms developed by Pavel Pevzner & colleagues. Using these data we are closing in on determining a probable ancestral genome of placental mammals.

Dr. Murphy received his doctoral degree from the University of Tulsa, where he studied molecular phylogenetics and evolution of tropical killifishes. He entered the Laboratory of Genomic Diversity (LGD) at the National Cancer Institute in Frederick, Maryland as a post-doctoral fellow where he worked on domestic cat comparative genomics under Steve O�Brien. He is currently a research scientist in the LGD working on mammalian phylogenetics and comparative genomics.

## Engineering a Scalable Placement Heuristic for DNA Probe Arrays

Speaker:Dr. Andrew Kahng
Date:December 13, 2002
Time:2pm - 3pm
Place:APM 4301
Host:Pavel Pevzner

ABSTRACT
Design of DNA arrays for very large-scale immobilized polymer synthesis (VLSIPS) seeks to minimize effects of unintended illumination during mask exposure steps. Hannenhalli et al. and Kahng et al. (WABI-2002) formulate this requirement as the Border Minimization Problem and give methods for placement (at array sites) and embedding (in the mask sequence) of probes in both synchronous and asynchronous regimes. These previous methods do not address several practical details of the application and, more critically, are not scalable to the O(10^8) probes contemplated for next-generation probe arrays. In this work, we make two main contributions. (1) We give improved dynamic programming algorithms that perform probe embedding to minimize border conflicts while accounting for distance- and position-dependent border conflict weights, as well as the presence of SNPs in the instance. (2) We describe and experimentally validate the "engineering" of a scalable, high-quality asynchronous placement heuristic (which is moreover embarrassingly parallelizable) for DNA array design. In the process, we draw a number of useful parallels from the 40-year history of placement heuristics in VLSI CAD. Experimental results confirm that our approach enjoys linear-time scaling while maintaining solution quality that matches the best previous methods. This is a joint work with I. I. Mandoiu, P. A. Pevzner, S. Reda and A. Z. Zelikovsky

Short Bio: Andrew B. Kahng is Professor of CSE and ECE at UC San Diego. He joined UCSD in January 2001, after nearly 12 years on the UCLA computer science faculty. His research focuses on physical design and performance analysis of very large-scale integrated circuits (VLSI), as well as the VLSI design-manufacturing interface.

## Perfect Phylogeny Haplotyping

Speaker: Professor Dan Gusfield, Department of Computer Science University of California, Davis
Date: Wednesday, November 27, 2002
Time: 2:15pm - 3:30pm
Place: AP&M Building, Room 4301
Hosts:  Pavel Pevzner and Uri Keich

ABSTRACT
The next high-priority phase of human genomics will involve the development of a full Haplotype Map of the human genome. It will be used in large-scale screens of populations to associate specific haplotypes with specific complex genetic-influenced diseases. However, most studies will collect genotype data rather than haplotype data, requiring the deduction of haplotype information from genotype data. Input to the haplotyping problem is a set of N genotypes, and output is N haplotype pairs that "explain" the genotypes. Several computational approaches to this problem have been developed and used. Most of these use a statistical framework, such as MLE, or statistical computing methods such as EM, Gibbs sampling etc.

We have developed several distinct combinatorial approaches to the haplotyping problem. I will talk about my most recent approach, based on viewing the haplotyping problem in the context of perfect phylogeny. The biological motivation for this is the surprising fact that genomic DNA can be partitioned into long blocks where genetic recombination has been rare, leading to strikingly fewer distinct haplotypes in the population than previously expected. This, along with assumption of infinite sites implies that any "permitted" solution to the haplotyping problem should fit a perfect phylogeny, This is a severe combinatorial constraint on the permitted solutions to the haplotyping problem, and it leads to efficient deterministic algorithms (I will talk about two of them) to deduce all features of the permitted haplotype solution(s) that can be known with certainty. We obtain a) an efficient algorithm to find (if one exists) one permitted solution to the haplotype problem; b) a simple test to determine if it is the unique permitted solution; c) an efficient way to count the number of permitted solutions; d) and an efficient way to implicitly represent the set of all permitted solutions so that each can be efficiently created. As a by-product, we prove that the number of permitted solutions is bounded by 2^k, where k is less than or equal to the number of sites. This is a huge reduction from the number of solutions if we do not require that a solution fit a perfect phylogeny. This implicit representation also allows simple solutions to secondary questions about the genotypes. For example, in the case that the tree is not unique, the representation explicitly reveals which phase relationships (if determined) would reduce the number of possible trees, and gives a conceptual solution to the problem of minimizing the number of haplotype pairs needed to be laboratory-determined in order to deduce the correct tree. We also observe a strong phase-transition (no pun intended) in the probability that a given number of genotypes will be sufficient to uniquely determine the tree. Parts of this work are joint with different collaborators, including R.H. Chung, V. Bafna, G. Lancia, and S. Yooseph

## Progress on Reconstructing Very Large Evolutionary Trees

Speaker: Dr. Tandy Warnow, University of Texas in Austin
Date: Friday, October 4, 2002
Time: 2:00pm - 3:00pm
Place: AP&M Building, Room 4301
Host: Professor Pavel Pevzner

ABSTRACT
Phylogenetic trees, also known as evolutionary trees, model the evolution of biological species or genes from a common ancestor. Reconstructing evolutionary trees is a fundamental research problem in biology, with applications to protein structure and function prediction, pathway detection, sequence alignment, drug design, etc. However, the reconstruction of very large evolutionary trees is exceedingly difficult; not only are the major optimization problems NP-hard, but large real datasets of interest to the biological community can take years of analysis without being solved exactly. Furthermore, while polynomial time methods for phylogeny reconstruction exist, our research shows that the standard approaches (including the popular Neighbor Joining method) may not perform well on large datasets. In this talk I will present new methods which are more suitable to large-scale phylogeny reconstruction. I will describe new dataset decomposition techniques, called Disk-Covering, which can be used in conjunction with any phylogenetic method. We can prove theoretical performance guarantees, with respect to the sequence length required for accuracy, and we can demonstrate improvements in performance with respect to the running time of algorithms for NP-hard optimization problems, such as maximum parsimony and maximum likelihood. If time permits, I will also present results on our research into inferring evolutionary trees from whole genomes.

Short Bio: Tandy Warnow received her PhD in Mathematics in 1991 at UC Berkeley, under the supervision of Eugene Lawler. She spent two years (first as a postdoc for Michael Waterman (USC, Mathematics) and then at Sandia National Laboratories in New Mexico) before joining the faculty at the Uhiversity of Pennsylvania. In 1999 she joined the faculty at the University of Texas at Austin, where she is now an Associate Professor of Computer Sciences, and Co-Director (with David Hillis) for the Center for Computational Biology and Bioinformatics. Tandy received the National Science Foundation Young Investigator Award in 1994, and the David and Lucille Packard Foundation Award in Science and Engineering in 1996.

## Topological Proteomics

Speaker: Dr. Andreas Dress, University of Bielefeld, Germany
Date: Tuesday, September 17, 2002
Time: 4:00pm - 5:00pm
Place: AP&M Building, Room 4301
Host: Pavel Pevzner/Uri Keich

ABSTRACT
"Topological Proteomics" is a new direction of protein-interaction research that has become feasible due to new techniques developed in fluorescence microscopy called "Multi-Epitop Ligand Cartography" or "MELK" for short (see Schubert, W.: Topological Proteomics, Toponomics, MELK-Technology, in M.Hecker & S.Muellner (eds.): Proteomics of Microorganismus: Fundamental Aspects and Application, Springer Series on "Advances in Biochemical Engeneering/Biotechnology", Springer Verlag, Berlin - Heidelberg - New York, in press, and W.Schubert: Polymyositis, Topological Proteomics Technology and Paradigm for Cell Invasion Dynamics. J Theoret Med: Vol. 4 (1), pp 75-84. 2002).

Given a family of proteins (up to 30, say, at present), this technique allows to measure the (relative) concentration p(j) of protein p at locality j (indexed e.g. by pixel coordinates) within a given piece of tissue or blood sample, resulting in as many images with luminosity values p(j) between, say, 0 and 255 as we have proteins under consideration.

One expects that biologically relevant interactions between those proteins will show up in surprisingly large (or small) numbers of localities for which these concentrations exceed some appropriately chosen threshold values simultaneously. From the detection of such linkages between proteins directly at the cellular site of their biological (inter)action, one has been able to identify those proteins (among the given collection of proteins) that were linked to disease. Direct functional linkage analysis in tissue (or blood cells) based on such data is therefore expected to be superior to searching for linkages in "large scale proteomics" expression profiles that are all based upon extraction of proteins from cells and thus upon the destruction of the topological code that encodes the hierarchy of the proteins under consideration.

In the lecture, an illustrative example will be presented, and some software tools and mathematical techniques and problems relating to the task of analyzing MELK data will be discussed.

Short Bio: Born in Berlin/Germany one year before the second world war began into a family that had four its members being murdered in Nazi prisons by the Gestapo in April 1945, survived until today using all sorts of disguises like geometer(1960-63), number theorist (1964-1966), topologist (1967-1969), algebraist (1970-1979), combinatorialist (1980-2000) and, presently, pseudo-biologist.

## Inferring Regulation from Gene Expression with Probabilistic Graphical Models

Speaker: Dr. Nir Friedman, Hebrew University, Jerusalem
Date: Wednesday, August 28, 2002
Time: 3:00pm - 4:00pm
Place: AP&M Building, Room 4301
Host: Professor Pavel Pevzner

ABSTRACT
In recent years, biological and medical research is undergoing a major revolution. High-throughput methods allow to measure and record large amounts of data. In particular, microarray-based hybridization methods techniques allow to simultaneously measure the expression level of thousands of genes. It is clear that such measurements contain information about many different aspects of gene regulation and function. The computational challenge is to extract new biological understanding from this wealth of data. In particular, to gain insight on the "structure" of the interactions between components in the systems, such as genes, proteins, and external signals.

In this talk, I will describe an ongoing project to address this question using probabilistic graphical models. These models represent the measured expression level of each gene as a random variable and each regulatory interaction as a probabilistic dependency between these variables. In this project, we are developing tools for automatic induction of networks from gene expression data, sequence data, and additional high-throughput measurements. We demonstrate the utility of this approach in analysis of several yeast expression data sets. In particular, our method manages to reconstruct fragments of known pathways from such expression profiles, to provide better understanding of combinatorial regulation in complex processes, and to uncover modules of genes sharing a common "program".

Short Bio: Nir Friedman received a B.Sc. in mathematics and computer science from Tel-Aviv University in 1987, a M.Sc. in mathematics and computer science from the Weizmann Institute of Science in 1992, and a Ph.D. in computer science from Stanford in 1997. From 1996 until 1998 , he was a postdoctoral scholar in the computer science division of the University of California at Berkeley. Since the fall of 1998 he is a faculty member at the Computer Science & Engineering School, The Hebrew University of Jerusalem.

Dr. Friedman's major research interests involve learning and inference with probabilistic models and their applications to computational biology. In recent years, he has been extensively working on learning and inference with probabilistic representations of uncertainty. This work mainly focuses on using Bayesian networks for concept learning, data mining, and reinforcement learning. Since he joined the Hebrew University in 1998, Dr. Friedman's research has focused on application of these tools to computational molecular biology. In particular, his group has examined analysis of gene expression data for cancer classification, discovery of novel genes, pathway reconstruction, and understanding of regulatory circuits.

## Identifying Transcription Factor Binding Sites through Markov Chain Optimization

Speaker:Tao Jiang, Department of Computer Science, UC Riverside
Date: Wednesday, May 8, 2002
Time: 1:30 pm
Place: AP&M Building/Room 4301
Host: Uri Keich

ABSTRACT
Even though every cell in an organism contains the same genetic material, each cell does not express the same cohort of genes. Therefore, one of the major problems facing genomic research today is to determine not only which genes are differentially expressed and under what conditions, but also how the expression of those genes is regulated. The first step in determining differential gene expression is the binding of sequence-specific DNA binding proteins ({\it i.e.} {\em transcription factors}) to regulatory regions of the genes ({\it i.e.} {\em promoters} and {\em enhancers}). An important aspect to understanding how a given transcription factor functions is to know the entire gamut of binding sites and subsequently potential target genes that the factor may bind/regulate. In this study, we have developed a computer algorithm to scan genomic databases for transcription factor binding sites, based on a novel Markov chain optimization method, and used it to scan the human genome for sites that bind to {\em hepatocyte nuclear factor 4} $\alpha$ (HNF4$\alpha$). A list of $71$ known HNF4$\alpha$ binding sites from the literature were used to train our Markov chain model. By looking at the window of $600$ nucleotides around the transcription start site of each confirmed gene on the human genome, we identified $849$ sites with varying binding potential and experimentally tested $109$ of those sites for binding to HNF4$\alpha$. Our results show that the program was very successful in identifying $77$ new HNF4$\alpha$ binding sites with varying binding affinities ({\it i.e.} a $71$\% success rate). Therefore, this computational method for searching genomic databases for potential transcription factor binding sites is a powerful tool for investigating mechanisms of differential gene regulation. This is joint work with Kyle Ellrott, Chuhu Yang, and Frances Sladek.

Short Bio: Tao Jiang received a B.S. in Computer Science and Technology from University of Science and Technology of China, Hefei, P.R.China in July 1984 and a Ph.D., in Computer Science, from University of Minnesota in Nov. 1988. He was a faculty member at McMaster University, Hamilton, Ontario, Canada during Jan. 1989 - July 2001. He joined University of California - Riverside as a Professor of Computer Science in Sept. 1999. Tao Jiang's recent research interest includes algorithms, computational molecular biology, computational complexity, and computational aspects of information gathering and retrieval. He has published actively in theoretical computer science and computational biology journals and conferences, and served on program committees for many international conferences as well as review panels for several funding agencies. He is presently serving on the editorial boards of International Journal of Foundations of Computer Science, Journal of Combinatorial Optimization, and Journal of Computer Science and Technology.

## Fast and Accurate Phylogeny Reconstruction Algorithms Based On the Minimum-Evolution Principle

Speaker:Richard Desper, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD, 20892, USA. email: desper@ncbi.nlm.nih.gov
Date: Tuesday, May 7, 2002
Time: 3:30 pm
Place: AP&M Building/Room 4301
Host: Steffen Heber

ABSTRACT
The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be statistically consistent when it is used in conjunction with ordinary least-squares (OLS) fitting of a metric to a tree structure. The traditional approach to using ME has been to start with the Neighbor Joining (NJ) topology for a given matrix, and then do a topological search from that starting point. The first stage requires O(n^3) time, where n is the number of taxa, while the current implementations of the second are in O(pn^3) or more, where p is the number of swaps performed by the program. In this talk, I'll describe a greedy approach to Minimum Evolution which produces a starting topology in O(n^2) time. Also, I'll demonstrate an algorithm that searches for the best topology using nearest neighbor interchanges (NNIs), where the cost of doing p NNIs is O(n^2+pn), i.e. O(n^2) in practice because p is always much smaller than n. The Greedy Minimum Evolution (GME) algorithm, when used in combination with NNIs, produces trees which are fairly close to NJ trees in terms of topological accuracy. Our work also examines ME under a balanced weighting scheme, where sibling subtrees have equal weight, as opposed to the standard "unweighted" OLS, where all taxa have the same weight so that the weight of a subtree is equal to the number of its taxa. The balanced minimum evolution scheme (BME) runs slower than the OLS version, requiring O(n^2 * diam(T)) operations to build the starting tree and $O(pn* diam(T))$ to perform the NNIs, where diam(T) is the topological diameter of the output tree. In the usual Yule-Harding distribution on phylogenetic trees, the diameter expectation is in log(n), so our algorithms are in practice faster that NJ. Furthermore, this BME scheme yields a very significant improvement over NJ and other distance-based algorithms, especially with large trees, in terms of topological accuracy. I will present simulation results which demonstrate that the BME scheme produces output trees which are not only more accurate than those of NJ, but also than slower algorithms such as BioNJ and Weighbor.
This is joint work with: Olivier Gascuel D\'epartement Informatique Fondamentale et Applications, LIRMM, 161 rue Ada, 34392 Montpellier, France Email: gascuel@lirmm.fr

## Sparse Sequence Modeling Applied to Computational Biology and Intrusion Detection

Speaker:Eleazar Eskin
Date: Wednesday, April 17, 2002
Time: 11am - 12pm
Place: AP&M Building/Room 4301
Host: Pavel Pevzner

ABSTRACT
How do we determine what a protein does? What turns genes on and off? Does this system log correspond to a malicious intruder? On the surface, the answers to these three questions seem to have little in common. In fact, these three problems: classifying proteins into families, finding regulatory signals in DNA sequences, and detecting anomalies in audit streams for intrusion detection, can all be characterized as "sparse" sequence modeling problems and addressed in a common framework. In "sparse" sequence modeling problems, only a fraction of the elements of the sequence have relevance to the problem. This is the case in many practical applications, such as the search for regulatory patterns in DNA sequences, where the target of the search is a tiny fraction of the sequence. Similarly, in intrusion detection, typically the evidence that an audit stream from a system contains an attack is often buried in a vast amount of irrelevant information. Furthermore, modeling sparse sequences often requires allowing "softer" matches between a sequence and a canonical model such as allowing for mismatches. A DNA signal, for example "TATAAT", can often occur with several mismatches in any position, in this case as "TAAAAT" or "TATCAT". This presents some serious computational difficulties since there are an exponential number of models which can match a given sequence.

We present a new efficient framework for approaching sparse sequence modeling problems. The framework is grounded in theory and is applied by well engineered implementations to important practical applications. Specifically, we demonstrate this framework on three problems, classification of amino acid sequences into protein families, outlier detection over sequences of system calls for intrusion detection, and signal finding for discovering transcription factor binding sites in gene promoter regions. This framework employs efficient data structures which index the sequences to allow iterating over all possible sparse models of the sequence. We modify several learning algorithms including Boosting, Support Vector Machines, and a new set of outlier detection algorithms to take advantage of this iteration through possible models in our framework. While still considering as rich a set of models as the naive approaches we can avoid the intractable computational difficulties.

The signal finding method is available for public use as a webserver and the intrusion detection methods are currently being transferred into a live deployment.

## Fold recognition - old challenges, new ideas

Speaker: Adam Godzik, Bioinformatics Program, The Burnham Institute and Joint Center for Structural Genomics, UCSD
Date: April 9, Tuesday
Time: 4:00 pm - 5:15 pm
Place: AP&M Building - Room 4301
Host: Pavel Pevzner

ABSTRACT
One of the most interesting problems in sequence analysis is that many proteins with no apparent sequence similarity nevertheless display significant similarity in their structure and function. Are they homologous? Can we make any statements about their evolutionary relations when we are not able to discover any relation between their sequences? Or perhaps we can redefine sequence similarity so it would encompass such unusual pairs.

I present an overview and specific examples of two classes of approaches designed to do the latter. In one, the sequence is redefined as a vector of distributions (profile) and various possibilities of defining and then comparing such vectors lead to different super sensitive algorithms for sequence comparison. The second class of approaches attempts to describe a sequence-structure compatibility function and searches for sequences that would fit into a given structure based on energy criteria.

Despite somewhat shaky theoretical foundations of both groups of algorithms, they become extremely useful in predicting function of newly sequenced proteins and are now used extensively in genome analysis. I present several examples of successes and failures of distant homology recognition.

## Finding Motifs in the Twilight Zone

Speaker:Dr. Uri Keich, Departmment of Computer Science and Engineering, UCSD
Date: Wednesday, April 3, 2002
Time: 1:30 pm
Place: AP&M Building/Room 4301
Host: Pavel Pevzner

ABSTRACT
Gene activity is often affected by binding transcription factors to short fragments in DNA sequences called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem.

In our talk we will briefly review the computational problem of motif finding and introduce a new motif finding algorithm that can detect very subtle motifs: relying on a generalization of positional profiles to multiprofiles, our MULTIPROFILER algorithm outperforms other leading motif finding algorithms in a number of synthetic models. Moreover, it can be shown that in some previously studied motif models, MULTIPROFILER is capable of pushing the performance envelope to its theoretical limits.

This is joint work with Pavel A. Pevzner.

## Combinatorial Approaches to Finding Motifs in DNA Sequences

Speaker:Dr. Sing-Hoi Sze, Departmment of Computer Science and Engineering, UCSD
Date: Tuesday, March, 26
Time: 3:00 pm
Place: AP&M Building/Room 4301
Host: Pavel Pevzner

ABSTRACT
Motif finding (pattern discovery in unaligned DNA sequences) is a fundamental problem in both computer science and molecular biology with important applications in locating regulatory sites and drug target identification. We describe two combinatorial approaches WINNOWER and SP-STAR which prove to give better performance than some of the most popular signal finding algorithms, including CONSENSUS, GibbsDNA and MEME, when applied to simulated samples with uniform background distribution. We benchmark these and other algorithms on several bacterial and archaeal sites in a setting specifically designed to imitate the situations arising in comparative genomics studies. We then describe a few collaborative projects with biologists to apply these motif finding algorithms to real biological problems. Finally, we describe an ongoing effort to extend SP-STAR to model dyad motifs which consist of two parts with varying distances between the parts.

## Analysis of Cancer Tissue by Complex DNA and Antibody Microarrays

Speaker: Dr. Jörg Hoheisel, Functional Genome Analysis, Deutsches Krebsforschungszentrum (DKFZ), Im Neuenheimer Feld 506, D-69120 Heidelberg, Germany.
Date: Wednesday, March 20
Time: 3:00 pm
Place: San Diego Supercomputer Center (SDSC)
Host: Pavel Pevzner

ABSTRACT
The Division of Functional Genome Analysis at the DKFZ is involved in the development of technologies for the analysis of large genomic areas to entire genomes with respect to the encoded functions and their regulation. Based on technical advances, various functional aspects are being analysed. One emphasis is work on DNA-, protein- and peptide-microarrays. Apart from addressing chemical and biophysical issues, the resulting methods are immediately put to the test in relevant, biologically driven projects on various organisms. Concerning analysis of human material, systems are being developed toward early diagnosis, prognosis and evaluation of the success of disease treatment with accentuation on cancer. Beside a variety of other applications, comparative studies on transcription levels, actual protein expression and epigenetic variation are under way.
URL: http://www.dkfz-heidelberg.de/funct_genome/

## Protein Sequence-Structure Space - New Insights from Combinatorial Extension (CE) Structure Alignment Algorithm

Speaker: Dr. Ilya Shindyalov San Diego Supercomputer Center, University of California, San Diego
Date: March 12, Tuesday
Time: 4:00 pm - 5:15 pm
Place: AP&M Building - Room 4301
Host: Pavel Pevzner

ABSTRACT
A study of sequence-structure space has been performed using the Combinatorial Extension (CE) algorithm for determining structural alignment and BLAST for determining sequence similarity. Significant clusters in sequence-structure space associated with recurrent structures (convergent evolution) and protein superfamilies (divergent evolution) have been described. These observations have been compared to the scop classification of protein domains that defines similar features. Various representations of the PDB as sequence and/or structure non-redundant set of protein chains have been defined. Also comparison between CE and Dali algorithms of the structural alignments is given. Principles for multiple structure alignment and a heuristical algorithm for multiple structure alignment using Monte Carlo approach is described.

## Bioinformatics Classification: Protein Function, Gene Expression and Cancer Subtyping

Speaker: Professor William Stafford Noble, Department of Computer Science, Columbia University
Date: Tuesday, February 19
Time: 4:00 pm - 5:15 pm
Place: AP&M Building - Room 4301
Host: Pavel Pevzner

ABSTRACT
The Computational Biology Laboratory at Columbia University operates at the interface of machine learning, bioinformatics and molecular biology. In this talk, I will describe three ongoing projects in our group.

The first project uses the support vector machine (SVM) learning algorithm to classify proteins by remote homology. A protein amino acid chain is converted into a fixed-length vector by computing a series of p-values from Smith-Waterman pairwise sequence comparisons. The resulting classifier recognizes remote protein homologs from the SCOP hierarchy more accurately than PSI-BLAST, SAM-T98 and the SVM-Fisher method.

The second project addresses a common question in the analysis of DNA microarray data: which classes of genes are most strongly affected by a given set of experimental manipulations? This question falls between unsupervised and supervised analyses, also known as class discovery and class prediction. We introduce three metrics for selecting interesting gene classes from a given library of classes, and we demonstrate their utility on gene expression data sets from yeast, mouse brain and acute leukemias.

In the final part of the talk, I will present preliminary results on the classification and diagnostic prediction of adult soft tissue sarcoma (STS) by gene expression profiling and SVM analysis. STS's are typically classified by phenotypic and genotypic criteria into more than 50 diagnostic categories. SVM analysis allows us to validate and refine these categories, as well as to define robust formal diagnostic classifiers. This project is a collaboration with scientists at the Memorial Sloan-Kettering Cancer Center.

Short Bio: William Stafford Noble (formerly William Noble Grundy) received the Ph.D. in computer science and cognitive science from the University of California, San Diego in 1998, where he studied with Charles Elkan. After one year as a Sloan/DOE Postdoctoral Fellow with David Haussler at the University of California, Santa Cruz, Noble joined the faculty of the Department of Computer Science at Columbia University as an Assistant Professor. He also holds a joint appointment at the Columbia Genome Center. Noble is the recipient of an NSF CAREER award and is a Sloan Research Fellow.

## Multiple Sequence Alignment Using Partial Order Graphs

Speaker: Professor Christopher Lee
Date: Tuesday, November 27th
Time: 3:30 pm - 5:15 pm
Place: AP&M Building - Room 4301
Host: Pavel Pevzner

ABSTRACT
Progressive multiple sequence alignment (MSA) methods depend on reducing an MSA to a linear profile for each alignment step. However, this leads to loss of information needed for accurate alignment, and gap scoring artifacts.

Results: We present a graph representation of an MSA that can itself be aligned directly by pairwise dynamic programming, eliminating the need to reduce the MSA to a profile. This enables our algorithm (POA: partial order alignment) to guarantee that the optimal alignment of each new sequence versus each sequence in the MSA will be considered. Moreover, this algorithm introduces a new edit operator, homologous recombination, important for multidomain sequences. The algorithm has improved speed (linear time complexity) over existing MSA algorithms, enabling construction of massive and complex alignments (e.g., an alignment of 5000 sequences in 4 hours on a Pentium II). We demonstrate the utility of this algorithm on a family of multidomain SH2 proteins, and on EST assemblies containing alternative splicing and polymorphism.

Bio: Dr. Lee is Assistant Professor of Chemistry & Biochemistry at UCLA. His research interests include bioinformatics, coding region polymorphism and alternative splicing. He is a member of the Board of Scientific Counselors for NCBI, and recently received the MIT Technology Review TR100 award. Further information is available at http://www.bioinformatics.ucla.edu/leelab

## Functional MRI of the Medial Temporal Lobe: Visualizing in Flat Space the Neural Components of Memory

Speaker: Dr. Michael Zeineh
Date: Friday, November 16, 2001
Time: 1:10 pm - 2:10 pm
Place: AP&M Building - Room 4301

ABSTRACT
Memory is essential to central nervous system operation. Under the rubric of memory falls the entire history of an organism: updating and accessing this history is vital to an organism's survival. Optimal memory functioning requires that information be quickly and accurately stored, updated, and retrieved. The role of the medial temporal lobe (MTL) in memory processing has been demonstrated by the serious defects in encoding and retrieving new memories in patients and animals with MTL lesions. The MTL consists of several anatomic substructures, namely the hippocampus, amygdala, and adjacent neocortex. Initially, lesion studies found that focal hippocampal lesions impair memory, and larger lesions that encompass adjacent neocortex simply lead to more significant memory impairments. The interpretation was that the hippocampus is the final point of convergence for all declarative memory formation. Recent lesion work, however, has implicated a more complex architecture: anatomically distinct regions may contribute to different components of memory.

Human neuroimaging is an excellent technique to test this hypothesis because the experimenter can query the subject directly and correlate measured brain activity with encoding, recognition, and recall performance. Neuroimaging is based on the phenomenon that blood flow increases to active brain regions. Subsequent modulation of venous blood oxygenation results in signal changes that can be measured using functional magnetic resonance imaging (fMRI). While neuroimaging studies in humans have examined the MTL during a variety of memory conditions, few have specifically measured the differential involvement of its subregions. Using high-resolution fMRI combined with cortical unfolding techniques, our group has examined the activity of cellular compartments within the MTL that contribute to memory formation.

This talk will 1) provide background on the neuroanatomy of memory, 2) describe how functional imaging can be used to measure cerebral activity, 3) detail how we have used cortical unfolding techniques to visual activity within the medial temporal lobe, and 4) demonstrate differential activity within medial temporal substructures observed in memory experiments.

## PatternHunter: Any Genome Anywhere

Speaker: Professor Ming Lin, University of California, Santa Barbara
Date: Friday, October 26, 1001
Time: 11:00 am to 12:00 pm
Place: AP&M Building - Room 4301
Host: Pavel Pevzner

ABSTRACT:
Genomics studies routinely depend on homology searches based on the strategy of finding short seed matches which are then extended. The fast genomic data growth presents a dilemma for Blast-type DNA homology search techniques: increasing seed size decreases sensitivity whereas decreasing seed size slows down computation. For example, in order to process longer sequences, MegaBlast uses seeds of length 28 instead of Blastn's 11. This limits MegaBlast's application to only highly similar sequences. Several other programs based on suffix trees also run at MegaBlast speed and suitable only for highly similar sequences (e.g. MUMmer, Quasar).

We introduce a novel idea for homology search that simultaneously increases sensitivity and search speed. We have implemented this idea in the portable Java program PatternHunter'' that finds homologies within one or between two DNA sequences. Together with many algorithmic improvements, PatternHunter simultaneously exceeds Blastn in sensitivity and alignment quality, MegaBlast in speed (on long sequences), and both in memory use. With a 700+Mhz cpu, its running time ranges from seconds for prokaryotic genomes to minutes for arabidopsis chromosomes to hours for human chromosomes. At sensitivity level equivalent to Blastn seed length 12, PatternHunter processes 9 gigabases of unassembled 3-coverage of mouse genome vs the human genome in 20 cpu-days, for the mouse genome project. Written in Java, it runs any genome anywhere.

Dr. Li is a Professor of Computer Science at University of California at Santa Barbara. His main research interest is bioinformatics. His other research interests include Kolmogorov complexity, machine learning, and analysis of algorithms. He is the author of the book M. Li, P. Vitanyi , "An introduction to Kolmogorov Complexity."

## Improved Shotgun Assembly Techniques for Large Genomes

Speakers: Brian R. Hunt and James A. Yorke, University of Maryland
Date: Tuesday, October 23, 2001
Time: 3:00 to 4:30 pm
Place: AP&M Building Room 4301
Host: Pavel Pevzner

ABSTRACT:
The whole-genome shotgun assembly technique has been remarkably successful in efforts to determine the sequence of base pairs that make up a genome. This technique has required long computation times and/or very large amounts of RAM when applied to large genomes. We describe an approach to shotgun assembly for large genomes such as the mouse, rat, or human that can run efficiently on a computer with 1 or 2 GB of RAM. We describe our methods and results for a crucial step in this direction: the creation of a list of pairs of reads that potentially overlap. A large percentage of the overlaps found by a standard seed-and-extend approach are spurious, due to repeat regions in the genome. Our approach is to reduce the percentage of spurious overlaps in the list to just a few percent before proceeding with assembly. For this task, we use a modified seed-and-extend approach which runs very quickly, followed by an iterative cleaning and error correcting procedure which uses quality values. We also describe methods currently under development to assemble a skeleton of mated read pairs spanning the genome. In particular, we introduce the notion of "chain validation", which further increases the accuracy of both overlap and mate information. We presents preliminary results for all of our techniques applied to a library of 1 million faux reads we created from the C. elegans genome using real quality values.

Brian Hunt is an Associate Professor of Mathematics and James Yorke is a Distinguished University Professor of Mathematics and Physics. Both are affiliated with the Institute for Physical Science and Technology, of which Yorke serves as Director.

## Finding Composite Regulatory Patterns in DNA Sequences

Speaker:  Eleazar Eskin, Department of Computer Science, Columbia University
Time:        3:00 pm, Wednesday , August, 29, 2001
Place:       Applied Physics and Math Building Room 4301
Host:         Pavel Pevzner

ABSTRACT: Pattern discovery in DNA sequences is a fundamental problem in computational biology with important applications in  finding regulatory signals. Current approaches to pattern discovery focus on monad patterns that correspond to relatively short contiguous strings. However, many of the actual regulatory signals are composite patterns that are groups of monad signals. A difficulty in discovering composite signals is that one of the component monad signal in the groups may be too weak". Since the traditional monad-based motif finding algorithms usually output one (or a few) high scoring patterns, they often fail to find composite regulatory signals consisting of weak monad parts. A better approach would be to output a long ranked list (i.e.,
thousands) of top-scoring patterns in a hope that the monad parts belong to this list. We present a new MITRA algorithm for discovering composite signals and several sets of experiments over biological and synthetic data.

## Genes, Permutations and the Common Interval Problem

Speaker: Steffen Heber
Time: 11:00 am, Thursday, May 31, 2001
Place: AP&M Room 4301
Host: Pavel Pevzner

The genome-wide prediction of protein function is an extremely challenging task. For every sequenced genome the role of about half the genes is unknown and most approaches to determine their function are time consuming and expensive.
The availability of a growing number of completely sequenced genomes opens new opportunities for detecting possible functional associations between genes. One approach [Overbeek99,Marcotte99,Snell2000] is based on the observation that genes that repeatedly occur clustered in multiple phylogenetically distant genomes tend to encode functionally interacting proteins, e.g. proteins of a protein complex or proteins of a metabolic pathway. If one models sequenced genomes by permutations of genes, the problem of finding co-occurring genes translates to the combinatorial
problem of finding common intervals in these permutations.
In this talk, we describe an efficient algorithm to solve this problem, and give several lines of future work concerning applications in comparative genomics which promise many interesting bioinformatical questions.

Short Bio: Steffen is a postdoc with Prof. Pevzner. Prior to joining UCSD he worked at the DKFZ (German Cancer Research Centre) in the groups of Prof. Vingron and Dr. Hoheisel.
URL: http:/www.cse.ucsd.edu/~sheber

## COMPUTATIONAL PROTEOMICS AND VIRTUAL LIGAND SCREENING

Speaker: RUBEN ABAGYAN, Department of Molecular Biology, The Scripps Research Institute
Time: 11:00 am, Thursday, May 24, 2001
Place: AP&M Room 4301
Host: Pavel Pevzner

Large scale discovery of new human genes, gene families and isozymes creates an exciting biomedical opportunity. A combination of high throughput efforts to accelerate and automate experimental determination of three-dimensional structures, improved modeling by homology techniques can be employed to generate 3D models for uncharacterized gene family members. Fast fold (or loop) predictions are performed with accurate Poisson electrostatics.

New computational technology for virtual ligand screening allows to generate alternative receptor conformations and perform flexible docking of hundreds of thousands of virtual compounds to the binding site. The success of this technology is demonstrated on several benchmarks and in five experimental projects which tested the prediction results. For well defined binding pockets only about 1% of the initial compound set actually needs to be synthesized and tested.

Short Bio: Dr. Abagyan is a professor at the Department of Molecular Biology, The Scripps Research Institute. Prior to joining Scripps he directed bioinformatics efforts at NYU.
URL: http://abagyan.scripps.edu/ruben.htm

## THE ROLE OF BIOINFORMATICS IN A MICROARRAY EXPERIMENT

Speaker: SEMYON KRUGLYAK, Illumina, San Diego
Time: 11:00 am, Thursday, May 17, 2001
Place: AP&M Room 4301
Host: Pavel Pevzner

The Illumina BeadArray technology is an exciting new platform in the emerging field of microarrays. Applications of microarray technology include SNP genotyping, gene xpression analysis, and proteomics. I will introduce the Illumina platform and describe applications to gene expression. The focus of the talk will be on the role of bioinformatics in an array experiment and the computational challenges that we encounter.

Short Bio: Dr. Kruglyak graduated from Cornell with Ph.D. in Applied Mathematics. His research interests cover DNA sequencing, microsatellite evolution and more recently, the analysis of gene expression data.
URL: http://www.illumina.com/

## ALGORITHMIC STUDY IN COMPUTATIONAL MASS-SPECTROMETRY

Speaker: TIM TING CHEN, University of Southern California
Time: 11:00 am, Thursday, May 3, 2001
Place: AP&M Room 4301
Host: Pavel Pevzner

In recent years, more and more genomes of model organisms have been sequenced. Using these genomic sequences, researchers have made tremendous progress in the study of the genome. Beyond all these successes is the far more challenging and rewarding task of understanding the proteome, the functions of genes and their protein products.

Tandem Mass Spectrometry (MS-MS), which identifies proteins/peptides from their fragmentation mass patterns, is emerging as a powerful technique for the studies of proteins. In this talk, we focus on three fundamental computational problems arisen in proteomic projects:
(1) the peptide identification problem, which finds a peptide sequence that is optimally correlated to a tandem mass spectrum, from a protein or a genomic database,
(2) the de novo peptide sequencing problem, which sequences peptides directly from tandem mass spectra without database search, and
(3) the identification of protein cross-linking sites, which finds a pair of cross-linked peptide sequences optimally correlated to a tandem mass spectrum.
The solution to all these problems requires novel and efficient algorithms, together with practical implementations and data structures.

Short Bio: Dr. Chen is an assistant professor at the Department of Mathematics, Department of Computer Science, and Center for Computational and Experimental Genomics at USC. He was a postdoc research fellow at George M. Church Laboratory at Harvard Medical School. He received his Ph.D. degree in Computer Science from the State University of New York at Stony Brook.
URL: http://hto-13.usc.edu/~tingchen/