Table of Contents

Preface


Acknowledgements


Introduction



Part I: GENOMES


1. Vineet Bafna, UCSD

Identifying the Genetic Basis of Disease

It is all in the DNA. Our genetic code, or genotype influences much about us. Not only are physical attributes (appearance, height, weight, eye color, hair color, etc.) all fair game for genetics, but also possibly more important things like our susceptibility to diseases, response to a certain drug, and so on. We refer to these "observable physico-chemical traits" as phenotypes. Note that "to influence" is not the same as "to determine" - other factors such as the environment one grows up in can play a role. The exact contribution of the genotype in determining a specific phenotype is a subject of much research. The best we can do today is to measure correlations between the two. Even this simpler problem has many challenges.

2. Kun-Mao Chao, National Taiwan University

Pattern Identification in a Haplotype Block

A Single Nucleotide Polymorphism (SNP, pronounced snip) is a single nucleotide variation in the genome that recurs in a significant proportion of the population of a species. In recent years, the patterns of Linkage Disequilibrium (LD) observed in the human population reveal a block-like structure. The entire chromosome can be partitioned into high LD regions, referred to as haplotype blocks, interspersed by low LD regions, referred to as recombination hotspots. Within a haplotype block, there is little or no recombination and the SNPs are highly correlated. Consequently, a small subset of SNPS, called tag SNPs, is sufficient to distinguish the haplotype patterns of the block. Using tag SNPs for association studies can greatly reduce the geno-typing cost since it does not require genotyping all SNPs. We illustrate how to recast the tag SNP selection problem as the set-covering problem and the integer-programming problem.two well-known optimization problems in computer science. Greedy algorithms and LP-relaxation techniques are then employed to tackle such optimization problems. We conclude this chapter by mentioning a few extensions.

3. Phillip Compeau and Pavel Pevzner, UCSD

Genome Reconstruction: A Puzzle with a Billion Pieces

While we can read a book one letter at a time, biologists still lack the ability to read a DNA sequence one nucleotide at a time. Instead, they can identify short fragments (approximately 100 nucleotides long) called reads; however, they do not know where these reads are located within the genome. Thus, assembling a genome from reads is like putting together a giant puzzle with a billion pieces, a formidable mathematical problem. We introduce some of the fascinating history underlying both the mathematical and the biological sides of DNA sequencing.

4. Mikhail Gelfand, Moscow State University, Russia

Dynamic Programming: One Algorithmic Key for Many Biological Locks

A major part of computational biology deals with the similarity of sequences, be they DNA fragments or proteins. There are four aspects to this problem: defining the measure of similarity, calculating this measure for given sequences, assessing its statistical significance, and interpreting the results from the biological viewpoint. Biologists are interested in the latter: similar sequences may have a common origin, as well as similar structure and function. However, here we shall deal with a formal problem: how to discover similarity.

5. Christopher Lee, UCLA

Measuring Evidence: Who's Your Daddy?

Single Nucleotide Polymorphisms (SNPs) are widely used as a genetic "fingerprint" for forensic tests and other genetic screening. For example, they can be used to measure evidence for paternity. To understand how scientists measure the strength of such evidence, we introduce basic principles of statistical inference using Bayes' Law, and apply them to simple genetics examples and the more challenging case of paternity testing. But first, just to make it personal, Maury and I have a little revelation for you...


Part II: GENE TRANSCRIPTION AND REGULATION


6. Andrei Grigoriev, Rutgers University

How Do Replication and Transcription Change Genomes?

From an evolutionary standpoint, DNA replication and transcription are two fundamental processes enabling reliable passage of fitness advantages through generations (in DNA form) and the manifestation of these advantages (in RNA form), respectively. Paradoxically, both of these basic mechanisms not only preserve genetic information but also apparently cause systematic genomic changes directly. Here, I show how genome-scale sequence analysis can help identify such effects, estimate their relative contributions, and find practical application (e.g., for predicting replication origins). Visualization of bioinformatics results is often the best way of connecting them to the underlying biological question, and I describe the process of choosing the visual representation that would help compare different organisms, genomes and chromosomes.

7. Sridhar Hannenhalli, University of Maryland

Modeling Regulatory Motifs

All biological processes are mediated by specific interactions between cellular molecules (DNA, RNA, proteins, etc.). The molecular identification marks, or signature, required for precise and specific interactions between various biomolecules is not always clear, a comprehensive knowledge of which is critical not only for a mechanistic understanding of these interactions but also for therapeutic interventions of these processes. The biological problem we will address here, stated in general terms, is: how do biomolecules accurately identify their binding partners in an extremely crowded cellular environment? An important class of cellular interactions concerns the recognition of specific DNA sites by various DNA binding proteins, e.g. transcription factors (TF). Precisely how the TFs recognize their DNA binding sites with high fidelity is an active area of research. While a detailed treatment of this question covers several areas of investigation, we will focus on aspects of TF-DNA recognition signal that is encoded in the DNA binding site itself. In this chapter we will summarize a number of approaches to model DNA sequence signatures recognized by transcription factor proteins.

8. Haixu Tang, Indiana University

How Does Influenza Virus Jump from Animals to Humans?

As shown by the 2009 Swine Flu outbreak, influenza epidemics are often caused by human-adapted influenza viruses originally infecting other animals. Influenza viruses infect host cells through the specific interaction between the viral hemagglutinin protein and the sugar molecules attached to the host cell membrane (called glycans). The molecular mechanism of the host switch for Avian influenza viruses was thus believed to be related to the mutations occurring in the viral hemagglutinin protein that change its binding specificity from the avian-specific glycans to the human-specific glycans. This theory, however, is not fully consistent with the epidemic observations of several influenza strains. I will introduce the bioinformatics approaches to the analysis of glycan array experiments that revealed the glycan structural pattern recognized by the hemagglutinin from viruses with different host specificities. The glycan motif finding algorithm adopted here is an extension of the commonly used protein/DNA sequence motif finding algorithms, which works for trees (representing glycan structures) rather than strings (as protein or DNA sequences).


Part III: EVOLUTION


9. Steffen Heber, North Carolina State University (with Brian Howard)

Genome Rearrangements

Genome rearrangements are one of the driving forces of evolution, and they are key events in the development of many diseases, including cancer. In our presentation, we focus on a selection of topics that provides a compelling introduction to key aspects of genome rearrangements and the algorithms that have been developed for their analysis. We demonstrate the intimate connection between biological data, mathematical modeling, and the design of efficient and practical computer algorithms for the genome rearrangement problem. Interestingly, the computational complexity of rearrangement algorithms is very different depending on how exactly the problem is formulated. After reading this chapter, you will be able to rearrange English words, giant chromosomes, and entire genomes. With some additional effort, you may even be able to rearrange the Conclusion Section of our chapter into the abstract you are reading right now.

10. Eugene Koonin (with Pere Puigbò and Yuri Wolf)

Comparison of Phylogenetic Trees and Search For a Central Trend in the "Forest of Life"

The widespread exchange of genes among prokaryotes, known as horizontal gene transfer (HGT), is often considered to "uproot" the Tree of Life (TOL). Indeed, it is by now fully clear that genes in general possess different evolutionary histories. However, the possibility remains that the TOL concept can be reformulated and remain valid as a statistical central trend in the phylogenetic "Forest of Life" (FOL). This article describes a computational pipeline developed to chart the FOL by comparative analysis of thousands of phylogenetic trees. This analysis reveals a distinct, consistent phylogenentic signal that is particularly strong among the Nearly Universal Trees (NUTs), which correspond to genes represented in all or most of the analyzed organisms. Despite the substantial amount of apparent HGT seen even among the NUTs, these gene transfers appear to be distributed randomly and do not obscure the central tree-like trend.

11. Jian Ma, University of Illinois at Urbana-Champaign

Reconstructing the History of Large-Scale Genomic Changes:
Biological Questions and Computational Challenges

In addition to point mutations, larger scale structural changes (including rearrangements, duplications, insertions and deletions) are also prevalent between different mammalian genomes. Capturing these large scale changes is critical to unraveling the history of mammalian evolution in order to better understand the human genome. It also has profound biomedical significance, because many human diseases are associated with structural genomic aberrations. The increasing number of mammalian genomes being sequenced as well as the recent advancement in DNA sequencing technologies are allowing us to identify these structural genomic changes with vastly greater accuracy. However, there are a considerable number of computational challenges related to these problems. In this paper, we introduce the ancestral reconstruction problem, which enables us to explain the large scale genomic changes between species in an evolutionary context. The application of these methods to within-species structural variation and disease genome analysis is also discussed. The target audience of this paper is advanced undergraduate students in biology.


Part IV: PHYLOGENY


12. Ran Libeskind-Hadas, Harvey Mudd College

Figs, Wasps, Gophers, and Lice: A Computational Exploration of Coevolution

This chapter explores the topic of coevolution: The genetic change in one species in response to the change in another. For example, in some cases a parasite species might evolve to specialize with its host species. In other cases, the relationship between two species may be mutually beneficial and coevolution may serve to strengthen the benefits of that relationship. One important way to study the coevolution of species is through a computational technique called cophylogeny reconstruction. In this technique, we first obtain the evolutionary (phylogenetic) trees for the two species and then try to map one tree onto the other in the "simplest" (most parsimonious) possible way. We can then use these mappings to determine how likely it is that the two species coevolved. This chapter begins with descriptions of several pairs of species that are believed to have coevolved: figs and the wasps that pollinate them, gophers and the lice that infest them, and a bird species that "tricks" another species to tend to its young. Next, we describe the cophylogeny reconstruction problem, its computational complexity, and a technique for finding good solutions for this problem. Finally, the reader is invited to use this computational method - through a freely accessible software package called Jane - to investigate the relationships between the pairs of species described at the beginning of the chapter.

13. Tiffani Williams, Texas A&M (with Seung-Jin Sul)

Big Cat Phylogenies, Consensus Trees, and Computational Thinking

Phylogenetics seeks to deduce the pattern of relatedness between organisms by using a phylogeny or evolutionary tree. For a given set of organisms or taxa, there may be many evolutionary trees depicting how these organisms evolved from a common ancestor. As a result, consensus trees are a popular approach for summarizing the shared evolutionary relationships in a group of trees. We examine these consensus techniques by studying how the pantherine lineage of cats (clouded leopard, jaguar, leopard, lion, snow leopard, and tiger) evolved, which is hotly debated. While there are many phylogenetic resources that describe consensus trees, there is very little information regarding the underlying computational techniques (such as sorting numbers, hashing functions, and traversing trees) for building them written for biologists. The pantherine cats provides us with a small, relevant example to explore the computational techniques (such as sorting numbers, hashing functions, and traversing trees) for constructing consensus trees. Our hope is that life scientists enjoy peeking under the computational hood of consensus tree construction and share their positive experiences with others in their community.

14. Tandy Warnow, UT Austin

Phylogenetic Estimation: Optimization Problems, Heuristics, and Performance Analysis

Phylogenetic trees, also known as evolutionary trees, are fundamental to many problems in biological and biomedical research, including protein structure and function estimation, drug design, estimating the origins of mankind, etc. However, the estimation of a phylogeny is enormously challenging from a computational standpoint, often involving months or more of computer time in order to produce estimates of evolutionary histories. Even these month-long analyses are not guaranteed to produce accurate estimates of evolution, for a variety of reasons. In addition to the errors in phylogeny estimation produced by limited amounts of data, there is the added.and critically important.fact that all the best phylogeny estimation methods are based upon heuristics for optimization problems that are difficult to solve. Consequently, large data sets are often "solved" only approximately. In this chapter, we discuss the issues involved in phylogeny estimation, as well as the technical term in computer science, "NP-Hard".


Part V: REGULATORY NETWORKS


15. Nataša Pržulj, Imperial College, London

Biological Networks Uncover Evolution, Disease, and Gene Functions

Networks have been used to model many biological systems. Hence, graph theory is a commonly used tool in systems biology. This chapter presents the basics of graph-theoretic analysis and modeling of biological data. In particular, first it describes biological network data along with sources of noise and biases in them, then it introduces tools for network comparison and modeling, talks about the most commonly used network models, and how to use network topology to uncover biological function from these complex network data. Finally, it presents the basics of network alignment algorithms.

16. Russell Schwartz, Carnegie Mellon University

Regulatory Network Inference

The functions of complex biological systems can often be understood by analysis of the interactions of their components through networks of regulation. Attempts to understand biology at the systems level therefore depend on methods for inferring complicated networks of interactions, often from only indirect evidence of how the components of the system behave under different conditions. In this chapter, we cover some of the major computational concepts involved in inferring regulatory networks by examining the specific problem of learning genetic regulatory networks from gene expression data. We first provide some background on the biological and technological issues involved in this topic. We then explore how one can abstract the problem of regulatory network inference so as to create a precise mathematical framework for solving the problem, starting with a highly simplified variant of the problem and gradually working towards a more realistic picture of network inference in practice. In the process, we explore some of the key concepts involved in formulating and learning computational models of complex systems from large, noisy data sets that are used throughout systems biology.