Invited talks (video)

SPAdes: a New Genome Assembly Algorithm and its Applications to Single-Cell Sequencing

Sergey Nikolenko
(St. Petersburg Academic University)

The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.

This is a joint work with Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey Gurevich, Mikhail Dvorkin, Alexander Kulikov, Valery Lesin, Son Pham, Andrey Prjibelski, Alexey Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max Alekseyev, and Pavel Pevzner.

Genome rearrangements: when intuition fails

Max Alekseyev
(University of South Carolina)

I shall describe two rather unexpected phenomena in genome rearrangements analysis.
First, weighted genomic distance designed to bound the proportion of transpositions in rearrangement scenarios between two genomes does not actually achieve this goal.
Second, while the median score of three genomes can be approximated by the sum of their pairwise genomic distances (up to a constant factor), these two measures of evolutionary remoteness of genomes are not as well-correlated as one's intuition may suggest.

The combinatorics of the breakage fusion bridge and its application to cancer genomics

Vineet Bafna (University of California, San Diego)

The breakage-fusion-bridge (BFB) mechanism was proposed over seven decades ago and is a source of genomic variability and gene amplification in cancer. Here we formally model and analyze the BFB mechanism, to our knowledge the first time this has been undertaken. We show that BFB can be modeled as successive inverted prefix duplications of a string. Using this model, we show that BFB can achieve a surprisingly broad range of amplification patterns. We find that a sequence of BFB operations can be found that nearly fits most patterns of copy number increases along a chromosome. We conclude that this limits the usefulness of methods like array CGH for detecting BFB.

Following Dobzhansky: Extending The Reach of Phylogenies

Bernard Moret
(École Polytechnique Fédérale de Lausanne)

One of the most cited articles in biology is a 1973 piece by Theodosius Dobzhansky in the The American Biology Teacher entitled "Nothing in biology makes sense except in the light of evolution." Since then, phylogenetic tools have seen fast increasing use throughout biological and biomedical research -- over 10,000 citations to such tools appear every year. Yet, in spite of the fame of the article and the spread of phylogenetic analysis, the use of methods grounded in evolutionary biology is not nearly as pervasive as it ought to be. We illustrate some of the potential of phylogenetic approaches through two projects carried out in our laboratory. The first, phylogenetic transfer of knowledge, leverages known phylogenetic relationships to improve inference of data about modern structures. We have successfully applied our ProPhyC model to the refinement of regulatory networks and are currently using it for the refinement and prediction of protein contact networks in protein complexes. The second is a two-level phylogenetic model and associated inference algorithm for the evolution of isoforms in alternative transcription. We have applied our TrEvoR tool to the entire ASPIC database of alternative transcripts, resulting in much enhanced accuracy in the classification of transcripts. Our contention is that these are but two of the numerous further applications of phylogenetic methods in the life sciences, applications that will require both modelling and algorithmic research.

Alignment-Free Local Sequence Comparison: Statistics and Algorithms

Michael Waterman
(University of Southern California)

The sum of the products of the k-word counts from each of two sequences (D2) has been used to test if the sequences have a significant similarity. Important for large-scale applications, the statistic can be rapidly computed. In this talk some known results will be summarized where superior statistics (D2*) are available. Local versions of the statistics are considered in this talk. Although local behaves well statistically, a naive linear comparison increases the time to quadratic. We designed a geometric framework for identifying pairs of sequence windows that maximize the dot product. The D2 and D2* measures are similar to cosine similarity but with the additional information using the norm of the word count vectors, rather than only the angle between them. Our framework discretizes the norms of vectors, and transforms our dot-product similarity into Euclidean distance, but at loss of accuracy. We bound the error our transformation introduces and relate this error to the efficiency of algorithms based on this transformation.

This is joint work with Fengzhu Sun, Ehsan Behmanghader, and Andrew D. Smith.

Three Optimization Problems in Molecular Biology and Genetics

Richard Karp
(University of California, Berkeley)

We present statistical evidence that, within the protein interaction networks of H. sapiens and D. melanogaster, conserved connected subgraphs with low conductance tend to have high functional coherence. We present efficient algorithms for finding such subgraphs (joint work with Luqman Hodgkinson).

A protein regulatory network can be described by an acyclic digraph in which each node corresponds to a protein that may be in either an active or an inactive state, depending on the states of the nodes immediately preceding it. Given the graph-theoretic structure of the network and examples of its input-output behavior under various perturbations of the states of some of its proteins, we use integer programming to describe the dependencies at the nodes by Boolean functions that come closest to producing the observed input-output behavior (joint work with Roded Sharan).

Given a set of coins with unknown heads probabilities we consider the problem of performing a sequence of coin tosses to efficiently identify the coin with highest heads probability. This is a simplified variant of the following problem in genetics: given a set of mutations that may be associated with a disease, select a sequence of experiments, each of which observes the state of a mutation in a case and a control, to efficiently identify the mutation most associated with the disease (joint work with Karthik Chandrasekaran).