The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on popular assemblers Velvet and SoapDeNovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.
This is a joint work with Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey Gurevich, Mikhail Dvorkin, Alexander Kulikov, Valery Lesin, Son Pham, Andrey Prjibelski, Alexey Pyshkin, Alexander Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max Alekseyev, and Pavel Pevzner.
I shall describe two rather unexpected phenomena in genome rearrangements
First, weighted genomic distance designed to bound the proportion of transpositions in rearrangement scenarios between two genomes does not actually achieve this goal.
Second, while the median score of three genomes can be approximated by the sum of their pairwise genomic distances (up to a constant factor), these two measures of evolutionary remoteness of genomes are not as well-correlated as one's intuition may suggest.
The breakage-fusion-bridge (BFB) mechanism was proposed over seven decades ago and is a source of genomic variability and gene amplification in cancer. Here we formally model and analyze the BFB mechanism, to our knowledge the first time this has been undertaken. We show that BFB can be modeled as successive inverted prefix duplications of a string. Using this model, we show that BFB can achieve a surprisingly broad range of amplification patterns. We find that a sequence of BFB operations can be found that nearly fits most patterns of copy number increases along a chromosome. We conclude that this limits the usefulness of methods like array CGH for detecting BFB.
One of the most cited articles in biology is a 1973 piece by Theodosius Dobzhansky in the The American Biology Teacher entitled "Nothing in biology makes sense except in the light of evolution." Since then, phylogenetic tools have seen fast increasing use throughout biological and biomedical research -- over 10,000 citations to such tools appear every year. Yet, in spite of the fame of the article and the spread of phylogenetic analysis, the use of methods grounded in evolutionary biology is not nearly as pervasive as it ought to be. We illustrate some of the potential of phylogenetic approaches through two projects carried out in our laboratory. The first, phylogenetic transfer of knowledge, leverages known phylogenetic relationships to improve inference of data about modern structures. We have successfully applied our ProPhyC model to the refinement of regulatory networks and are currently using it for the refinement and prediction of protein contact networks in protein complexes. The second is a two-level phylogenetic model and associated inference algorithm for the evolution of isoforms in alternative transcription. We have applied our TrEvoR tool to the entire ASPIC database of alternative transcripts, resulting in much enhanced accuracy in the classification of transcripts. Our contention is that these are but two of the numerous further applications of phylogenetic methods in the life sciences, applications that will require both modelling and algorithmic research.
The sum of the products of the k-word counts from each of two sequences (D2) has been used to test if the sequences have a significant similarity. Important for large-scale applications, the statistic can be rapidly computed. In this talk some known results will be summarized where superior statistics (D2*) are available. Local versions of the statistics are considered in this talk. Although local behaves well statistically, a naive linear comparison increases the time to quadratic. We designed a geometric framework for identifying pairs of sequence windows that maximize the dot product. The D2 and D2* measures are similar to cosine similarity but with the additional information using the norm of the word count vectors, rather than only the angle between them. Our framework discretizes the norms of vectors, and transforms our dot-product similarity into Euclidean distance, but at loss of accuracy. We bound the error our transformation introduces and relate this error to the efficiency of algorithms based on this transformation.
This is joint work with Fengzhu Sun, Ehsan Behmanghader, and Andrew D. Smith.
We present statistical evidence that, within the protein interaction networks of H. sapiens and D. melanogaster, conserved connected subgraphs with low conductance tend to have high functional coherence. We present efficient algorithms for finding such subgraphs (joint work with Luqman Hodgkinson).
A protein regulatory network can be described by an acyclic digraph in which each node corresponds to a protein that may be in either an active or an inactive state, depending on the states of the nodes immediately preceding it. Given the graph-theoretic structure of the network and examples of its input-output behavior under various perturbations of the states of some of its proteins, we use integer programming to describe the dependencies at the nodes by Boolean functions that come closest to producing the observed input-output behavior (joint work with Roded Sharan).
Given a set of coins with unknown heads probabilities we consider the problem of performing a sequence of coin tosses to efficiently identify the coin with highest heads probability. This is a simplified variant of the following problem in genetics: given a set of mutations that may be associated with a disease, select a sequence of experiments, each of which observes the state of a mutation in a case and a control, to efficiently identify the mutation most associated with the disease (joint work with Karthik Chandrasekaran).
Novel approaches to gene expression analysis seek differential co-expression patterns, wherein the level of co-expression of a particular set of genes differs markedly between disease and control samples. Such patterns can arise from a disease-related change in the regulatory mechanism governing that set of genes. Here we present a new method for detecting differentially co-expressed gene sets. We introduce a novel probabilistic score for differential correlation, and use it to detect pairs of modules whose intra-module correlation is consistently high but whose inter-module correlation differs markedly between disease and normal samples. The algorithm outperforms the state of the art methods in terms of significance and interpretability of the detected gene sets. Moreover, the discovered gene sets are enriched with disease-specific microRNA families. In a case study on Alzheimer's disease, the method dissected biological processes into functional sub-units that are differentially co-expressed, thereby revealing inner structures in disease regulatory networks.
Joint work with David Amar and Hershel Safer.
In this talk we will present several problems related to SNPs and haplotypes genetic variation. Linkage disequilibrium (LD) is the occurrence of some combinations of alleles or genetic markers in a population more often or less often than would be expected from a random formation of haplotypes from alleles based on their frequencies. Linkage disequilibrium measures are of fundamental importance for the analysis of the empirical patterns of genetic variation in populations of individuals, although more than half a century of statistical genetics did not reached a consensus on how to measure LD. A classic survey on the topic includes in the title "proceed with caution" as a warning of the difficulties associated with this apparently non-quantitative phenomenon. Various axioms were proposed in the literature but the development of measures simultaneously satisfying desirable axioms is very limited. We will discuss two axioms related to the "curse of the pairwise" and the "interpretability of intermediate values" especially of interest for measuring LD in the context of GWAS analysis. We will present our construction of a new LD measure consistent with both axioms and an algorithm for its computation. The new measure is a "combinatorial square-root" of the LD measure r2. We will discuss other axioms and challenges in making progress on discovering unifying principles for LD measures. Two other problems for haplotype phasing and tagging SNPs selection will be presented aiming also at consolidating and hopefully unifying collections of diverse methods.
Given two strings A and B from the DNA alphabet, the Levenshtein edit distance between A and B, LED(A,B), is defined to be the minimum number of single character insertions, deletions and replacements to transform A to B (equivalently B to A). If in addition to the single character edits, one is permitted to perform segmental (block) edits in the form of (i) moving a block from any location to another, (ii) copying a block to any location, and (iii) uncopying (i.e. deleting one of the two occurrences of) a block, the resulting "block edit distance", BED(A,B), captures much of our current understanding of the relation between individual genome sequences. If among two communicating parties, Alice (holding genome sequence A) and Bob (holding genome sequence B), Alice wants to compute B, then, theoretically, the total number of bits Bob needs to send to Alice is Õ(BED(A,B) log BED(A,B) [Cormode et al., SODA 2000]. Considering that between a typical donor genome B and a reference genome A, the number of single character differences are in the order of a few million and the number of structural (i.e. blockwise) differences are in the order of tens of thousands, it should be possible to communicate genomes by exchanging only a few million bytes! Yet, today, the most effective way of communicating genome sequence data involves physically exchanging hard disks.
In this talk we will try to explain the wide gap between theoretical expectations and the current reality in genome communication, as well as storage, and pose some theoretical and practical problems on the way to the Google-ization of genome search and analysis. We will also try to explore the extent our theoretical predictions for genome sequences hold for the RNA-Seq data. Finally we will briefly go through some of the recent developments in transcriptome sequence analysis, especially in the context of disease studies.
Proliferation of drug-resistant diseases raises the challenge of searching for new, more efficient antibiotics. However, sequencing peptide antibiotics, once a heroic effort, remains time-consuming and error-prone.
Many antibiotics such as daptomycin or vancomycin (the modern antibiotics of last resort), represent cyclic Non-Ribosomal Peptides (NRPs) that do not follow the central dogma "DNA produces RNA produces Protein". They are assembled by nonribosomal peptide synthetases that represent both the mRNA-free template and building machinery for the peptide biosynthesis.
Thus, NRPs are not directly inscribed in genomes and cannot be inferred with traditional DNA sequencing. NRPs are of great pharmacological importance as they have been optimized by evolution for chemical defense and communication. NRPs include antibiotics, antitumor agents, immunosuppressors, toxins, and many peptides with still unknown functions.
Most NRPs contain nonstandard amino acids that are notoriously difficult to sequence. Moreover, the dominant technique for sequencing antibiotics (NMR) requires large amounts (milligrams) of highly purified materials that, for most compounds, are nearly impossible to obtain. Therefore, there is a need for NRPs sequencing by tandem mass spectrometry from picograms of material. Since nearly all NRPs are produced as related analogs by the same microorganism, we develop a mass spectrometry approach for sequencing all related peptides at once (in difference from the existing approach that analyzes individual peptides). Our results suggest that instead of attempting to isolate and NMR-sequence the most abundant compound, one should acquire spectra of many related compounds and sequence all of them at once using mass spectrometry. We illustrate applications of this approach by sequencing new variants of antibiotics from Bacillus brevis as well as sequencing a previously unknown family of NRPs (named Reginamides) which are isolated from a bacterial strain that produce natural products with anti-asthma activities.This is a joint work with Hosein Mohimani and Pieter Dorrestein (UCSD).
I shall talk about several ongoing studies on the evolution of bacterial genomes. These include evolution of pan-genomes, genome rearrangements, homologous recombination, gene strand preference, operon structure and horizontal gene transfer. The projects are at different states of completion, and, while there already are concrete results, I shall attempt to present a research program rather that detailed specifics.
Recent completion and inspection of whole genome sequence and assembly for over fity species of mammals, from platypus to panda to human, offer the prospect of a better view of the patterns of change within genome organization across the mammalian radiations. With an international community we have worked to sequence and assemble a deep (14x) coverage of the genome of domestic cat. The annotation reveals a rich assemblage of genes, repeats, SNPs, Numt, STRs, ERVs and countess genomic acronyms now quantified in that species and compared to other mammalian genomes. The genomics resource in cats has already produced a number of gene function insights as well as an archaeological view of the history of cat domestication. In addition, my colleagues and I have created Genome-10K, an international consortium of scientist who have set a goal of gathering, sequencing, assembling, and annotating to high quality some 10,000 vertebrate genomes with 2nd and 3rd generation sequencing technology within the coming five years. These activities provide an enormous BioInformatics challenge whose solution will provide future zoologists of every persuasion a genome sequence resource for their favourite study animal. The applications and potential for the genome sequence in several research questions will be discussed.