Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner.
SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing.
Journal of Computational Biology. May 2012, 19(5): 455-477. doi:10.1089/cmb.2012.0021.
Abstract
The lion's share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A major goal of single-cell genomics is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of single-cell data is challenging because of highly non-uniform read coverage as well as elevated levels of sequencing errors and chimeric reads. We describe SPAdes, a new assembler for both single-cell and standard (multicell) assembly, and demonstrate that it improves on the recently released E+V-SC assembler (specialized for single-cell data) and on the popular assemblers Velvet and SOAPdenovo (for multicell data). SPAdes generates single-cell assemblies, providing information about genomes of uncultivatable bacteria that vastly exceeds what may be obtained via traditional metagenomics studies.
Datasets
A single E. coli cell and a single marine cell (Deltaproteobacterium SAR324) were isolated by micromanipulation. Paired-end libraries were generated on an Illumina Genome Analyzer IIx from MDA-amplified single-cell DNA and from standard (multicell) genomic DNA prepared from cultured E. coli. We call these datasets ECOLI-SC, ECOLI-MC, and SAR324. They consist of 100 bp paired-end reads with average insert sizes of 266 bp for ECOLI-SC, 215 bp for ECOLI-MC, and 240 bp for SAR324. Both E. coli datasets have 600x coverage.
Results
We benchmarked seven assemblers (EULER-SR, IDBA, SOAPdenovo, Velvet, Velvet-SC, E+V-SC, and SPAdes) on three datasets (ECOLI-SC, ECOLI-MC, and SAR324). To provide unbiased benchmarking, we used the assembly evaluation tool Plantagora (http://www.plantagora.org).
Table 1 illustrates that SPAdes compares well to other assemblers on multicell and, particularly, single-cell datasets. SPAdes assembled ~96.1% of the E. coli genome from the ECOLI-SC dataset, with an N50 of 49,623 bp and a single misassembly. E+V-SC assembled ~93.8% of the E. coli genome, with an N50 of 32,051 bp and two misassemblies. SPAdes captured ~100 more E. coli genes than E+V-SC, ~800 more than Velvet, and ~900 more than SOAPdenovo.
On the ECOLI-MC dataset, the EULER-SR assembly featured the largest N50 (110,153 bp) but was compromised by 10 misassemblies. All other assemblers generated a small number of misassembled contigs, ranging from 4 (IDBA and Velvet) to 0 (Velvet-SC, E+V-SC, and SPAdes-single reads). SPAdes and Velvet also had larger N50 (86,590 and 78,602 bp) than other assemblers except for EULER-SR. All assemblers but SOAPdenovo produced nearly 100% coverage of the genome. Table 1 reveals that the substitution error rate ranges over an order of magnitude for different assemblers, with Velvet (for ECOLI-SC) and SPAdes-single reads (for ECOLI-MC) the most accurate.
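The N50 values quoted above can be made concrete with a short sketch. It uses the common definition of N50: the largest contig length L such that contigs of length at least L together account for at least half of the total assembly length. This is an assumption about the metric, since the text does not spell out how Plantagora computes it, and the function name `n50` and the toy contig lengths are hypothetical illustrations, not data from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: the largest length L such that contigs of
    length >= L together cover at least half of the total length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy assembly: total length 290, half of that is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # 80 + 70 = 150 >= 145, so this prints 70
```

Because N50 depends only on contig lengths, a misjoined contig can inflate it, which is why it is read alongside the misassembly counts rather than on its own.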
We further compared E+V-SC and SPAdes on the SAR324 dataset. SPAdes assembled contigs totaling 5,129,304 bp (vs. 4,255,983 bp for E+V-SC), with an N50 of 75,366 bp (as compared to 30,293 bp for E+V-SC). Since the complete genome of Deltaproteobacterium SAR324 is unknown, we used long ORFs to estimate the number of genes longer than 600 bp, as a proxy for assembly quality. There are 2,603 long ORFs in the SPAdes assembly vs. 2,377 for E+V-SC.
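As a rough illustration of the long-ORF proxy, the sketch below counts open reading frames of at least 600 bp in a single contig. The text does not say how ORFs were called, so the rules here are assumptions made for the example: an ORF runs from the first ATG in a reading frame to the next in-frame stop codon (inclusive), all six frames are scanned, and ORFs truncated at a contig end are ignored. The names `count_long_orfs` and `reverse_complement` are hypothetical, and the input is assumed to contain only A, C, G, and T.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_long_orfs(seq, min_len=600):
    """Count ORFs (ATG through stop codon) of at least min_len bp.

    Assumption: an ORF starts at the first ATG seen in a frame and ends
    at the next in-frame stop codon; both strands are scanned.
    """
    seq = seq.upper()
    count = 0
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            start = None
            # Step through complete codons in this reading frame.
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        count += 1
                    start = None
    return count


# Hypothetical 36 bp toy ORF; a lowered threshold is used so it is counted.
toy = "ATG" + "GCC" * 10 + "TAA"
print(count_long_orfs(toy, min_len=30))  # prints 1
```

Summing this count over all contigs of an assembly would give a figure comparable in spirit to the long-ORF counts above, though the exact numbers depend on the ORF-calling rules chosen.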
This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.