
SPAdes Genome Assembler

 

 

SPAdes 3.1 is out! 

It includes a mate-pair-only assembly pipeline for high-quality data, along with other improvements.

See all changes in changelog

SPAdes Assembler

 

SPAdes manual with installation guide (ver 3.1)

dipSPAdes manual

Download SPAdes

Assembling long Illumina paired-end reads (2x150 and 2x250) application note

SPAdes on GAGE-B data sets benchmark

Benchmark for other data sets

Support e-mail: spades.support@bioinf.spbau.ru

 

 

 

 

For the benchmarks we used:

E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes. S. aureus USA300 FPR3757 (chromosome and three plasmids) reference length is 2917469 bp with 2622 annotated genes.

Only contigs of 500 bp and longer were taken into consideration. Tables were obtained using QUAST 2.3.
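The 500 bp cutoff can be illustrated with a minimal Python sketch. It is not how QUAST applies the threshold internally; the function names `read_fasta_lengths` and `filter_by_length` and the toy sequences are assumptions made for this example.

```python
import io

def read_fasta_lengths(handle):
    """Return {contig_name: sequence_length} for a FASTA-formatted text stream."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the whole header line (minus '>') as the contig name.
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths

def filter_by_length(lengths, min_length=500):
    """Keep only contigs at or above the cutoff (500 bp in the benchmark)."""
    return {name: n for name, n in lengths.items() if n >= min_length}

# Toy input: one 499 bp contig (dropped) and one 600 bp contig (kept).
demo = io.StringIO(">short\n" + "A" * 499 + "\n>long\n" + "ACGT" * 150 + "\n")
kept = filter_by_length(read_fasta_lengths(demo))
print(kept)  # {'long': 600}
```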

Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
Single-cell E. coli                  
A5 14399 745 101584 4441145 8 12.01 0.14 89.880 3444
ABySS 68534 179 178720 4345617 6 3.32 0.81 88.268 3704
CLC 32506 503 113285 4656964 2 5.53 0.91 92.291 3768
EULER-SR 26662 429 140518 4248713 19 10.87 19.40 84.898 3416
Ray 45448 361 210820 4379139 17 6.22 1.29 88.372 3636
SOAPdenovo 1540 1166 51517 2958144 1 1.87 0.11 57.672 1766
Velvet 22648 261 132865 3501984 2 2.19 1.20 73.765 3080
E+V-SC 32051 344 132865 4540286 2 2.33 0.68 91.744 3771
IDBA-UD contigs 98306 244 284464 4814043 8 5.09 0.25 95.210 4045
IDBA-UD scaffolds 109057 229 284464 4813609 8 5.14 0.72 95.199 4052
SPAdes3.1 contigs 109059 238 268493 4797090 1 3.29 0.45 94.936 4036
SPAdes3.1 scaffolds 110081 233 268493 4799481 1 4.02 0.64 94.959 4041
                   
Isolate E. coli                  
A5 43651 176 181690 4551797 0 0.40 0.09 98.017 4163
ABySS 106155 96 221861 4619631 2 3.77 0.39 98.974 4241
CLC 86964 112 221549 4550314 1 1.96 0.29 98.094 4205
EULER-SR 110153 100 221409 4574240 9 3.16 5.03 98.102 4192
Ray 86246 98 221942 4634429 2 2.14 0.09 96.903 4136
SOAPdenovo 49626 181 165487 4535469 0 0.15 0.09 97.696 4132
Velvet 82776 120 242032 4554702 3 2.57 0.33 98.175 4196
E+V-SC 54856 171 166115 4539639 0 1.30 0.11 97.795 4134
IDBA-UD contigs 106844 110 221687 4565529 3 3.40 0.28 98.331 4206
IDBA-UD scaffolds 133098 93 284363 4565454 4 4.08 0.59 98.355 4216
SPAdes3.1 contigs 133088 92 285414 4558035 0 2.26 0.35 98.137 4208
SPAdes3.1 scaffolds 133309 90 285414 4558337 0 2.68 0.40 98.156 4212
                   
                   
Single-cell S. aureus                  
A5 4829 937 41828 2770402 8 24.63 0.37 91.581 1815
ABySS 43173 185 175286 2899223 4 6.49 0.43 96.578 2456
EULER-SR 7247 750 66549 2988161 46 21.85 10.67 94.436 2009
Ray 62026 84 125177 2947717 13 2.29 0.96 92.936 2412
SOAPdenovo 510 1047 27317 1473402 0 1.32 0.29 46.717 595
Velvet 15656 347 67677 2746768 3 4.41 4.27 93.181 2274
E+V-SC 32296 215 107657 2932416 5 6.92 4.89 97.519 2478
IDBA-UD contigs 87549 114 175236 2996997 7 2.43 0.66 98.655 2568
IDBA-UD scaffolds 111392 99 210360 2996115 7 2.50 1.35 98.678 2574
SPAdes3.1 contigs 148260 99 284175 2996003 4 4.03 1.01 98.598 2579
SPAdes3.1 scaffolds 159252 97 429536 2996537 4 4.59 1.04 98.615 2579

A5 and CLC 3.22.55708 were run with default parameters. ABySS 1.3.5, EULER-SR 2.0.1, Ray 2.2.0, SOAPdenovo 2.04, Velvet 1.2.07, and E+V-SC were run with vertex size 55. IDBA-UD 1.1.0 was run in its default iterative mode.

The total assembly length may be inflated, in some cases beyond the genome size, by contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The genome fraction (GF (%) column), the percentage of the E. coli or S. aureus reference genome covered by the assembly, is not affected by these issues.
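The difference between total length and genome fraction can be sketched as follows. This is a simplified illustration, not QUAST's implementation: it assumes alignments are given as half-open `(start, end)` intervals on a single reference sequence, and the function name `genome_fraction` and the toy intervals are made up for the example.

```python
def genome_fraction(aligned_intervals, reference_length):
    """Percent of reference bases covered by at least one alignment.

    Intervals are half-open (start, end) pairs in reference coordinates.
    Overlapping or duplicated intervals are merged, so each base counts once.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(aligned_intervals):
        if current_end is None or start > current_end:
            # Close the previous merged interval and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / reference_length

# A repeat assembled twice: (300, 700) appears in two contigs.
intervals = [(0, 400), (300, 700), (300, 700), (900, 1000)]
total_aligned = sum(end - start for start, end in intervals)
print(total_aligned)                     # 1300, longer than the 1000 bp reference
print(genome_fraction(intervals, 1000))  # 80.0
```

In this toy case the summed alignment length exceeds the reference length, while the genome fraction stays at 80% because every reference base is counted at most once.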
The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size. 
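A minimal Python sketch of the two statistics, assuming contig lengths are already available as a list of integers. The helper name `nx50` and the toy lengths are assumptions for illustration only.

```python
def nx50(contig_lengths, target_size):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of target_size.
    Returns 0 if the contigs never reach that total."""
    half = target_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

def n50(contig_lengths):
    # N50 uses the assembly's own total length as the target.
    return nx50(contig_lengths, sum(contig_lengths))

def ng50(contig_lengths, genome_size):
    # NG50 uses the reference genome size instead.
    return nx50(contig_lengths, genome_size)

lengths = [400, 300, 200, 100]   # toy assembly, 1000 bp in total
print(n50(lengths))              # 300
print(ng50(lengths, 1600))       # 200
print(ng50(lengths, 2400))       # 0 (assembly covers less than half the genome)
```

Because NG50 is tied to the genome size, an assembly that is much shorter than the reference is penalized, which N50 alone would not show.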
Misassemblies (MA) are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference.
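The 1 kb criterion can be sketched in Python under strong simplifying assumptions: each contig is represented by alignment blocks on a single reference sequence and strand, and only the distance between adjacent blocks is checked. QUAST's actual rules are more detailed; the `Block` type and `count_breakpoints` function are hypothetical names introduced for this example.

```python
from typing import NamedTuple

class Block(NamedTuple):
    """One aligned piece of a contig (hypothetical structure for this sketch)."""
    contig_start: int
    contig_end: int
    ref_start: int
    ref_end: int

def count_breakpoints(blocks, max_distance=1000):
    """Count adjacent block pairs whose reference positions are more than
    max_distance bp apart (the 1 kb threshold described above)."""
    ordered = sorted(blocks, key=lambda b: b.contig_start)
    count = 0
    for left, right in zip(ordered, ordered[1:]):
        if abs(right.ref_start - left.ref_end) > max_distance:
            count += 1
    return count

blocks = [
    Block(0, 5000, 10000, 15000),
    Block(5000, 9000, 15020, 19020),     # 20 bp from the previous block: fine
    Block(9000, 12000, 250000, 253000),  # about 231 kb away: flagged
]
print(count_breakpoints(blocks))  # 1
```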
Mismatch (substitution) error rate (MM) and number of indels (IND) per 100 kbp are measured in aligned regions of the contigs. 
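The per-100 kbp normalization is simple arithmetic, sketched here in Python. The event counts and aligned length below are invented for illustration and are not taken from the tables; `per_100_kbp` is a hypothetical helper name.

```python
def per_100_kbp(event_count, aligned_bases):
    """Normalize a raw count of mismatches or indels to a rate per 100 kbp
    of aligned sequence."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100_000 * event_count / aligned_bases

# Hypothetical example: 4,000,000 aligned bases with 120 mismatches and 20 indels.
aligned = 4_000_000
print(per_100_kbp(120, aligned))  # 3.0
print(per_100_kbp(20, aligned))   # 0.5
```

Normalizing by aligned length makes the MM and IND columns comparable between assemblies of different sizes.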
In each column, the best assembler by that criterion is indicated in bold.
 
SPAdes 3.1 hybrid assemblies benchmarking on Illumina + PacBio E. coli data sets.
Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
E. coli K-12 Illumina only                  
SPAdes 3.1 contigs 133088 92 285414 4558035 0 2.26 0.35 98.137 4208
E. coli K-12 Illumina + PacBio P4                  
SPAdes 3.1 contigs 4647797 5 4647797 4650744 0 (6*) 8.71 0.71 99.999 4322
SPAdes 3.1 scaffolds 4647797 5 4647797 4650744 0 (6*) 8.71 0.71 99.999 4322
* These misassemblies are not real assembly errors; they correspond to differences between the assembled genome and the reference.
 
For the benchmarks we used:
  • E. coli K-12 MG1655 Illumina standard isolate dataset outlined above
  • E. coli K-12 MG1655 PacBio RS II C2/P4 dataset available from PacBio DevNet
 
SPAdes 3.1 experimental IonTorrent benchmarking on E. coli data sets.
Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
E. coli DH10B (R17-67)                  
SPAdes 3.1 contigs 88612 100 268586 4473193 3 5.26 6.82 95.400 4123
SPAdes 3.1 scaffolds 92103 98 268586 4474050 4 5.28 6.82 95.418 4125
E. coli O157:H7 (BEA-1108)
SPAdes 3.1 contigs 146450 214 374893 5299034 1 14.64 3.08 94.717 N/A
SPAdes 3.1 scaffolds 146450 209 374893 5301354 4 16.60 3.21 94.765 N/A

For the benchmarks we used:

  • E. coli DH10B (R17-67) dataset, sequenced on a 318v2 chip and available on IonCommunity
  • E. coli O157:H7 Sakai (EHEC) (BEA-1108) dataset, sequenced on a 314 chip with HiQ enzyme and available on IonCommunity

 

Related publications

  • S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, A. Lapidus, A. D. Prjibelsky, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, J. S. McLean, R. Lasken, S. R. Clingenpeel, T. Woyke, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products. Journal of Computational Biology 20(10) (2013), 714-737. doi:10.1089/cmb.2013.0084

  • Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021

  • Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

  • Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20

  • Sergey I. Nikolenko, Anton I. Korobeynikov, and Max A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

  • Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8) (2013), pp. 1072-1075. doi:10.1093/bioinformatics/btt086

  • Andrey D. Prjibelski, Irina Vasilinetc, Anton Bankevich, Alexey Gurevich, Tatiana Krivosheeva, Sergey Nurk, Son Pham, Anton Korobeynikov, Alla Lapidus, and Pavel A. Pevzner. ExSPAnder: a universal repeat resolver for DNA fragment assembly. Bioinformatics 30(12) (2014), i293-i301. doi:10.1093/bioinformatics/btu266

 


 
“I'd like to thank you for the great job you are doing with SPAdes. It's a very useful software!”
Lionel Guy
Uppsala University, Sweden
 
“Thanks for your great SPAdes assembler, we have successfully assembled several cultured organisms and your assembler always performed best compared to other assemblers when run on the PE- and/or MP MiSeq data we generally use.”
Dr. Harald R. Gruber-Vodicka
Symbiosis Group
Max Planck Institute of Marine Microbiology, Bremen, Germany
 
"I have used SPAdes to correct errors in my metatransciptome data and it has significantly improved the data quality. Thanks!"
Burak Avci
Department of Molecular Ecology
Max Planck Institute of Marine Microbiology, Bremen, Germany
 
“We are also getting good results with SPAdes for metagenomic samples, thanks to its effort to recover as much genomic sequence as it can.”
Amr Abouelleil
Bioinformatics Assembly Analyst at Broad Institute
 
“I have recently used SPAdes to assemble reads generated on an Illumina platform (2 x 250 bp). The assemblies look very good!”
Mark de Been
Department of Medical Microbiology
University Medical Center Utrecht (UMCU) The Netherlands

 

 

Acknowledgements

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.