
SPAdes 3.0 on GAGE-B data sets

The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.

Now that SPAdes 3.0 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Because these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.

The original coverage of these data sets is about 500x. In the GAGE-B experiments, however, all data were down-sampled to 100x coverage because higher coverage barely affected contig size. SPAdes, in contrast, benefits from high coverage, so we assembled the data sets at the original ~500x coverage. Our tables therefore contain the GAGE-B assemblies of the 100x-coverage data and the SPAdes 3.0 assemblies of the 500x-coverage data.
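
As a rough illustration of the coverage arithmetic behind this choice, the sketch below computes average coverage from read count, read length, and genome size, and the fraction of reads one would keep to down-sample from one coverage level to another. The function names and the example numbers are hypothetical and are not taken from the GAGE-B data sets or tools.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome length."""
    return num_reads * read_length / genome_size


def downsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to reach target_cov (never more than 1.0)."""
    if current_cov <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_cov / current_cov)


# Hypothetical numbers for illustration only:
# a 5 Mbp genome sequenced with 10 million 250 bp reads.
cov = coverage(num_reads=10_000_000, read_length=250, genome_size=5_000_000)
print(cov)                              # 500.0
print(downsample_fraction(cov, 100.0))  # 0.2
```

Under these assumed numbers, going from 500x to 100x means keeping about one read in five.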

The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.

Four tables of results are presented below. For our tables, we used the format presented in the Supplementary Material of Magoc, et al., 2013. We used the QUality ASsessment Tool (QUAST) to calculate the same metrics used in the GAGE-B paper. The GAGE-B paper uses slightly different names than QUAST for some metrics; below, for each metric, we list the GAGE-B name and give the QUAST name in brackets.
  1. Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]

  2. N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]

  3. Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]

  4. Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]

  5. N50Corr, corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]

  6. GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]

  7. Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]

  8. Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
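
The definitions of Errors, Errors-L, N50, and N50Corr above can be sketched in a few lines of Python. This is a simplified illustration, not QUAST's implementation: the `Piece` type and all function names are hypothetical, misjoin pieces are assumed to be given in forward reference coordinates, and the corrected N50 here simply cuts contigs at known breakpoints, whereas QUAST derives NGA50 from aligned blocks.

```python
from typing import List, NamedTuple, Sequence


class Piece(NamedTuple):
    """Reference placement of one side of a misjoin (hypothetical helper type)."""
    ref: str     # chromosome or plasmid name
    start: int   # first reference position, 1-based
    end: int     # last reference position, inclusive
    strand: str  # "+" or "-"


def classify_misjoin(left: Piece, right: Piece, threshold: int = 1000) -> str:
    """Label a misjoin following the Errors / Errors-L definitions above."""
    if left.ref != right.ref:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    gap = right.start - left.end - 1  # negative value means the pieces overlap
    return "relocation" if abs(gap) >= threshold else "local"


def ng50(lengths: Sequence[int], genome_size: int) -> int:
    """Length of the contig at which the running total, longest first,
    reaches 50% of the reference genome size (0 if it never does)."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


def split_at_errors(length: int, breakpoints: Sequence[int]) -> List[int]:
    """Cut one contig at each breakpoint (offsets measured within the contig)."""
    cuts = [0] + sorted(breakpoints) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


def corrected_ng50(contigs: Sequence[int],
                   breakpoints: Sequence[Sequence[int]],
                   genome_size: int) -> int:
    """NG50 recomputed after splitting every contig at its (non-local) errors."""
    pieces: List[int] = []
    for length, cuts in zip(contigs, breakpoints):
        pieces.extend(split_at_errors(length, cuts))
    return ng50(pieces, genome_size)


# Toy example: four contigs against a 1000 bp reference, one error in the longest.
contigs = [400, 300, 200, 100]
breaks = [[250], [], [], []]
print(ng50(contigs, 1000))                    # 300
print(corrected_ng50(contigs, breaks, 1000))  # 250
print(classify_misjoin(Piece("chr1", 1, 5000, "+"),
                       Piece("chr1", 7001, 9000, "+")))  # relocation
```

In the toy example a single error in the longest contig lowers the value from 300 to 250, which is the kind of gap between N50 and N50Corr that the tables report.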

It is important to note that Magoc, et al., 2013 used QUAST 1.3 to assess the quality of the assemblies. We used the latest version of QUAST, 2.3, so some statistics in the tables may differ slightly from those in the GAGE-B Supplementary Material. The main difference between these versions is in how Genome Fraction is computed. QUAST 2.* filters MUMmer's alignments to keep only the best ones. Roughly speaking, it skips ambiguous and redundant alignments and keeps one alignment per contig (or, in case of a misassembly, one set of non-overlapping or slightly overlapping alignments). QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST were fixed that affect the detection of misassemblies (and thus the Errors, Errors-L, and N50Corr statistics). See the QUAST changelog for more details.
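
To see why the choice of alignments changes Genome Fraction and Duplication ratio, here is a minimal sketch that computes both metrics from a list of alignment intervals on a single reference sequence. It is not QUAST's code: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the two example lists stand in for "all alignments" and "filtered alignments".

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end) on the reference, 1-based inclusive


def covered_bases(intervals: List[Interval]) -> int:
    """Number of reference positions covered by at least one interval."""
    covered = 0
    current_end = 0  # rightmost reference position already counted
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # fully contained in what was already counted
        covered += end - max(start, current_end + 1) + 1
        current_end = end
    return covered


def genome_fraction(intervals: List[Interval], genome_size: int) -> float:
    """Percentage of the reference covered by aligned contig bases."""
    return 100.0 * covered_bases(intervals) / genome_size


def duplication_ratio(intervals: List[Interval]) -> float:
    """Total aligned bases divided by the reference bases they cover."""
    covered = covered_bases(intervals)
    if covered == 0:
        return 0.0
    aligned = sum(end - start + 1 for start, end in intervals)
    return aligned / covered


# Toy data for a 1000 bp reference: the third interval plays the role of an
# ambiguous alignment that a filtering step would discard.
all_alignments = [(1, 600), (501, 900), (950, 1000)]
filtered = [(1, 600), (501, 900)]
print(genome_fraction(all_alignments, 1000))  # 95.1
print(genome_fraction(filtered, 1000))        # 90.0
print(round(duplication_ratio(filtered), 3))  # 1.111
```

With the extra alignment included, the toy Genome Fraction rises from 90.0% to 95.1%, which mirrors the direction of the difference described between QUAST 1.* and QUAST 2.*.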

The finished references and the MiSeq reads used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) come from exactly the same strain of each microorganism. The references used for M. abscessus and V. cholerae, however, belong to similar but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.

Click on the "Contigs" or "Scaffolds" links on the left side of each table to see the QUAST-generated web report.

 

Table 1. Assemblies of B. cereus (download contigs, scaffolds)

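| | Metric | ABySS | CABOG | MaSuRCA | MIRA | SGA | SOAPdenovo | SPAdes 3.0 | Velvet |
|---|---|---|---|---|---|---|---|---|---|
| Contigs | Num | 115 | 78 | 90 | 153 | 3335 | 105 | 53 | 404 |
| | N50 (kb) | 130.6 | 155.4 | 246.7 | 116.5 | 25.5 | 246.3 | 286.8 | 24.5 |
| | Errors | 2 | 5 | 9 | 9 | 17 | 0 | 1 | 3 |
| | Errors-L | 25 | 6 | 11 | 14 | 9 | 20 | 10 | 11 |
| | N50Corr (kb) | 130.6 | 150.5 | 246.7 | 100.0 | 25.5 | 246.3 | 286.8 | 24.5 |
| | GenFrac (%) | 98.6 | 99.3 | 99.2 | 99.2 | 98.9 | 98.3 | 98.8 | 97.8 |
| | Unaligned | 1 | 0 | 0 | 4 | 4 | 1 | 1 | 1 |
| | Duplication | 1.0 | 1.0 | 1.0 | 1.0 | 1.1 | 1.0 | 1.0 | 1.0 |
| Scaffolds | Num | 74 | 33 | 61 | n/a | 341 | 56 | 41 | 78 |
| | N50 (kb) | 135.6 | 431.5 | 337.9 | n/a | 25.5 | 456.6 | 775.7 | 247.7 |
| | Errors | 3 | 9 | 12 | n/a | 1 | 0 | 2 | 11 |
| | Errors-L | 29 | 13 | 13 | n/a | 1 | 39 | 11 | 258 |
| | N50Corr (kb) | 135.3 | 364.2 | 337.9 | n/a | 25.5 | 456.0 | 286.8 | 208.4 |
| | GenFrac (%) | 98.4 | 99.3 | 99.2 | n/a | 97.6 | 98.3 | 98.7 | 97.7 |
| | Unaligned | 0 | 0 | 0 | n/a | 0 | 1 | 0 | 1 |
| | Duplication | 1.0 | 1.0 | 1.0 | n/a | 1.0 | 1.0 | 1.0 | 1.0 |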

 

 

Table 2. Assemblies of R. sphaeroides (download contigs, scaffolds)

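| | Metric | ABySS | CABOG | MaSuRCA | MIRA | SGA | SOAPdenovo | SPAdes 3.0 | Velvet |
|---|---|---|---|---|---|---|---|---|---|
| Contigs | Num | 486 | 146 | 63 | 867 | 986 | 437 | 89 | 416 |
| | N50 (kb) | 21.4 | 31.5 | 130.7 | 15.8 | 9.1 | 33.5 | 551.2 | 24.0 |
| | Errors | 1 | 6 | 5 | 18 | 4 | 1 | 3 | 2 |
| | Errors-L | 3 | 3 | 4 | 6 | 3 | 11 | 5 | 9 |
| | N50Corr (kb) | 21.4 | 30.4 | 130.7 | 15.4 | 9.1 | 33.5 | 518.3 | 24.0 |
| | GenFrac (%) | 98.4 | 85.6 | 92.0 | 99.3 | 98.9 | 98.3 | 99.5 | 97.9 |
| | Unaligned | 0 | 0 | 1 | 0 | 3 | 19 | 48 | 1 |
| | Duplication | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| Scaffolds | Num | 382 | 131 | 52 | n/a | 733 | 185 | 39 | 143 |
| | N50 (kb) | 21.4 | 40.3 | 144.8 | n/a | 8.0 | 45.1 | 551.2 | 85.3 |
| | Errors | 1 | 6 | 5 | n/a | 0 | 4 | 2 | 19 |
| | Errors-L | 3 | 7 | 7 | n/a | 2 | 214 | 5 | 185 |
| | N50Corr (kb) | 21.4 | 36.1 | 144.8 | n/a | 8.0 | 45.0 | 518.3 | 85.0 |
| | GenFrac (%) | 97.8 | 85.6 | 91.9 | n/a | 88.2 | 98.2 | 99.6 | 97.6 |
| | Unaligned | 0 | 0 | 0 | n/a | 0 | 0 | 1 | 0 |
| | Duplication | 1.0 | 1.0 | 1.0 | n/a | 1.0 | 1.0 | 1.0 | 1.0 |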

 

 

Table 3. Assemblies of M. abscessus (download contigs, scaffolds)

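| | Metric | ABySS | CABOG | MaSuRCA | MIRA | SGA | SOAPdenovo | SPAdes 3.0 | Velvet |
|---|---|---|---|---|---|---|---|---|---|
| Contigs | Num | 210 | 857 | 326 | 1760 | 1117 | 113 | 890 | 279 |
| | N50 (kb) | 70.4 | 8.7 | 38.2 | 114.1 | 13.3 | 131.6 | 335.3 | 48.2 |
| | Errors | 2 | 122 | 70 | 2358 | 180 | 5 | 12 | 76 |
| | Errors-L | 2 | 5 | 2 | 35 | 4 | 19 | 6 | 3 |
| | N50Corr (kb) | 68.5 | 8.3 | 37.2 | 75.0 | 12.8 | 113.3 | 303.8 | 41.5 |
| | GenFrac (%) | 99.2 | 96.2 | 98.4 | 99.4 | 99.4 | 99.2 | 99.4 | 99.1 |
| | Unaligned | 11 | 5 | 1 | 78 | 8 | 2 | 844 | 52 |
| | Duplication | 1.0 | 1.0 | 1.1 | 1.2 | 1.0 | 1.0 | 1.0 | 1.0 |
| Scaffolds | Num | 147 | 847 | 324 | n/a | 664 | 79 | 404 | 154 |
| | N50 (kb) | 73.2 | 9.1 | 38.2 | n/a | 13.3 | 152.6 | 335.3 | 71.0 |
| | Errors | 2 | 131 | 70 | n/a | 6 | 5 | 12 | 120 |
| | Errors-L | 3 | 5 | 2 | n/a | 1 | 31 | 6 | 19 |
| | N50Corr (kb) | 70.1 | 8.5 | 37.2 | n/a | 12.8 | 147.2 | 303.8 | 46.0 |
| | GenFrac (%) | 98.9 | 96.2 | 98.4 | n/a | 99.1 | 99.1 | 99.4 | 99.0 |
| | Unaligned | 0 | 5 | 1 | n/a | 4 | 1 | 363 | 1 |
| | Duplication | 1.0 | 1.0 | 1.1 | n/a | 1.0 | 1.0 | 1.0 | 1.0 |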

 

 

Table 4. Assemblies of V. cholerae (download contigs, scaffolds)

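| | Metric | ABySS | CABOG | MaSuRCA | MIRA | SGA | SOAPdenovo | SPAdes 3.0 | Velvet |
|---|---|---|---|---|---|---|---|---|---|
| Contigs | Num | 267 | 241 | 173 | 431 | 1726 | 244 | 1798 | 201 |
| | N50 (kb) | 60.5 | 32.8 | 76.1 | 112.9 | 27.3 | 71.4 | 355.7 | 92.0 |
| | Errors | 2 | 17 | 19 | 106 | 77 | 16 | 9 | 14 |
| | Errors-L | 0 | 7 | 3 | 12 | 3 | 35 | 7 | 2 |
| | N50Corr (kb) | 60.3 | 32.8 | 76.1 | 108.7 | 27.3 | 65.5 | 355.7 | 63.6 |
| | GenFrac (%) | 97.2 | 97.0 | 97.7 | 98.4 | 98.3 | 97.4 | 98.0 | 97.8 |
| | Unaligned | 2 | 1 | 0 | 21 | 6 | 5 | 1712 | 1 |
| | Duplication | 1.0 | 1.0 | 1.0 | 1.0 | 1.1 | 1.0 | 1.0 | 1.0 |
| Scaffolds | Num | 196 | 241 | 163 | n/a | 309 | 165 | 932 | 138 |
| | N50 (kb) | 60.5 | 32.8 | 76.1 | n/a | 27.3 | 91.9 | 355.7 | 110.0 |
| | Errors | 2 | 17 | 19 | n/a | 2 | 17 | 8 | 27 |
| | Errors-L | 0 | 7 | 3 | n/a | 1 | 70 | 6 | 8 |
| | N50Corr (kb) | 60.3 | 32.8 | 76.1 | n/a | 27.3 | 89.8 | 355.7 | 63.6 |
| | GenFrac (%) | 96.7 | 97.0 | 97.7 | n/a | 95.7 | 97.1 | 97.9 | 97.6 |
| | Unaligned | 1 | 1 | 0 | n/a | 0 | 2 | 874 | 1 |
| | Duplication | 1.0 | 1.0 | 1.0 | n/a | 1.0 | 1.0 | 1.0 | 1.0 |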