The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.
Now that SPAdes 3.0 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Since these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.
The original coverage of these data sets is about 500x. In the GAGE-B experiments, however, all data were down-sampled to 100x coverage, because higher coverage barely affected contig size. SPAdes, in contrast, benefits from high coverage, so we decided to assemble the data sets at the original ~500x coverage. Our tables therefore contain the GAGE-B assemblies of the 100x-coverage data and the SPAdes 3.0 assemblies of the 500x-coverage data.
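The relation between coverage and down-sampling can be sketched as follows. This is only an illustration, not the procedure used in GAGE-B: the function names, the genome size, and the read count are hypothetical, and coverage is approximated as total sequenced bases divided by genome size.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def downsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads (or read pairs) to keep to reach the target coverage."""
    if current_coverage <= 0:
        raise ValueError("current coverage must be positive")
    # Never keep more than all reads, even if the target exceeds the input.
    return min(1.0, target_coverage / current_coverage)


# Hypothetical numbers: a 5 Mbp genome sequenced with 10 million 250 bp reads.
depth = coverage(num_reads=10_000_000, read_length=250, genome_size=5_000_000)
keep = downsample_fraction(depth, target_coverage=100)
print(f"depth = {depth:.0f}x, keep {keep:.0%} of the reads for 100x")
```

With these made-up numbers, going from 500x to 100x means keeping one read pair in five.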
The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.
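The tables report the following metrics; the corresponding QUAST metric name is given in square brackets.
Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]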
N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]
Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]
Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]
N50Corr, corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]
GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]
Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]
Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
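As an illustration of the N50, N50Corr, and error definitions above, here is a minimal Python sketch. It is not the code used by GAGE-B or QUAST: the function names, the toy contigs, and the order in which the misjoin types are checked are assumptions made for this example, and the gap between the left and right pieces is passed in as a signed number (negative for an overlap).

```python
from typing import Dict, List

MIN_ERROR_DISTANCE = 1000  # bp threshold taken from the definitions above


def ng50(contig_lengths: List[int], genome_size: int) -> int:
    """Size of the smallest contig such that contigs of that size or larger
    contain at least 50% of the genome (0 if that is never reached)."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return 0


def split_at_errors(contigs: Dict[str, int],
                    breakpoints: Dict[str, List[int]]) -> List[int]:
    """Split every contig at its error positions and return the piece lengths."""
    pieces = []
    for name, length in contigs.items():
        cuts = [0] + sorted(breakpoints.get(name, [])) + [length]
        pieces.extend(b - a for a, b in zip(cuts, cuts[1:]) if b > a)
    return pieces


def classify_misjoin(left_chrom: str, right_chrom: str,
                     left_strand: str, right_strand: str, gap: int) -> str:
    """Classify a misjoin from where its left and right pieces map.
    The check order (translocation, inversion, distance) is an assumption."""
    if left_chrom != right_chrom:
        return "translocation"
    if left_strand != right_strand:
        return "inversion"
    if abs(gap) >= MIN_ERROR_DISTANCE:
        return "relocation"
    return "local"


# Toy assembly of a 1000 bp "genome" with one error inside contig c1.
contigs = {"c1": 600, "c2": 300, "c3": 100}
errors = {"c1": [250]}
n50 = ng50(list(contigs.values()), genome_size=1000)
n50_corr = ng50(split_at_errors(contigs, errors), genome_size=1000)
print(n50, n50_corr)
```

In this toy case the uncorrected value is 600 and the corrected value drops to 300, because the largest contig is broken into pieces of 250 and 350 bp. Only non-local errors should be listed as breakpoints, since local errors are not used when computing N50Corr.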
It is important to note that Magoc, et al., 2013 used QUAST 1.3 to assess the quality of the assemblies. We used the latest version of QUAST, 2.3, so some statistics in the tables may differ slightly from those in the GAGE-B Supplementary Material. The main difference between these versions is in how Genome Fraction is computed. QUAST 2.* filters MUMmer's alignments to keep only the best ones. Roughly speaking, it skips ambiguous and redundant alignments and keeps one alignment per contig (or, in case of a misassembly, one set of non-overlapping or slightly overlapping alignments). QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST were fixed, which affects the detection of misassemblies (and thus the Errors, Errors-L, and N50Corr statistics). See the QUAST changelog for more details.
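To see why alignment filtering changes Genome Fraction and Duplication ratio, consider the simplified sketch below. It is not QUAST's implementation: alignments are reduced to hypothetical half-open intervals on the reference, aligned assembly bases are approximated by the interval lengths (indels are ignored), and the example intervals are invented.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def union_length(intervals: List[Interval]) -> int:
    """Number of reference bases covered by at least one interval."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def genome_fraction(alignments: List[Interval], genome_size: int) -> float:
    """Fraction of the reference covered by aligned contigs."""
    return union_length(alignments) / genome_size


def duplication_ratio(alignments: List[Interval]) -> float:
    """Aligned assembly bases divided by covered reference bases."""
    aligned = sum(end - start for start, end in alignments)
    return aligned / union_length(alignments)


# Invented example on a 1000 bp reference.
best_only = [(0, 400), (500, 900)]                  # filtered alignments
all_hits = best_only + [(400, 450), (100, 200)]     # plus ambiguous/redundant hits

print(genome_fraction(best_only, 1000), duplication_ratio(best_only))
print(genome_fraction(all_hits, 1000), duplication_ratio(all_hits))
```

In this invented case, counting every alignment raises the genome fraction from 0.80 to 0.85 and pushes the duplication ratio above 1, which is the kind of shift to expect between the two ways of counting.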
The finished references and the MiSeq reads used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) correspond to exactly the same strain of each microorganism. The references used for M. abscessus and V. cholerae, however, belong to similar but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.