SPAdes 3.0 on GAGE-B data sets

The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.

Since SPAdes 3.0 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Since these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.

The original coverage of these data sets is about 500x. However, in the GAGE-B experiments, all data were down-sampled to 100x coverage because higher coverage barely affected contig size. SPAdes, in contrast, benefits from high coverage, so we decided to assemble the data sets at the original ~500x coverage. Our tables contain the GAGE-B assemblies of the 100x-coverage data and the SPAdes 3.0 assemblies of the 500x-coverage data.
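To make the coverage figures concrete, here is a minimal sketch of how average coverage and a down-sampling fraction can be computed. The function names and the read and genome numbers are hypothetical and chosen only for illustration; they are not taken from the GAGE-B data sets or from any tool used in the study.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage: total sequenced bases / genome length."""
    return num_reads * read_length / genome_size


def subsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to reach target_cov (never more than 1)."""
    if current_cov <= 0:
        raise ValueError("current coverage must be positive")
    return min(1.0, target_cov / current_cov)


# Hypothetical numbers, NOT the real GAGE-B read counts or genome sizes.
genome_size = 5_000_000   # bp
num_reads = 10_000_000    # total reads, counting both mates of each pair
read_length = 250         # bp

cov = coverage(num_reads, read_length, genome_size)
frac = subsample_fraction(cov, 100.0)
print(f"coverage: {cov:.0f}x, keep {frac:.0%} of reads for 100x")
```

With these made-up numbers the library has 500x coverage, so keeping one read pair in five would bring it down to roughly 100x.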

The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.

Four tables of results are presented below. For our tables, we used the format presented in the Supplementary Material of Magoc, et al., 2013. We used the QUality ASsessment Tool (QUAST) to calculate the same metrics used in the GAGE-B paper. The GAGE-B paper uses slightly different names than QUAST for some metrics; below, for each metric, we give the GAGE-B name and indicate the QUAST name in brackets.

Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]
N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]
Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]
Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]
N50Corr, corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]
GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]
Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]
Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
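Two of the definitions above, N50 size [NG50] and the classification of a misjoin, can be sketched in a few lines. This is an illustration of the definitions as worded in the list, not QUAST's implementation: the names Piece, ng50 and classify_misjoin, the 1-based inclusive coordinates, and the order in which the checks are applied are our assumptions.

```python
from typing import NamedTuple


class Piece(NamedTuple):
    """One side of a misjoin, as aligned to the reference (hypothetical type)."""
    ref: str     # chromosome or plasmid name
    start: int   # 1-based inclusive reference coordinates
    end: int
    strand: str  # "+" or "-"


def ng50(lengths, genome_size):
    """Length of the smallest contig such that contigs of that length or
    longer contain at least 50% of the genome (0 if they never reach 50%)."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= genome_size:
            return length
    return 0


def classify_misjoin(left, right, threshold=1000):
    """Classify a misjoin from the reference positions of its two pieces."""
    if left.ref != right.ref:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Positive gap: bases separating the pieces; negative gap: overlap size.
    gap = max(left.start, right.start) - min(left.end, right.end) - 1
    return "relocation" if abs(gap) >= threshold else "local"


# Toy example: three contigs and a 300 bp "genome".
contigs = [100, 60, 40]
n50_size = ng50(contigs, 300)

# Corrected size: split the 100 bp contig at an error into 55 bp + 45 bp.
n50_corr = ng50([55, 45, 60, 40], 300)

kind = classify_misjoin(Piece("chr1", 1, 5000, "+"),
                        Piece("chr1", 7001, 9000, "+"))
```

In this toy example the N50 size drops from 60 to 45 after splitting at the error, which is the effect N50Corr is meant to capture. Splitting whole contigs at errors only approximates NGA50, which QUAST computes from aligned blocks.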

It is important to note that Magoc, et al., 2013 used QUAST 1.3 to assess the quality of the assemblies. We used the latest version of QUAST, 2.3, so some statistics in the tables may differ slightly from those in the GAGE-B Supplementary Material. The main difference between these versions is in how Genome Fraction is computed. QUAST 2.* filters MUMmer's alignments to keep only the best ones: roughly speaking, it skips ambiguous and redundant alignments and keeps one alignment (or, in case of a misassembly, one set of non-overlapping or slightly overlapping alignments) per contig. QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST that affected the detection of misassemblies (and thus the Errors, Errors-L, and N50Corr statistics) were fixed. See the QUAST changelog for more details.
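The effect of alignment filtering on Genome Fraction and Duplication ratio can be illustrated with a small sketch. It assumes that alignments are given as 1-based inclusive intervals on a single reference sequence, and it uses the aligned length on the reference as a stand-in for the number of aligned assembly bases. The function names and the intervals are hypothetical; this is not QUAST's code.

```python
def covered_bases(intervals):
    """Number of reference positions covered by at least one interval."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def genome_fraction(intervals, genome_size):
    """Percentage of the reference covered by at least one alignment."""
    return 100.0 * covered_bases(intervals) / genome_size


def duplication_ratio(intervals):
    """Total aligned bases divided by the reference bases they cover."""
    covered = covered_bases(intervals)
    if covered == 0:
        raise ValueError("no aligned bases")
    aligned = sum(end - start + 1 for start, end in intervals)
    return aligned / covered


# Hypothetical alignments on a 1000 bp reference. The last one stands for
# an ambiguous alignment that a filtering step might discard.
all_alignments = [(1, 600), (551, 900), (951, 1000)]
best_alignments = [(1, 600), (551, 900)]

gf_all = genome_fraction(all_alignments, 1000)
gf_best = genome_fraction(best_alignments, 1000)
dup_all = duplication_ratio(all_alignments)
dup_best = duplication_ratio(best_alignments)
```

Here, dropping the ambiguous alignment lowers the genome fraction from 95% to 90% and slightly changes the duplication ratio, which is the kind of small discrepancy to expect between statistics computed with and without filtering.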

The finished references and MiSeq reads used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) correspond to exactly the same strain of each microorganism. The references used for M. abscessus and V. cholerae, however, belong to similar but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.

Click on the "Contigs" or "Scaffolds" links on the left side of each table to see the QUAST-generated web report.

Table 1. Assemblies of B. cereus (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	115	78	90	153	3335	105	53	404
	N50 (kb)	130.6	155.4	246.7	116.5	25.5	246.3	286.8	24.5
	Errors	2	5	9	9	17	0	1	3
	Errors-L	25	6	11	14	9	20	10	11
	N50Corr (kb)	130.6	150.5	246.7	100.0	25.5	246.3	286.8	24.5
	GenFrac (%)	98.6	99.3	99.2	99.2	98.9	98.3	98.8	97.8
	Unaligned	1	0	0	4	4	1	1	1
	Duplication	1.0	1.0	1.0	1.0	1.1	1.0	1.0	1.0

Scaffolds	Num	74	33	61	n/a	341	56	41	78
	N50 (kb)	135.6	431.5	337.9	n/a	25.5	456.6	775.7	247.7
	Errors	3	9	12	n/a	1	0	2	11
	Errors-L	29	13	13	n/a	1	39	11	258
	N50Corr (kb)	135.3	364.2	337.9	n/a	25.5	456.0	286.8	208.4
	GenFrac (%)	98.4	99.3	99.2	n/a	97.6	98.3	98.7	97.7
	Unaligned	0	0	0	n/a	0	1	0	1
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0

Table 2. Assemblies of R. sphaeroides (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	486	146	63	867	986	437	89	416
	N50 (kb)	21.4	31.5	130.7	15.8	9.1	33.5	551.2	24.0
	Errors	1	6	5	18	4	1	3	2
	Errors-L	3	3	4	6	3	11	5	9
	N50Corr (kb)	21.4	30.4	130.7	15.4	9.1	33.5	518.3	24.0
	GenFrac (%)	98.4	85.6	92.0	99.3	98.9	98.3	99.5	97.9
	Unaligned	0	0	1	0	3	19	48	1
	Duplication	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0

Scaffolds	Num	382	131	52	n/a	733	185	39	143
	N50 (kb)	21.4	40.3	144.8	n/a	8.0	45.1	551.2	85.3
	Errors	1	6	5	n/a	0	4	2	19
	Errors-L	3	7	7	n/a	2	214	5	185
	N50Corr (kb)	21.4	36.1	144.8	n/a	8.0	45.0	518.3	85.0
	GenFrac (%)	97.8	85.6	91.9	n/a	88.2	98.2	99.6	97.6
	Unaligned	0	0	0	n/a	0	0	1	0
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0

Table 3. Assemblies of M. abscessus (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	210	857	326	1760	1117	113	890	279
	N50 (kb)	70.4	8.7	38.2	114.1	13.3	131.6	335.3	48.2
	Errors	2	122	70	2358	180	5	12	76
	Errors-L	2	5	2	35	4	19	6	3
	N50Corr (kb)	68.5	8.3	37.2	75.0	12.8	113.3	303.8	41.5
	GenFrac (%)	99.2	96.2	98.4	99.4	99.4	99.2	99.4	99.1
	Unaligned	11	5	1	78	8	2	844	52
	Duplication	1.0	1.0	1.1	1.2	1.0	1.0	1.0	1.0

Scaffolds	Num	147	847	324	n/a	664	79	404	154
	N50 (kb)	73.2	9.1	38.2	n/a	13.3	152.6	335.3	71.0
	Errors	2	131	70	n/a	6	5	12	120
	Errors-L	3	5	2	n/a	1	31	6	19
	N50Corr (kb)	70.1	8.5	37.2	n/a	12.8	147.2	303.8	46.0
	GenFrac (%)	98.9	96.2	98.4	n/a	99.1	99.1	99.4	99.0
	Unaligned	0	5	1	n/a	4	1	363	1
	Duplication	1.0	1.0	1.1	n/a	1.0	1.0	1.0	1.0

Table 4. Assemblies of V. cholerae (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	267	241	173	431	1726	244	1798	201
	N50 (kb)	60.5	32.8	76.1	112.9	27.3	71.4	355.7	92.0
	Errors	2	17	19	106	77	16	9	14
	Errors-L	0	7	3	12	3	35	7	2
	N50Corr (kb)	60.3	32.8	76.1	108.7	27.3	65.5	355.7	63.6
	GenFrac (%)	97.2	97.0	97.7	98.4	98.3	97.4	98.0	97.8
	Unaligned	2	1	0	21	6	5	1712	1
	Duplication	1.0	1.0	1.0	1.0	1.1	1.0	1.0	1.0

Scaffolds	Num	196	241	163	n/a	309	165	932	138
	N50 (kb)	60.5	32.8	76.1	n/a	27.3	91.9	355.7	110.0
	Errors	2	17	19	n/a	2	17	8	27
	Errors-L	0	7	3	n/a	1	70	6	8
	N50Corr (kb)	60.3	32.8	76.1	n/a	27.3	89.8	355.7	63.6
	GenFrac (%)	96.7	97.0	97.7	n/a	95.7	97.1	97.9	97.6
	Unaligned	1	1	0	n/a	0	2	874	1
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0