QUAST 2.3 released

Long-awaited contig alignment plots (see an example below), updated misassembly detection logic, a full report in PDF format, and many other features are included!

See Changes for a full list of new features and fixed bugs.

See the new version of the Manual, which includes descriptions of new options and reports and a FAQ section.

All other news and useful links are presented on QUAST page.

You can download QUAST 2.3 and previous versions here.

Clone of SPAdes Genome Assembler (version 20.01.2014)

SPAdes 3.0 is out!

Now with support for IonTorrent and PacBio, a module for highly polymorphic diploid genomes, and many other new features!

See all changes in changelog.

SPAdes Assembler

SPAdes manual with installation guide (ver 3.0)

dipSPAdes manual

Download SPAdes

Assembling long Illumina paired-end reads (2x150 and 2x250) application note

SPAdes on GAGE-B data sets benchmark

Benchmark for other data sets

Support e-mail: spades.support@bioinf.spbau.ru

Follow @spadesassembler

For the benchmarks we used:

MDA single-cell E. coli; 6.3 Gb, 29M reads, 2x100 bp, insert size ~270 bp (Illumina Genome Analyzer IIx)
Standard isolate E. coli; 6.2 Gb, 28M reads, 2x100 bp, insert size ~215 bp (Illumina Genome Analyzer IIx)
MDA single-cell S. aureus; 14.6 Gb, 33M reads, 2x100 bp, insert size ~214 bp (Illumina Genome Analyzer IIx)

E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes. S. aureus USA300 FPR3757 (chromosome and three plasmids) reference length is 2917469 bp with 2622 annotated genes.

Only contigs of 500 bp and longer were taken into consideration. Tables were obtained using QUAST 2.3.

Assembly	NG50	# contigs	Largest	Total length	MA	MM	IND	GF (%)	# genes
Single-cell E. coli
A5	14399	745	101584	4441145	8	12.01	0.17	89.880	3444
ABySS	68534	179	178720	4345617	6	3.32	1.68	88.268	3704
CLC	32506	503	113285	4656964	2	5.53	1.42	92.291	3768
EULER-SR	26662	429	140518	4248713	17	10.87	35.67	84.898	3416
Ray	45448	361	210820	4379139	17	6.29	2.83	88.372	3636
SOAPdenovo	1540	1166	51517	2958144	1	1.87	0.11	57.672	1766
Velvet	22648	261	132865	3501984	2	2.19	1.23	73.765	3080
E+V-SC	32051	344	132865	4540286	2	2.35	0.73	91.744	3771
IDBA-UD contigs	98306	244	284464	4814043	8	5.09	0.27	95.210	4045
IDBA-UD scaffolds	109057	229	284464	4813609	8	5.14	0.77	95.199	4052
SPAdes2.5 contigs	110081	240	268493	4797724	1	3.52	0.64	94.926	4037
SPAdes2.5 scaffolds	112393	234	268493	4799671	1	4.36	0.79	94.948	4042

Isolate E. coli
A5	43651	176	181690	4551797	0	0.40	0.11	98.017	4163
ABySS	106155	96	221861	4619631	2	3.77	0.41	98.974	4241
CLC	86964	112	221549	4550314	1	1.96	0.33	98.094	4205
EULER-SR	110153	100	221409	4574240	8	3.16	10.33	98.102	4192
Ray	86246	98	221942	4634429	2	2.14	0.09	96.903	4136
SOAPdenovo	49626	181	165487	4535469	0	0.15	0.11	97.696	4132
Velvet	82776	120	242032	4554702	3	2.57	0.37	98.175	4196
E+V-SC	54856	171	166115	4539639	0	1.30	0.15	97.795	4134
IDBA-UD contigs	106844	110	221687	4565529	3	3.40	0.31	98.331	4206
IDBA-UD scaffolds	133098	93	284363	4565454	4	4.08	0.61	98.355	4216
SPAdes2.5 contigs	133088	92	285414	4558033	0	2.17	0.33	98.137	4208
SPAdes2.5 scaffolds	133309	90	285414	4558337	0	2.59	0.42	98.156	4212


Single-cell S. aureus
A5	4829	937	41828	2770402	9	24.63	0.37	91.581	1815
ABySS	43173	185	175286	2899223	4	6.49	0.46	96.578	2456
EULER-SR	7247	750	66549	2988161	42	21.85	13.76	94.395	2008
Ray	62026	84	125177	2947717	13	2.29	0.96	92.936	2412
SOAPdenovo	510	1047	27317	1473402	0	1.32	0.29	46.717	595
Velvet	15656	347	67677	2746768	3	4.41	4.49	93.181	2274
E+V-SC	32296	215	107657	2932416	6	6.92	5.03	97.437	2477
IDBA-UD contigs	87549	114	175236	2996997	7	2.43	0.66	98.583	2567
IDBA-UD scaffolds	111392	99	210360	2996115	7	2.50	1.35	98.606	2573
SPAdes2.5 contigs	148260	101	284175	2996547	4	4.23	1.02	98.726	2544
SPAdes2.5 scaffolds	159252	99	429536	2997079	4	4.72	1.09	98.744	2544

A5 and CLC 3.22.55708 were run with default parameters. ABySS 1.3.5, EULER-SR 2.0.1, Ray 2.2.0, SOAPdenovo 2.04, Velvet 1.2.07, and E+V-SC were run with vertex size 55.

IDBA-UD 1.1.0 was run in its default iterative mode.

The total assembly size may increase (and in some cases exceed the genome size) due to contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The percentage of the E. coli and S. aureus genomes covered (the GF (%), or Genome fraction (%), column) filters out these issues.

The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size.

Misassemblies (MA) are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference.

Mismatch (substitution) error rate (MM) and number of indels (IND) per 100 kbp are measured in aligned regions of the contigs.

In each column, the best assembler by that criterion is indicated in bold.

Related publications

S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, A. Lapidus, A. D. Prjibelsky, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, J. S. McLean, R. Lasken, S. R. Clingenpeel, T. Woyke, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products. Journal of Computational Biology 20(10) (2013), 714-737. doi:10.1089/cmb.2013.0084
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021
Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20
Sergey I. Nikolenko, Anton I. Korobeynikov and Max. A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

“I'd like to thank you for the great job you are doing with SPAdes. It's a very useful software!”

Lionel Guy

Uppsala University, Sweden

“Thanks for your great SPAdes assembler, we have successfully assembled several cultured organisms and your assembler always performed best compared to other assemblers when run on the PE- and/or MP MiSeq data we generally use.”

Dr. Harald R. Gruber-Vodicka

Symbiosis Group

Max Planck Institute of Marine Microbiology, Bremen, Germany

“We are also getting good results with SPAdes for metagenomic samples, thanks to its effort to recover as much genomic sequence as it can.”

Amr Abouelleil

Bioinformatics Assembly Analyst at Broad Institute

“I have recently used SPAdes to assembly reads generated on an Illumina platform (2 x 250 bp). The assemblies look very good!”

Mark de Been

Department of Medical Microbiology

University Medical Center Utrecht (UMCU) The Netherlands

Acknowledgements

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

SPAdes Genome Assembler

SPAdes 3.0 is out!

Now with support for IonTorrent and PacBio, a module for highly polymorphic diploid genomes, and many other new features!

See all changes in changelog.

SPAdes Assembler

SPAdes manual with installation guide (ver 3.0)

dipSPAdes manual

Download SPAdes

Assembling long Illumina paired-end reads (2x150 and 2x250) application note

SPAdes on GAGE-B data sets benchmark

Benchmark for other data sets

Support e-mail: spades.support@bioinf.spbau.ru

Follow @spadesassembler

For the benchmarks we used:

MDA single-cell E. coli; 6.3 Gb, 29M reads, 2x100 bp, insert size ~270 bp (Illumina Genome Analyzer IIx)
Standard isolate E. coli; 6.2 Gb, 28M reads, 2x100 bp, insert size ~215 bp (Illumina Genome Analyzer IIx)
MDA single-cell S. aureus; 14.6 Gb, 33M reads, 2x100 bp, insert size ~214 bp (Illumina Genome Analyzer IIx)

E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes. S. aureus USA300 FPR3757 (chromosome and three plasmids) reference length is 2917469 bp with 2622 annotated genes.

Only contigs of 500 bp and longer were taken into consideration. Tables were obtained using QUAST 2.3.

Assembly	NG50	# contigs	Largest	Total length	MA	MM	IND	GF (%)	# genes
Single-cell E. coli
A5	14399	745	101584	4441145	8	12.01	0.14	89.880	3444
ABySS	68534	179	178720	4345617	6	3.32	0.81	88.268	3704
CLC	32506	503	113285	4656964	2	5.53	0.91	92.291	3768
EULER-SR	26662	429	140518	4248713	19	10.87	19.40	84.898	3416
Ray	45448	361	210820	4379139	17	6.22	1.29	88.372	3636
SOAPdenovo	1540	1166	51517	2958144	1	1.87	0.11	57.672	1766
Velvet	22648	261	132865	3501984	2	2.19	1.20	73.765	3080
E+V-SC	32051	344	132865	4540286	2	2.33	0.68	91.744	3771
IDBA-UD contigs	98306	244	284464	4814043	8	5.09	0.25	95.210	4045
IDBA-UD scaffolds	109057	229	284464	4813609	8	5.14	0.72	95.199	4052
SPAdes3.0 contigs	110081	240	268493	4798198	1	3.54	0.64	94.940	4038
SPAdes3.0 scaffolds	112393	234	268493	4800145	1	4.34	0.79	94.962	4043

Isolate E. coli
A5	43651	176	181690	4551797	0	0.40	0.09	98.017	4163
ABySS	106155	96	221861	4619631	2	3.77	0.39	98.974	4241
CLC	86964	112	221549	4550314	1	1.96	0.29	98.094	4205
EULER-SR	110153	100	221409	4574240	9	3.16	5.03	98.102	4192
Ray	86246	98	221942	4634429	2	2.14	0.09	96.903	4136
SOAPdenovo	49626	181	165487	4535469	0	0.15	0.09	97.696	4132
Velvet	82776	120	242032	4554702	3	2.57	0.33	98.175	4196
E+V-SC	54856	171	166115	4539639	0	1.30	0.11	97.795	4134
IDBA-UD contigs	106844	110	221687	4565529	3	3.40	0.28	98.331	4206
IDBA-UD scaffolds	133098	93	284363	4565454	4	4.08	0.59	98.355	4216
SPAdes3.0 contigs	133088	92	285414	4558033	0	2.17	0.33	98.137	4208
SPAdes3.0 scaffolds	133309	90	285414	4558337	0	2.59	0.42	98.156	4212


Single-cell S. aureus
A5	4829	937	41828	2770402	8	24.63	0.37	91.581	1815
ABySS	43173	185	175286	2899223	4	6.49	0.43	96.578	2456
EULER-SR	7247	750	66549	2988161	46	21.85	10.67	94.436	2009
Ray	62026	84	125177	2947717	13	2.29	0.96	92.936	2412
SOAPdenovo	510	1047	27317	1473402	0	1.32	0.29	46.717	595
Velvet	15656	347	67677	2746768	3	4.41	4.27	93.181	2274
E+V-SC	32296	215	107657	2932416	5	6.92	4.89	97.519	2478
IDBA-UD contigs	87549	114	175236	2996997	7	2.43	0.66	98.655	2568
IDBA-UD scaffolds	111392	99	210360	2996115	7	2.50	1.35	98.678	2574
SPAdes3.0 contigs	148260	101	284175	2996547	4	4.14	1.01	98.596	2579
SPAdes3.0 scaffolds	159252	99	429536	2997079	4	4.62	1.08	98.614	2579

A5 and CLC 3.22.55708 were run with default parameters. ABySS 1.3.5, EULER-SR 2.0.1, Ray 2.2.0, SOAPdenovo 2.04, Velvet 1.2.07, and E+V-SC were run with vertex size 55. IDBA-UD 1.1.0 was run in its default iterative mode.

The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size.
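
The difference between N50 and NG50 can be illustrated with a minimal Python sketch. It is not QUAST's implementation; the function names and the toy contig lengths are assumptions made for illustration only.

```python
def nx50(contig_lengths, total_length):
    """Return the length L such that contigs of length >= L together
    cover at least half of total_length, or None if they never do."""
    half = total_length / 2.0
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return None


def n50(contig_lengths):
    # N50: the threshold is half of the assembly size.
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    # NG50: the threshold is half of the reference genome size.
    return nx50(contig_lengths, genome_size)


contigs = [400, 300, 200, 100]  # toy contig lengths, not benchmark data
print(n50(contigs))             # threshold is half of the 1000 bp assembly
print(ng50(contigs, 2000))      # threshold is half of a 2000 bp genome
```

When the assembly covers less than half of the genome, this sketch returns None for NG50, which is one possible convention for an undefined value.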

Misassemblies (MA) are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference.

Mismatch (substitution) error rate (MM) and number of indels (IND) per 100 kbp are measured in aligned regions of the contigs.
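
As a worked example of the normalization, the following Python sketch converts a raw event count into a rate per 100 kbp of aligned sequence. The function name and the counts are hypothetical and are not taken from the tables above.

```python
def per_100_kbp(event_count, aligned_bases):
    """Normalize a count of mismatches or indels to events per 100 kbp
    of aligned contig sequence."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100000.0 * event_count / aligned_bases


# Hypothetical numbers: 99 mismatches observed in 4.5 Mbp of aligned sequence.
print(round(per_100_kbp(99, 4500000), 2))
```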

In each column, the best assembler by that criterion is indicated in bold.

SPAdes 3.0 hybrid assemblies benchmarking on Illumina + PacBio E. coli data sets.

Assembly	NG50	# contigs	Largest	Total length	MA	MM	IND	GF (%)	# genes
E. coli K-12 Illumina only
SPAdes 3.0 contigs	133088	92	285414	4558033	0	2.17	0.33	98.137	4208
E. coli K-12 Illumina + PacBio P4
SPAdes 3.0 contigs	4647797	5	4647797	4650744	0 (6*)	8.71	0.71	99.999	4322
SPAdes 3.0 scaffolds	4647797	5	4647797	4650744	0 (6*)	8.71	0.71	99.999	4322

* These misassemblies are not real errors; they correspond to differences with respect to the reference.

For the benchmarks we used:

E. coli K-12 MG1655 Illumina standard isolate dataset outlined above
E. coli K-12 MG1655 PacBio RS II C2/P4 dataset available from PacBio DevNet

SPAdes 3.0 experimental IonTorrent benchmarking on E. coli data sets.

Assembly	NG50	# contigs	Largest	Total length	MA	MM	IND	GF (%)	# genes
E. coli DH10B (R17-67)
SPAdes 3.0 contigs	97052	109	326325	4495193	2	1.69	8.62	95.840	4142
SPAdes 3.0 scaffolds	97052	108	326325	4495961	3	1.69	8.62	95.857	4144
E. coli O157:H7 (BEA-1108)
SPAdes 3.0 contigs	145024	220	316929	5395996	2	7.89	3.51	96.314	N/A
SPAdes 3.0 scaffolds	145024	219	316929	5396398	2	8.30	3.58	96.310	N/A

For the benchmarks we used:

E. coli DH10B (R17-67) dataset, sequenced on a 318v2 chip and available on IonCommunity
E. coli O157:H7 Sakai (EHEC) (BEA-1108) dataset, sequenced on a 314 chip with HiQ enzyme and available on IonCommunity

Related publications

S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, A. Lapidus, A. D. Prjibelsky, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, J. S. McLean, R. Lasken, S. R. Clingenpeel, T. Woyke, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products. Journal of Computational Biology 20(10) (2013), 714-737. doi:10.1089/cmb.2013.0084
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021
Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20
Sergey I. Nikolenko, Anton I. Korobeynikov and Max. A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7
Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8) (2013), pp. 1072-1075. doi:10.1093/bioinformatics/btt086

“I'd like to thank you for the great job you are doing with SPAdes. It's a very useful software!”

Lionel Guy

Uppsala University, Sweden

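“Thanks for your great SPAdes assembler, we have successfully assembled several cultured organisms and your assembler always performed best compared to other assemblers when run on the PE- and/or MP MiSeq data we generally use.”

Dr. Harald R. Gruber-Vodicka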

Symbiosis Group

Max Planck Institute of Marine Microbiology, Bremen, Germany

“We are also getting good results with SPAdes for metagenomic samples, thanks to its effort to recover as much genomic sequence as it can.”

Amr Abouelleil

Bioinformatics Assembly Analyst at Broad Institute

“I have recently used SPAdes to assembly reads generated on an Illumina platform (2 x 250 bp). The assemblies look very good!”

Mark de Been

Department of Medical Microbiology

University Medical Center Utrecht (UMCU) The Netherlands

Acknowledgements

SPAdes 3.0 on GAGE-B data sets

The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.

Since SPAdes 3.0 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Since these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.

The original coverage of those data sets is about 500x. However, in the GAGE-B experiments, all data was down-sampled to 100x coverage, because higher coverage barely affected contig size. Meanwhile, SPAdes benefits from high coverage, so we decided to assemble the data sets with the original ~500x coverage. Our tables contain the GAGE-B assemblies of 100x-coverage data, and the SPAdes 3.0 assembly of 500x-coverage data.
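
The arithmetic behind down-sampling from about 500x to 100x can be sketched in Python as follows. This is only an illustration: the procedure actually used in the GAGE-B study is not described here, and the function names and the random per-pair sampling strategy are assumptions.

```python
import random


def coverage(total_read_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome length."""
    return float(total_read_bases) / genome_size


def downsample_fraction(original_coverage, target_coverage):
    """Fraction of reads to keep; never more than all of them."""
    return min(1.0, float(target_coverage) / original_coverage)


def subsample(read_pairs, fraction, seed=0):
    """Keep each read pair independently with probability `fraction`."""
    rng = random.Random(seed)
    return [pair for pair in read_pairs if rng.random() < fraction]


# Going from ~500x to 100x keeps about one read pair in five.
print(downsample_fraction(500, 100))
```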

The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.

Four tables of results are presented below. For our tables, we used the format that was presented in the Supplementary Material of Magoc, et al., 2013. We used the QUality ASsessment Tool (QUAST) to calculate the same metrics used in the GAGE-B paper. The GAGE-B paper used slightly different names than QUAST for some metrics; below, for each metric, we list the GAGE-B name, and indicate the QUAST name in brackets.

Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]
N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]
Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]
Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]
N50Corr, corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]
GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]
Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]
Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
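
The Errors and Errors-L definitions above can be sketched as a small classifier. This is an illustrative simplification, not the code used by QUAST or GAGE-B: the Piece record, its coordinate convention, and the function name are assumptions, and each misjoin is assumed to be already detected and reduced to its two flanking alignments.

```python
from collections import namedtuple

# One aligned piece of a contig: reference sequence name, strand ("+" or "-"),
# and 1-based inclusive start/end coordinates on the reference (assumed layout).
Piece = namedtuple("Piece", ["ref", "strand", "start", "end"])

THRESHOLD = 1000  # bp, the cutoff quoted in the definitions above


def classify_misjoin(left, right, threshold=THRESHOLD):
    """Classify a misjoin from the alignments of its left and right pieces."""
    if left.ref != right.ref:
        return "translocation"  # different chromosomes or plasmids
    if left.strand != right.strand:
        return "inversion"      # opposite strands of the same sequence
    first, second = sorted([left, right], key=lambda p: p.start)
    # Positive value: gap between the pieces; negative value: overlap.
    distance = second.start - first.end - 1
    if abs(distance) >= threshold:
        return "relocation"
    return "local"              # counted under Errors-L, not Errors


left = Piece("chromosome", "+", 1, 5000)
print(classify_misjoin(left, Piece("chromosome", "+", 7001, 9000)))
print(classify_misjoin(left, Piece("chromosome", "+", 5201, 9000)))
```

Under these assumptions, Errors would be the number of misjoins classified as relocation, translocation, or inversion, and Errors-L the number classified as local.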

It is important to note that Magoc, et al., 2013 used QUAST 1.3 for assessing the quality of the assemblies. We used the latest version of QUAST, 2.3, so some statistics in the tables may differ slightly from the ones in the GAGE-B Supplementary Material. The main difference between these versions is in computing Genome Fraction. QUAST 2.* filters MUMmer's alignments to keep only the best ones. Roughly speaking, it skips ambiguous and redundant alignments to keep one alignment (or one set of non-overlapping or slightly overlapping alignments in the case of a misassembly) per contig. QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST were fixed, which affects the detection of misassemblies (and thus the Errors, Errors-L, and N50Corr statistics). See the QUAST changelog for more details.

Finished references and MiSeq reads that have been used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) correspond to exactly the same strains of each microorganism. References used for M. abscessus and V. cholerae, however, belong to similar, but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.

Click on the "Contigs" or "Scaffolds" links on the left side of each table to see the QUAST-generated web report.

Table 1. Assemblies of B. cereus (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	115	78	90	153	3335	105	53	404
	N50 (kb)	130.6	155.4	246.7	116.5	25.5	246.3	286.8	24.5
	Errors	2	5	9	9	17	0	1	3
	Errors-L	25	6	11	14	9	20	10	11
	N50Corr (kb)	130.6	150.5	246.7	100.0	25.5	246.3	286.8	24.5
	GenFrac (%)	98.6	99.3	99.2	99.2	98.9	98.3	98.8	97.8
	Unaligned	1	0	0	4	4	1	1	1
	Duplication	1.0	1.0	1.0	1.0	1.1	1.0	1.0	1.0

Scaffolds	Num	74	33	61	n/a	341	56	41	78
	N50 (kb)	135.6	431.5	337.9	n/a	25.5	456.6	775.7	247.7
	Errors	3	9	12	n/a	1	0	2	11
	Errors-L	29	13	13	n/a	1	39	11	258
	N50Corr (kb)	135.3	364.2	337.9	n/a	25.5	456.0	286.8	208.4
	GenFrac (%)	98.4	99.3	99.2	n/a	97.6	98.3	98.7	97.7
	Unaligned	0	0	0	n/a	0	1	0	1
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0

Table 2. Assemblies of R. sphaeroides (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	486	146	63	867	986	437	89	416
	N50 (kb)	21.4	31.5	130.7	15.8	9.1	33.5	551.2	24.0
	Errors	1	6	5	18	4	1	3	2
	Errors-L	3	3	4	6	3	11	5	9
	N50Corr (kb)	21.4	30.4	130.7	15.4	9.1	33.5	518.3	24.0
	GenFrac (%)	98.4	85.6	92.0	99.3	98.9	98.3	99.5	97.9
	Unaligned	0	0	1	0	3	19	48	1
	Duplication	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0

Scaffolds	Num	382	131	52	n/a	733	185	39	143
	N50 (kb)	21.4	40.3	144.8	n/a	8.0	45.1	551.2	85.3
	Errors	1	6	5	n/a	0	4	2	19
	Errors-L	3	7	7	n/a	2	214	5	185
	N50Corr (kb)	21.4	36.1	144.8	n/a	8.0	45.0	518.3	85.0
	GenFrac (%)	97.8	85.6	91.9	n/a	88.2	98.2	99.6	97.6
	Unaligned	0	0	0	n/a	0	0	1	0
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0

Table 3. Assemblies of M. abscessus (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	210	857	326	1760	1117	113	890	279
	N50 (kb)	70.4	8.7	38.2	114.1	13.3	131.6	335.3	48.2
	Errors	2	122	70	2358	180	5	12	76
	Errors-L	2	5	2	35	4	19	6	3
	N50Corr (kb)	68.5	8.3	37.2	75.0	12.8	113.3	303.8	41.5
	GenFrac (%)	99.2	96.2	98.4	99.4	99.4	99.2	99.4	99.1
	Unaligned	11	5	1	78	8	2	844	52
	Duplication	1.0	1.0	1.1	1.2	1.0	1.0	1.0	1.0

Scaffolds	Num	147	847	324	n/a	664	79	404	154
	N50 (kb)	73.2	9.1	38.2	n/a	13.3	152.6	335.3	71.0
	Errors	2	131	70	n/a	6	5	12	120
	Errors-L	3	5	2	n/a	1	31	6	19
	N50Corr (kb)	70.1	8.5	37.2	n/a	12.8	147.2	303.8	46.0
	GenFrac (%)	98.9	96.2	98.4	n/a	99.1	99.1	99.4	99.0
	Unaligned	0	5	1	n/a	4	1	363	1
	Duplication	1.0	1.0	1.1	n/a	1.0	1.0	1.0	1.0

Table 4. Assemblies of V. cholerae (download contigs, scaffolds)

		ABySS	CABOG	MaSuRCA	MIRA	SGA	SOAPdenovo	SPAdes 3.0	Velvet
Contigs	Num	267	241	173	431	1726	244	1798	201
	N50 (kb)	60.5	32.8	76.1	112.9	27.3	71.4	355.7	92.0
	Errors	2	17	19	106	77	16	9	14
	Errors-L	0	7	3	12	3	35	7	2
	N50Corr (kb)	60.3	32.8	76.1	108.7	27.3	65.5	355.7	63.6
	GenFrac (%)	97.2	97.0	97.7	98.4	98.3	97.4	98.0	97.8
	Unaligned	2	1	0	21	6	5	1712	1
	Duplication	1.0	1.0	1.0	1.0	1.1	1.0	1.0	1.0

Scaffolds	Num	196	241	163	n/a	309	165	932	138
	N50 (kb)	60.5	32.8	76.1	n/a	27.3	91.9	355.7	110.0
	Errors	2	17	19	n/a	2	17	8	27
	Errors-L	0	7	3	n/a	1	70	6	8
	N50Corr (kb)	60.3	32.8	76.1	n/a	27.3	89.8	355.7	63.6
	GenFrac (%)	96.7	97.0	97.7	n/a	95.7	97.1	97.9	97.6
	Unaligned	1	1	0	n/a	0	2	874	1
	Duplication	1.0	1.0	1.0	n/a	1.0	1.0	1.0	1.0

SPAdes 3.0 is out

Now with support for IonTorrent and PacBio, a module for highly polymorphic diploid genomes, and many other new features. Check out the details here.

AZ Orthofinder

Download

Repository: https://github.com/vladsaveliev/az_orthofinder

Installation

Just extract the archive. You will find scenario_1.py, scenario_2.py and test_input inside the extracted folder.

Note: you will need some third-party software to be installed on your system for running the tool. See section System Requirements for details.

Scenario 1

The scenario_1.py script initializes a database of orthologous groups. It generates the resulting orthogroups.tsv and, for further extension, also produces the following intermediate results for the second scenario:

— the proteomes directory with correctly adjusted proteomes,

— the annotations directory with GB files from NCBI,

— and the intermediate/blasted.tsv file.

Usage examples:

1. Fasta-proteins (optionally with annotations from prodigal). See example in test_input/proteins. Filenames will be taken as taxon codes. Uses the Internet to download GB annotations; if the Internet is off, a short version of the output will be produced, containing only taxon|protein ids.

./scenario_1.py --proteomes test_input/proteins -o output_test_proteomes

2. GB-annotations (Example: test_input/gbs).

./scenario_1.py --gbs test_input/gbs -o output_test_gbs

3. A list of reference ids. (Example: test_input/ids.txt). The tool will download references from Genbank. This step requires an Internet connection.

./scenario_1.py -i test_input/ids.txt -o output_test_ids

4. A list of species names (Example: test_input/species.txt). In this case, the tool will search the NCBI server using the following query (considering a species name is Escherichia coli):

(Escherichia coli[Organism] AND (complete genome[Title] OR complete sequence[Title])) NOT (partial[Title])

Note that you need to specify full species name like Escherichia coli (not E. coli).
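
How such a query could be assembled from a species name is sketched below in Python. This is not the tool's own code: the function name is hypothetical, the query is written with balanced parentheses (an assumption about the intended form), and the check for abbreviated genus names is only a convenience added for illustration.

```python
def build_ncbi_query(species_name):
    """Build the search query for a full species name such as
    'Escherichia coli' (hypothetical helper, not part of the tool)."""
    words = species_name.split()
    if len(words) < 2 or words[0].endswith("."):
        raise ValueError(
            "Full species name required, e.g. 'Escherichia coli', got %r"
            % species_name)
    name = " ".join(words)
    return ("(%s[Organism] AND (complete genome[Title] "
            "OR complete sequence[Title])) NOT (partial[Title])" % name)


print(build_ncbi_query("Klebsiella pneumoniae"))
```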

In particular, we processed E. coli and K. pneumoniae this way, using files with species lists (see the test_input directory):

kpneumonia_list.txt with the following single line inside:

Klebsiella pneumoniae

Command line:

./scenario_1.py --species test_input/kpneumonia_list.txt -o kpneumonia

ecoli_list.txt containing the following:

Escherichia coli

Command line:

./scenario_1.py --species test_input/ecoli_list.txt -o ecoli

Scenario 2

The scenario_2.py script is meant to extend orthogroups. It must be run on a directory produced by the scenario_1.py script, since it extends its blasted.tsv and also reuses the proteomes and GB annotations.

Existing blast results are reused because the all-against-all blast process is the most time-consuming step. The blast results could instead be stored between runs in a database table; nevertheless, we decided not to rely on SQL here, because we think a file is clearer for users:

1. The users don't have to remember the names of their tables, so they won't overwrite or damage something important.

2. The database can be cleaned up after any tool usage.

3. The results are easier to send between computers.

There are 2 possible types of scenario_2 workflow depending on input.

1. Input is a list of reference IDs / GIs / organism names / GB annotations files.

In this case, the resulting intermediate files and orthogroups.tsv will be the same as if you had run scenario_1.py on the larger input data.

2. Input is assemblies or proteomes generated by Prodigal (in the case of assemblies, proteomes will be generated automatically with Prodigal). After running scenario_2 on this input, orthogroups.tsv is generated. Then each orthogroup is processed based on its kind:

1. A group that contains only annotated genes is not processed any further.

2. A group that contains both annotated and unknown genes is not processed either, since the unknown genes can possibly be curated manually based on the annotated ones.

3. If a group contains only unknown genes, it will be saved into a fasta file inside the blasted_singletones directory. Then, for each group, one of the proteins will be blasted against the public NCBI database (a local database can also be provided with the --blastdb option; otherwise, an internet connection will be used).

For each group, an XML file with blast results will be generated; the best hits will be printed to output.
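
The three kinds of groups can be distinguished with a few lines of Python. The sketch below is an illustration under stated assumptions, not the tool's implementation: the representation of a group as a list of (gene id, is_annotated) pairs and both function names are hypothetical.

```python
def classify_orthogroup(genes):
    """genes: list of (gene_id, is_annotated) pairs for one orthogroup."""
    flags = [is_annotated for _, is_annotated in genes]
    if not flags:
        raise ValueError("an orthogroup must contain at least one gene")
    if all(flags):
        return "annotated"  # kind 1: nothing more to do
    if any(flags):
        return "mixed"      # kind 2: left for manual curation
    return "unknown"        # kind 3: one protein gets blasted


def groups_to_blast(orthogroups):
    """Return the ids of groups that contain only unknown genes."""
    return [group_id for group_id, genes in sorted(orthogroups.items())
            if classify_orthogroup(genes) == "unknown"]


groups = {
    "g1": [("a", True), ("b", True)],
    "g2": [("c", True), ("d", False)],
    "g3": [("e", False)],
}
print(groups_to_blast(groups))
```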

Usage examples:

Appending an additional list of files (fasta, gb) to existing scenario 1 output:

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_prots --proteomes test_input/new_proteins

You can pass assemblies instead; in this case they will be automatically annotated with Prodigal.

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_assemblies --assemblies test_input/assemblies

Or a list of reference ids (accession numbers or GIs):

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt

The existing directory must contain an intermediate subdirectory with a blasted.tsv file and the proteomes folder from a scenario_1 run.
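
Before launching a long run, the layout of the existing directory can be checked with a short Python sketch like the one below. The helper name is hypothetical, and placing the proteomes folder at the top level of the scenario_1 output (next to intermediate) is an assumption based on the description of scenario 1 above.

```python
import os


def missing_scenario_1_outputs(s1_dir):
    """Return the expected paths that are absent from a scenario_1 output
    directory (assumed layout: intermediate/blasted.tsv and proteomes/)."""
    required = [
        os.path.join(s1_dir, "intermediate", "blasted.tsv"),
        os.path.join(s1_dir, "proteomes"),
    ]
    return [path for path in required if not os.path.exists(path)]


missing = missing_scenario_1_outputs("test_proteomes")
if missing:
    print("Not a usable scenario_1 output, missing: " + ", ".join(missing))
```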

The last step is blasting new genes that didn't match any group against the NCBI database. By default, the remote database is used, but you may prefer to use a local one with the --blastdb option. On chara:

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt --blastdb /gpfs/group/infection_translation/orthoMCL/app/refseq-proteins/refseq_protein

Starting from a step

There is an optional command-line argument --start-from. It is used to skip several steps of the pipeline and run right from the step specified. You can take step names from log.txt in the results folder.

./scenario_1.py -o output_test_proteomes --start-from "Parsing blast results"

./scenario_1.py -o output_test_proteomes --start-from 7

scenario_1.py steps:

1. Preparing proteomes and annotations

2. Filtering proteomes

3. Making blast database

4. Blasting

5. Parsing blast results

6. Cleaning database

7. Installing schema

8. Loading blast results into the database

9. Finding pairs

10. Dump pairs files

11. MCL

12. Saving orthogroups

scenario_2.py steps:

1. Preparing imput

2. Filtering new proteomes

3. Filtering proteomes

4. Making blast database

5. Blasting

6. Parsing blast results

7. Cleaning database

8. Installing schema

9. Loading blast results into the database

10. Finding pairs

11. Dumping pairs files

12. MCL

13. Saving orthogroups

14. Blasting singletones
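
How a --start-from value could be mapped onto one of the lists above is sketched below in Python. This is not the tool's code: the function name is hypothetical, and both the 1-based numbering (matching the lists above) and the case-insensitive name matching are assumptions.

```python
SCENARIO_1_STEPS = [
    "Preparing proteomes and annotations",
    "Filtering proteomes",
    "Making blast database",
    "Blasting",
    "Parsing blast results",
    "Cleaning database",
    "Installing schema",
    "Loading blast results into the database",
    "Finding pairs",
    "Dump pairs files",
    "MCL",
    "Saving orthogroups",
]


def resolve_start_step(value, steps):
    """Return the 0-based index of the step to start from.
    `value` is either a 1-based step number or a step name."""
    value = value.strip()
    if value.isdigit():
        number = int(value)
        if not 1 <= number <= len(steps):
            raise ValueError("step number out of range: %d" % number)
        return number - 1
    lowered = [step.lower() for step in steps]
    if value.lower() not in lowered:
        raise ValueError("unknown step: %r" % value)
    return lowered.index(value.lower())


start = resolve_start_step("Parsing blast results", SCENARIO_1_STEPS)
print(SCENARIO_1_STEPS[start:])  # the steps that would still be run
```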

Fine tuning

--prot-id-field Index of the field in the fasta id that holds the protein id. Fields are separated by either a bar or a space. For example, with --prot-id-field 1, a fasta id like >NC_005816.1|NP_995567.1 yields the protein id NP_995567.1

--min-length Minimum allowed length of proteins (default: 10)

--max-percent_stop Maximum percentage of stop codons (default: 20)

--evalue Blast e-value (default: 1e-5)

-t Number of threads (default: 30)

-w Overwrite output directory if it exists.
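
The effect of --prot-id-field, --min-length and --max-percent_stop can be illustrated with the Python sketch below. It is not the tool's implementation: the function names are hypothetical, the 0-based field numbering is inferred from the example above, and representing stop codons as "*" in protein sequences is an assumption.

```python
import re


def protein_id(fasta_header, field):
    """Split a fasta id on bars or spaces and return the chosen field
    (0-based, as inferred from the --prot-id-field example)."""
    fields = re.split(r"[| ]", fasta_header.lstrip(">").strip())
    return fields[field]


def passes_filter(sequence, min_length=10, max_percent_stop=20):
    """Keep a protein that is long enough and has few enough stop codons."""
    if not sequence or len(sequence) < min_length:
        return False
    percent_stop = 100.0 * sequence.count("*") / len(sequence)
    return percent_stop <= max_percent_stop


print(protein_id(">NC_005816.1|NP_995567.1", 1))
print(passes_filter("MKTAYIAKQRQ"))
```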

System Requirements

The tool needs the following software installed on your system:

python 2.7
blast
mysql
mysql perl modules (note that src/mysql.cnf is a path from the root of the tool (az_orthofinder/src/mysql.cnf)):

$ perl -MCPAN -e shell

cpan> o conf makepl_arg "mysql_config=src/mysql.cnf"

cpan> install Data::Dumper

cpan> install DBI

cpan> force install DBD::mysql

The tool also requires a MySQL user orthomcl with password 1234, and a database orthomcl with all privileges granted to that user. It can be achieved in the following way:

$ mysqld --port=3307 & (only needed if the mysql server is not running)

$ mysql -u root -p

mysql> CREATE DATABASE orthomcl;

mysql> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE VIEW, CREATE, INDEX, DROP on orthomcl.* TO orthomcl@localhost;

mysql> set password for orthomcl@localhost = password("1234");

If you have any problems setting up or running the mysql server, please let us know.

Galaxy wrapper for SPAdes 2.5.1 is released

Thanks to our user Lionel Guy, it is now possible to integrate SPAdes into Galaxy pipelines seamlessly. The wrapper is available on the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/lionelguy/spades

SPAdes 2.5.1 is released

We are happy to announce that version 2.5.1 of the SPAdes single-cell assembler has been released.

This version contains mostly minor improvements and fixes:

more user-friendly error reports,
fewer misassemblies on single-cell data sets with low covered genome fraction,
decreased memory consumption during the error correction stage.

The release also contains new features:

running SPAdes from checkpoints: one can restart the assembler after a crash without rerunning the finished steps,
automatic k-mer size selection for standard data sets using the maximal read length.

You can download SPAdes 2.5.1 here.

Pavel Pevzner finished the first lecture on Bioinformatics Algorithms for Coursera

In St. Petersburg, Pavel Pevzner finished his first lecture for Coursera's Bioinformatics Algorithms course.

The course will cover common algorithms underlying the fundamental topics in bioinformatics: genome assembly, comparing DNA and protein sequences, finding genes and regulatory motifs, analyzing gene expression, constructing evolutionary trees, analyzing genome rearrangements, and identifying proteins.

The instructors are the Lab's leading scientist Pavel Pevzner, Phillip E. C. Compeau, and Nikolay Vyahhi, the creator of Rosalind.