QUAST 2.3 released

Long-awaited contig alignment plots (see an example below), updated misassemblies detection logic, full report in PDF format, and many other features included!

See Changes for a full list of new features and fixed bugs.

See the new version of the Manual, which includes descriptions of the new options and reports and an FAQ section.

All other news and useful links are presented on the QUAST page.
 
You can download QUAST 2.3 and previous versions here.

Clone of SPAdes Genome Assembler (version 20.01.2014)

 

 

SPAdes 3.0 is out! 

Now with support for IonTorrent, PacBio, module for highly polymorphic diploid genomes and many other new features!

See all changes in changelog

SPAdes Assembler

 

SPAdes manual with installation guide (ver 3.0)

dipSPAdes manual

Download SPAdes

Assembling long Illumina paired-end reads (2x150 and 2x250) application note

SPAdes on GAGE-B data sets benchmark

Benchmark for other data sets

Support e-mail: spades.support@bioinf.spbau.ru

 

 

 

For the benchmarks we used:

E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes. S. aureus USA300 FPR3757 (chromosome and three plasmids) reference length is 2917469 bp with 2622 annotated genes.

Only contigs of 500 bp and longer were taken into consideration. Tables were obtained using QUAST 2.3.

 

Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
Single-cell E. coli                  
A5 14399 745 101584 4441145 8 12.01 0.17 89.880 3444
ABySS 68534 179 178720 4345617 6 3.32 1.68 88.268 3704
CLC 32506 503 113285 4656964 2 5.53 1.42 92.291 3768
EULER-SR 26662 429 140518 4248713 17 10.87 35.67 84.898 3416
Ray 45448 361 210820 4379139 17 6.29 2.83 88.372 3636
SOAPdenovo 1540 1166 51517 2958144 1 1.87 0.11 57.672 1766
Velvet 22648 261 132865 3501984 2 2.19 1.23 73.765 3080
E+V-SC 32051 344 132865 4540286 2 2.35 0.73 91.744 3771
IDBA-UD contigs 98306 244 284464 4814043 8 5.09 0.27 95.210 4045
IDBA-UD scaffolds 109057 229 284464 4813609 8 5.14 0.77 95.199 4052
SPAdes2.5 contigs 110081 240 268493 4797724 1 3.52 0.64 94.926 4037
SPAdes2.5 scaffolds 112393 234 268493 4799671 1 4.36 0.79 94.948 4042
                   
Isolate E. coli                  
A5 43651 176 181690 4551797 0 0.40 0.11 98.017 4163
ABySS 106155 96 221861 4619631 2 3.77 0.41 98.974 4241
CLC 86964 112 221549 4550314 1 1.96 0.33 98.094 4205
EULER-SR 110153 100 221409 4574240 8 3.16 10.33 98.102 4192
Ray 86246 98 221942 4634429 2 2.14 0.09 96.903 4136
SOAPdenovo 49626 181 165487 4535469 0 0.15 0.11 97.696 4132
Velvet 82776 120 242032 4554702 3 2.57 0.37 98.175 4196
E+V-SC 54856 171 166115 4539639 0 1.30 0.15 97.795 4134
IDBA-UD contigs 106844 110 221687 4565529 3 3.40 0.31 98.331 4206
IDBA-UD scaffolds 133098 93 284363 4565454 4 4.08 0.61 98.355 4216
SPAdes2.5 contigs 133088 92 285414 4558033 0 2.17 0.33 98.137 4208
SPAdes2.5 scaffolds 133309 90 285414 4558337 0 2.59 0.42 98.156 4212
                   
                   
Single-cell S. aureus                  
A5 4829 937 41828 2770402 9 24.63 0.37 91.581 1815
ABySS 43173 185 175286 2899223 4 6.49 0.46 96.578 2456
EULER-SR 7247 750 66549 2988161 42 21.85 13.76 94.395 2008
Ray 62026 84 125177 2947717 13 2.29 0.96 92.936 2412
SOAPdenovo 510 1047 27317 1473402 0 1.32 0.29 46.717 595
Velvet 15656 347 67677 2746768 3 4.41 4.49 93.181 2274
E+V-SC 32296 215 107657 2932416 6 6.92 5.03 97.437 2477
IDBA-UD contigs 87549 114 175236 2996997 7 2.43 0.66 98.583 2567
IDBA-UD scaffolds 111392 99 210360 2996115 7 2.50 1.35 98.606 2573
SPAdes2.5 contigs 148260 101 284175 2996547 4 4.23 1.02 98.726 2544
SPAdes2.5 scaffolds 159252 99 429536 2997079 4 4.72 1.09 98.744 2544
 

 


A5 and CLC 3.22.55708 were run with default parameters. ABySS 1.3.5, EULER-SR 2.0.1, Ray 2.2.0, SOAPdenovo 2.04, Velvet 1.2.07, and E+V-SC were run with vertex size 55.
IDBA-UD 1.1.0 was run in its default iterative mode.
 
The total assembly size may increase (and in some cases exceeds the genome size) due to contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The percentage of the E. coli and S. aureus genomes covered filters out these issues (GF (%), Genome fraction (%) column).
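As a rough illustration of why genome fraction is not inflated by duplicated sequence, the sketch below merges the aligned reference intervals and counts each reference base only once. The function name, the half-open interval convention, and the toy numbers are assumptions made for this example; QUAST's own computation works on filtered MUMmer alignments and is more involved.

```python
def genome_fraction(alignments, genome_length):
    """Percent of reference bases covered by at least one alignment.

    alignments: iterable of (start, end) reference coordinates, half-open.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(alignments):
        if current_end is None or start > current_end:
            # Disjoint from the running interval: close it and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return 100.0 * covered / genome_length


# Toy example: two overlapping contigs and one separate contig
# aligned to a hypothetical 1000 bp reference.
toy_alignments = [(0, 400), (300, 600), (800, 900)]
print(genome_fraction(toy_alignments, 1000))  # 70.0
```

The three toy alignments sum to 800 bp, but only 700 distinct reference bases are covered, so the overlap does not raise the genome fraction.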
 
The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size. 
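The difference between N50 and NG50 can be sketched in a few lines. This is a minimal illustration with made-up contig lengths; the helper names are chosen for this example and are not taken from QUAST.

```python
def nx50(contig_lengths, target_length):
    """Length of the contig at which the cumulative length (longest first)
    reaches half of target_length; None if it is never reached."""
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if 2 * total >= target_length:
            return length
    return None


def n50(contig_lengths):
    # N50 measures against the assembly size.
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_length):
    # NG50 measures against the reference genome size.
    return nx50(contig_lengths, genome_length)


lengths = [50, 40, 30, 20, 10]  # toy contigs, total length 150
print(n50(lengths))             # 40
print(ng50(lengths, 200))       # 30
```

Because NG50 uses the same denominator for every assembler, an assembly that covers only a small part of the genome cannot obtain a high value simply by being short.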
 
Misassemblies (MA) are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference.
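The 1 kb rule can be expressed as a simple check on two adjacent flanking alignments. The sketch below is a simplification under stated assumptions: both flanks map to the same reference sequence and the same strand, and the `Alignment` type and `is_misassembly` function are hypothetical names used only here.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    ref_start: int  # reference coordinate where the flank's alignment starts
    ref_end: int    # reference coordinate where the flank's alignment ends


MAX_GAP = 1000  # the 1 kb threshold from the definition above


def is_misassembly(left, right, max_gap=MAX_GAP):
    """True if the right flank starts more than max_gap bases away from
    the end of the left flank on the reference."""
    return abs(right.ref_start - left.ref_end) > max_gap


left_flank = Alignment(0, 5000)
print(is_misassembly(left_flank, Alignment(5200, 9000)))    # False
print(is_misassembly(left_flank, Alignment(20000, 24000)))  # True
```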
 
Mismatch (substitution) error rate (MM) and number of indels (IND) per 100 kbp are measured in aligned regions of the contigs. 
 
 
In each column, the best assembler by that criterion is indicated in bold.
 
 

Related publications

  • S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, A. Lapidus, A. D. Prjibelsky, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, J. S. McLean, R. Lasken, S. R. Clingenpeel, T. Woyke, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products. Journal of Computational Biology 20(10) (2013), 714-737. doi:10.1089/cmb.2013.0084

  • Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021

  • Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

  • Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20

  • Sergey I. Nikolenko, Anton I. Korobeynikov, and Max A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

 

 


 
“I'd like to thank you for the great job you are doing with SPAdes. It's a very useful software!”
Lionel Guy
Uppsala University, Sweden
 
“Thanks for your great SPAdes assembler, we have successfully assembled several cultured organims and your assembler always performed best compared to other assemblers when run on the PE- and/or MP MiSeq data we generally use.”
Dr. Harald R. Gruber-Vodicka
Symbiosis Group
Max Planck Institute of Marine Microbiology, Bremen, Germany
 
“We are also getting good results with SPAdes for metagenomic samples, thanks to its effort to recover as much genomic sequence as it can.”
Amr Abouelleil
Bioinformatics Assembly Analyst at Broad Institute
 
“I have recently used SPAdes to assembly reads generated on an Illumina platform (2 x 250 bp). The assemblies look very good!”
Mark de Been
Department of Medical Microbiology
University Medical Center Utrecht (UMCU) The Netherlands

 

 

Acknowledgements

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

SPAdes Genome Assembler

 

 

SPAdes 3.0 is out! 

Now with support for IonTorrent, PacBio, module for highly polymorphic diploid genomes and many other new features!

See all changes in changelog

SPAdes Assembler

 

SPAdes manual with installation guide (ver 3.0)

dipSPAdes manual

Download SPAdes

Assembling long Illumina paired-end reads (2x150 and 2x250) application note

SPAdes on GAGE-B data sets benchmark

Benchmark for other data sets

Support e-mail: spades.support@bioinf.spbau.ru

 

 

 

 

For the benchmarks we used:

E. coli K-12 MG1655 reference length is 4639675 bp with 4324 annotated genes. S. aureus USA300 FPR3757 (chromosome and three plasmids) reference length is 2917469 bp with 2622 annotated genes.

Only contigs of 500 bp and longer were taken into consideration. Tables were obtained using QUAST 2.3.

Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
Single-cell E. coli                  
A5 14399 745 101584 4441145 8 12.01 0.14 89.880 3444
ABySS 68534 179 178720 4345617 6 3.32 0.81 88.268 3704
CLC 32506 503 113285 4656964 2 5.53 0.91 92.291 3768
EULER-SR 26662 429 140518 4248713 19 10.87 19.40 84.898 3416
Ray 45448 361 210820 4379139 17 6.22 1.29 88.372 3636
SOAPdenovo 1540 1166 51517 2958144 1 1.87 0.11 57.672 1766
Velvet 22648 261 132865 3501984 2 2.19 1.20 73.765 3080
E+V-SC 32051 344 132865 4540286 2 2.33 0.68 91.744 3771
IDBA-UD contigs 98306 244 284464 4814043 8 5.09 0.25 95.210 4045
IDBA-UD scaffolds 109057 229 284464 4813609 8 5.14 0.72 95.199 4052
SPAdes3.0 contigs 110081 240 268493 4798198 1 3.54 0.64 94.940 4038
SPAdes3.0 scaffolds 112393 234 268493 4800145 1 4.34 0.79 94.962 4043
                   
Isolate E. coli                  
A5 43651 176 181690 4551797 0 0.40 0.09 98.017 4163
ABySS 106155 96 221861 4619631 2 3.77 0.39 98.974 4241
CLC 86964 112 221549 4550314 1 1.96 0.29 98.094 4205
EULER-SR 110153 100 221409 4574240 9 3.16 5.03 98.102 4192
Ray 86246 98 221942 4634429 2 2.14 0.09 96.903 4136
SOAPdenovo 49626 181 165487 4535469 0 0.15 0.09 97.696 4132
Velvet 82776 120 242032 4554702 3 2.57 0.33 98.175 4196
E+V-SC 54856 171 166115 4539639 0 1.30 0.11 97.795 4134
IDBA-UD contigs 106844 110 221687 4565529 3 3.40 0.28 98.331 4206
IDBA-UD scaffolds 133098 93 284363 4565454 4 4.08 0.59 98.355 4216
SPAdes3.0 contigs 133088 92 285414 4558033 0 2.17 0.33 98.137 4208
SPAdes3.0 scaffolds 133309 90 285414 4558337 0 2.59 0.42 98.156 4212
                   
                   
Single-cell S. aureus                  
A5 4829 937 41828 2770402 8 24.63 0.37 91.581 1815
ABySS 43173 185 175286 2899223 4 6.49 0.43 96.578 2456
EULER-SR 7247 750 66549 2988161 46 21.85 10.67 94.436 2009
Ray 62026 84 125177 2947717 13 2.29 0.96 92.936 2412
SOAPdenovo 510 1047 27317 1473402 0 1.32 0.29 46.717 595
Velvet 15656 347 67677 2746768 3 4.41 4.27 93.181 2274
E+V-SC 32296 215 107657 2932416 5 6.92 4.89 97.519 2478
IDBA-UD contigs 87549 114 175236 2996997 7 2.43 0.66 98.655 2568
IDBA-UD scaffolds 111392 99 210360 2996115 7 2.50 1.35 98.678 2574
SPAdes3.0 contigs 148260 101 284175 2996547 4 4.14 1.01 98.596 2579
SPAdes3.0 scaffolds 159252 99 429536 2997079 4 4.62 1.08 98.614 2579

A5 and CLC 3.22.55708 were run with default parameters. ABySS 1.3.5, EULER-SR 2.0.1, Ray 2.2.0, SOAPdenovo 2.04, Velvet 1.2.07, and E+V-SC were run with vertex size 55. IDBA-UD 1.1.0 was run in its default iterative mode.

The total assembly size may increase (and in some cases exceeds the genome size) due to contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The percentage of the E. coli and S. aureus genomes covered filters out these issues (GF (%), Genome fraction (%) column).
The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size. 
Misassemblies (MA) are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference.
Mismatch (substitution) error rate (MM) and number of indels (IND) per 100 kbp are measured in aligned regions of the contigs. 
In each column, the best assembler by that criterion is indicated in bold.
 
SPAdes 3.0 hybrid assemblies benchmarking on Illumina + PacBio E. coli data sets.
Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
E. coli K-12 Illumina only                  
SPAdes 3.0 contigs 133088 92 285414 4558033 0 2.17 0.33 98.137 4208
E. coli K-12 Illumina + PacBio P4                  
SPAdes 3.0 contigs 4647797 5 4647797 4650744 0 (6*) 8.71 0.71 99.999 4322
SPAdes 3.0 scaffolds 4647797 5 4647797 4650744 0 (6*) 8.71 0.71 99.999 4322
* Misassemblies are not real and correspond to the difference with respect to the reference
 
For the benchmarks we used:
  • E. coli K-12 MG1655 Illumina standard isolate dataset outlined above
  • E. coli K-12 MG1655 PacBio RS II C2/P4 dataset available from PacBio DevNet
 
SPAdes 3.0 experimental IonTorrent benchmarking on E. coli data sets.
Assembly NG50 # contigs Largest Total length MA MM IND GF (%) # genes
E. coli DH10B (R17-67)                  
SPAdes 3.0 contigs 97052 109 326325 4495193 2 1.69 8.62 95.840 4142
SPAdes 3.0 scaffolds 97052 108 326325 4495961 3 1.69 8.62 95.857 4144
E. coli O157:H7 (BEA-1108)
SPAdes 3.0 contigs 145024 220 316929 5395996 2 7.89 3.51 96.314 N/A
SPAdes 3.0 scaffolds 145024 219 316929 5396398 2 8.30 3.58 96.310 N/A

For the benchmarks we used:

  • E. coli DH10B (R17-67) dataset, sequenced on a 318v2 chip and available on IonCommunity
  • E. coli O157:H7 Sakai (EHEC) (BEA-1108) dataset, sequenced on a 314 chip with HiQ enzyme and available on IonCommunity

 

Related publications

  • S. Nurk, A. Bankevich, D. Antipov, A. A. Gurevich, A. Korobeynikov, A. Lapidus, A. D. Prjibelsky, A. Pyshkin, A. Sirotkin, Y. Sirotkin, R. Stepanauskas, J. S. McLean, R. Lasken, S. R. Clingenpeel, T. Woyke, G. Tesler, M. A. Alekseyev, and P. A. Pevzner. Assembling Single-Cell Genomes and Mini-Metagenomes From Chimeric MDA Products. Journal of Computational Biology 20(10) (2013), 714-737. doi:10.1089/cmb.2013.0084

  • Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021

  • Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

  • Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20

  • Sergey I. Nikolenko, Anton I. Korobeynikov, and Max A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

  • Alexey Gurevich, Vladislav Saveliev, Nikolay Vyahhi, and Glenn Tesler. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8), pp. 1072-1075. doi:10.1093/bioinformatics/btt086

 


 

 

 

Acknowledgements

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

SPAdes 3.0 on GAGE-B data sets

The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.

Since SPAdes 3.0 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Since these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.

The original coverage of those data sets is about 500x. However, in the GAGE-B experiments, all data was down-sampled to 100x coverage, because higher coverage barely affected contig size. Meanwhile, SPAdes benefits from high coverage, so we decided to assemble the data sets with the original ~500x coverage. Our tables contain the GAGE-B assemblies of 100x-coverage data, and the SPAdes 3.0 assembly of 500x-coverage data.
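For readers who want to reproduce a comparable down-sampling, the arithmetic is straightforward: coverage is the total number of sequenced bases divided by the genome size, and going from ~500x to 100x means keeping roughly one read pair in five. The sketch below is only an illustration under assumed numbers (read count, genome size, fixed seed); it is not the procedure used in the GAGE-B study, which is not described here.

```python
import random


def coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: sequenced bases per reference base."""
    return num_reads * read_length / genome_size


def downsample(read_pairs, original_cov, target_cov, seed=0):
    """Keep each read pair with probability target_cov / original_cov."""
    keep_probability = min(1.0, target_cov / original_cov)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return [pair for pair in read_pairs if rng.random() < keep_probability]


# Hypothetical data set: 10 million 250 bp reads on a 5 Mbp genome.
print(coverage(10_000_000, 250, 5_000_000))  # 500.0

# Stand-in for real read pairs: keep about 20% of 10,000 items.
subset = downsample(list(range(10_000)), original_cov=500, target_cov=100)
print(len(subset))
```

Sampling whole pairs rather than individual reads keeps the paired-end information intact.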

The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.

Four tables of results are presented below. For our tables, we used the format that was presented in the Supplementary Material of Magoc, et al., 2013. We used the QUality ASsessment Tool (QUAST) to calculate the same metrics used in the GAGE-B paper. The GAGE-B paper used slightly different names than QUAST for some metrics; below, for each metric, we list the GAGE-B name, and indicate the QUAST name in brackets.
  1. Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]

  2. N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]

  3. Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]

  4. Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]

  5. N50Corr, corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]

  6. GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]

  7. Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]

  8. Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
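The corrected N50 from item 5 can be sketched as follows: cut every contig at its error positions, then compute the N50 of the resulting pieces against the genome size. The function names and toy numbers are assumptions for illustration, and this follows the GAGE-B description above rather than QUAST's implementation, which derives NGA50 from aligned blocks.

```python
def split_at_errors(contig_length, breakpoints):
    """Lengths of the pieces obtained by cutting a contig at each breakpoint."""
    pieces = []
    previous = 0
    for position in sorted(breakpoints):
        pieces.append(position - previous)
        previous = position
    pieces.append(contig_length - previous)
    return pieces


def corrected_n50(contigs, genome_length):
    """contigs: list of (length, [error positions]) pairs.

    Returns the length of the piece at which the cumulative length
    (longest first) reaches half of genome_length, or None.
    """
    pieces = []
    for length, breakpoints in contigs:
        pieces.extend(split_at_errors(length, breakpoints))
    total = 0
    for piece in sorted(pieces, reverse=True):
        total += piece
        if 2 * total >= genome_length:
            return piece
    return None


# Toy assembly on a hypothetical 1000 bp genome; the longest contig
# contains one error at position 250.
toy_contigs = [(600, [250]), (300, []), (100, [])]
print(corrected_n50(toy_contigs, 1000))  # 300
```

Without the split, the 600 bp contig alone reaches half the genome, so the uncorrected value would be 600; one error halves the statistic in this toy case.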

It is important to note that Magoc, et al., 2013 used QUAST 1.3 for assessing quality of the assemblies. We used the latest version of QUAST, 2.3, so some statistics in the tables may slightly differ from the ones in the GAGE-B Supplementary Material. The main difference between these versions is in computing Genome Fraction. QUAST 2.* filters MUMmer's alignments to keep only the best ones. Roughly speaking, it skips ambiguous and redundant alignments to keep one alignment (or one set of non-overlapping or slightly-overlapping alignments in case of a misassembly) per contig. QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST were fixed, which affect detection of misassemblies (and thus, the Errors, Errors-L, and N50Corr statistics). See QUAST changelog for more details.

Finished references and MiSeq reads that have been used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) correspond to exactly the same strains of each microorganism. References used for M. abscessus and V. cholerae, however, belong to similar, but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.

Click on the "Contigs" or "Scaffolds" links on the left side of each table to see the QUAST-generated web report.

 

Table 1. Assemblies of B. cereus (download contigs, scaffolds)

    ABySS CABOG MaSuRCA MIRA SGA SOAPdenovo SPAdes 3.0 Velvet
Contigs Num 115 78 90 153 3335 105 53 404
  N50 (kb) 130.6 155.4 246.7 116.5 25.5 246.3 286.8 24.5
  Errors 2 5 9 9 17 0 1 3
  Errors-L 25 6 11 14 9 20 10 11
  N50Corr (kb) 130.6 150.5 246.7 100.0 25.5 246.3 286.8 24.5
  GenFrac (%) 98.6 99.3 99.2 99.2 98.9 98.3 98.8 97.8
  Unaligned 1 0 0 4 4 1 1 1
  Duplication 1.0 1.0 1.0 1.0 1.1 1.0 1.0 1.0
                   
Scaffolds Num 74 33 61 n/a 341 56 41 78
  N50 (kb) 135.6 431.5 337.9 n/a 25.5 456.6 775.7 247.7
  Errors 3 9 12 n/a 1 0 2 11
  Errors-L 29 13 13 n/a 1 39 11 258
  N50Corr (kb) 135.3 364.2 337.9 n/a 25.5 456.0 286.8 208.4
  GenFrac (%) 98.4 99.3 99.2 n/a 97.6 98.3 98.7 97.7
  Unaligned 0 0 0 n/a 0 1 0 1
  Duplication 1.0 1.0 1.0 n/a 1.0 1.0 1.0 1.0

 

 

Table 2. Assemblies of R. sphaeroides (download contigs, scaffolds)

    ABySS CABOG MaSuRCA MIRA SGA SOAPdenovo SPAdes 3.0 Velvet
Contigs Num 486 146 63 867 986 437 89 416
  N50 (kb) 21.4 31.5 130.7 15.8 9.1 33.5 551.2 24.0
  Errors 1 6 5 18 4 1 3 2
  Errors-L 3 3 4 6 3 11 5 9
  N50Corr (kb) 21.4 30.4 130.7 15.4 9.1 33.5 518.3 24.0
  GenFrac (%) 98.4 85.6 92.0 99.3 98.9 98.3 99.5 97.9
  Unaligned 0 0 1 0 3 19 48 1
  Duplication 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
                   
Scaffolds Num 382 131 52 n/a 733 185 39 143
  N50 (kb) 21.4 40.3 144.8 n/a 8.0 45.1 551.2 85.3
  Errors 1 6 5 n/a 0 4 2 19
  Errors-L 3 7 7 n/a 2 214 5 185
  N50Corr (kb) 21.4 36.1 144.8 n/a 8.0 45.0 518.3 85.0
  GenFrac (%) 97.8 85.6 91.9 n/a 88.2 98.2 99.6 97.6
  Unaligned 0 0 0 n/a 0 0 1 0
  Duplication 1.0 1.0 1.0 n/a 1.0 1.0 1.0 1.0

 

 

Table 3. Assemblies of M. abscessus (download contigs, scaffolds)

    ABySS CABOG MaSuRCA MIRA SGA SOAPdenovo SPAdes 3.0 Velvet
Contigs Num 210 857 326 1760 1117 113 890 279
  N50 (kb) 70.4 8.7 38.2 114.1 13.3 131.6 335.3 48.2
  Errors 2 122 70 2358 180 5 12 76
  Errors-L 2 5 2 35 4 19 6 3
  N50Corr (kb) 68.5 8.3 37.2 75.0 12.8 113.3 303.8 41.5
  GenFrac (%) 99.2 96.2 98.4 99.4 99.4 99.2 99.4 99.1
  Unaligned 11 5 1 78 8 2 844 52
  Duplication 1.0 1.0 1.1 1.2 1.0 1.0 1.0 1.0
                   
Scaffolds Num 147 847 324 n/a 664 79 404 154
  N50 (kb) 73.2 9.1 38.2 n/a 13.3 152.6 335.3 71.0
  Errors 2 131 70 n/a 6 5 12 120
  Errors-L 3 5 2 n/a 1 31 6 19
  N50Corr (kb) 70.1 8.5 37.2 n/a 12.8 147.2 303.8 46.0
  GenFrac (%) 98.9 96.2 98.4 n/a 99.1 99.1 99.4 99.0
  Unaligned 0 5 1 n/a 4 1 363 1
  Duplication 1.0 1.0 1.1 n/a 1.0 1.0 1.0 1.0

 

 

Table 4. Assemblies of V. cholerae (download contigs, scaffolds)

    ABySS CABOG MaSuRCA MIRA SGA SOAPdenovo SPAdes 3.0 Velvet
Contigs Num 267 241 173 431 1726 244 1798 201
  N50 (kb) 60.5 32.8 76.1 112.9 27.3 71.4 355.7 92.0
  Errors 2 17 19 106 77 16 9 14
  Errors-L 0 7 3 12 3 35 7 2
  N50Corr (kb) 60.3 32.8 76.1 108.7 27.3 65.5 355.7 63.6
  GenFrac (%) 97.2 97.0 97.7 98.4 98.3 97.4 98.0 97.8
  Unaligned 2 1 0 21 6 5 1712 1
  Duplication 1.0 1.0 1.0 1.0 1.1 1.0 1.0 1.0
                   
Scaffolds Num 196 241 163 n/a 309 165 932 138
  N50 (kb) 60.5 32.8 76.1 n/a 27.3 91.9 355.7 110.0
  Errors 2 17 19 n/a 2 17 8 27
  Errors-L 0 7 3 n/a 1 70 6 8
  N50Corr (kb) 60.3 32.8 76.1 n/a 27.3 89.8 355.7 63.6
  GenFrac (%) 96.7 97.0 97.7 n/a 95.7 97.1 97.9 97.6
  Unaligned 1 1 0 n/a 0 2 874 1
  Duplication 1.0 1.0 1.0 n/a 1.0 1.0 1.0 1.0

 


AZ Orthofinder

Download

Repository: https://github.com/vladsaveliev/az_orthofinder

 

 

Installation

Just extract the archive. You will find scenario_1.py, scenario_2.py and test_input inside the extracted folder.
Note: you will need some third-party software to be installed on your system for running the tool. See section System Requirements for details.

 

Scenario 1

The scenario_1.py script initializes a database of orthologous groups. It generates the resulting orthogroups.tsv and, for further extension, also produces the following intermediate results for the second scenario: 
— the proteomes directory with correctly adjusted proteomes, 
— the annotations directory with GB files from NCBI, 
— and the intermediate/blasted.tsv file.
 
Usage examples:
1. Fasta-proteins (optionally with annotations from prodigal). See example in test_input/proteins. Filenames will be taken as taxon codes. Uses the Internet to download GB annotations; if the Internet is off, a short version of the output will be produced, containing only taxon|protein ids.

./scenario_1.py --proteomes test_input/proteins -o output_test_proteomes

2. GB-annotations (Example: test_input/gbs).

./scenario_1.py --gbs test_input/gbs -o output_test_gbs

3. A list of reference ids. (Example: test_input/ids.txt). The tool will download references from Genbank. This step requires an Internet connection.

./scenario_1.py -i test_input/ids.txt -o output_test_ids
 
4. A list of species names (Example: test_input/species.txt). In this case, the tool will search the NCBI server using the following query (assuming the species name is Escherichia coli):
Note that you need to specify full species name like Escherichia coli (not E. coli).
 
In particular, we processed E. coli and K. pneumoniae this way. I used the following files with species lists (see the test_input directory):
kpneumonia_list.txt with the following single line inside:
Klebsiella pneumoniae
The command line I used: 

./scenario_1.py --species test_input/kpneumonia_list.txt -o kpneumonia

 
ecoli_list.txt containing the following:
Escherichia coli
Command line: 
./scenario_1.py --species test_input/ecoli_list.txt -o ecoli
 
 

Scenario 2

The scenario_2.py script is meant to extend orthogroups. It must be run on a directory produced by the scenario_1.py script, since it extends its blasted.tsv and also reuses the proteomes and GB annotations.
 
The reason we use a file with blast results is that the all-against-all blast process is the most time-consuming step. In principle, the blast results could be stored between runs in a database table; nevertheless, we decided not to rely on SQL here, because we think a file is clearer for users:
1. The users don't have to remember the names of their tables, so they won't overwrite or damage something important;
2. The database can be cleaned up after any tool usage;
3. The results are easier to send between computers.
 
There are 2 possible types of scenario_2 workflow depending on input.
1. Input is a list of reference IDs / GIs / organism names / GB annotations files.
In this case, the resulting intermediate files and orthogroups.tsv will be the same as if you had run scenario_1.py on the larger input data.
 
2. Input is assemblies or proteomes generated by Prodigal (in case of assemblies, proteomes will be generated automatically with Prodigal). After running scenario_2 on this input, orthogroups.tsv is generated. Then each orthogroup is processed based on its kind:
1. A group that contains only annotated genes is not processed any further.
2. A group that contains both annotated and unknown genes is not processed either, since the unknown genes can possibly be curated manually based on the annotated ones.
3. If a group contains only unknown genes, it will be saved into a fasta file inside the blasted_singletones directory. Then, for each group, one of the proteins will be blasted against the public NCBI database (a local database can be also provided with the --blast-db option; otherwise, an internet connection will be used).
For each group, an XML file with blast results will be generated; the best hits will be printed to output.
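The three kinds of groups described above can be summarized in a small sketch. It is not taken from scenario_2.py: the function name, the (gene_id, is_annotated) representation, and the choice of the first protein as the one to blast are all assumptions made for illustration.

```python
def classify_group(genes):
    """Classify one orthogroup.

    genes: non-empty list of (gene_id, is_annotated) pairs.
    Returns "annotated", "mixed" or "unknown".
    """
    if not genes:
        raise ValueError("an orthogroup must contain at least one gene")
    annotated = [gene_id for gene_id, known in genes if known]
    if len(annotated) == len(genes):
        return "annotated"  # kind 1: not processed any further
    if annotated:
        return "mixed"      # kind 2: left for manual curation
    return "unknown"        # kind 3: saved to fasta and blasted


groups = {
    "group_1": [("geneA", True), ("geneB", True)],
    "group_2": [("geneC", True), ("geneD", False)],
    "group_3": [("geneE", False), ("geneF", False)],
}

for name, genes in groups.items():
    kind = classify_group(genes)
    if kind == "unknown":
        # Assumption: the first protein stands in for the whole group.
        print(name, kind, "representative:", genes[0][0])
    else:
        print(name, kind)
```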
 
Usage examples:
Appending an additional list of files (fasta, gb) to an existing output after scenario 1.

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_prots --proteomes test_input/new_proteins

You can pass assemblies instead, in this case they will be automatically annotated with Prodigal.

./scenario_2.py -s1o test_proteomes -s2o test_prots_new_assemblies --assemblies test_input/assemblies

Or a list of reference ids (accession numbers or GIs):

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt

Existing directory must contain an intermediate subdirectory with a blasted.tsv file and proteomes folder from a scenario_1 run.

 

The last step is blasting new genes that didn't match any group against the NCBI database. By default, the remote database is used, but you may prefer to use a local one with the --blastdb option. On chara: 

./scenario_2.py -s1o test_ids -s2o test_ids_new_ids --ids test_input/new_ids.txt --blastdb /gpfs/group/infection_translation/orthoMCL/app/refseq-proteins/refseq_protein

 

Starting from a step

There is an optional command-line argument --start-from. It is used to skip several steps of the pipeline and start directly from the specified step. You can take step names from log.txt in the results folder.
./scenario_1.py -o output_test_proteomes --start-from "Parsing blast results"
or
./scenario_1.py -o output_test_proteomes --start-from 7
 
scenario_1.py steps:
1. Preparing proteomes and annotations
2. Filtering proteomes
3. Making blast database
4. Blasting
5. Parsing blast results
6. Cleaning database
7. Installing schema
8. Loading blast results into the database
9. Finding pairs
10. Dump pairs files
11. MCL
12. Saving orthogroups 
 
scenario_2.py steps:
1. Preparing imput
2. Filtering new proteomes
3. Filtering proteomes
4. Making blast database
5. Blasting
6. Parsing blast results
7. Cleaning database
8. Installing schema
9. Loading blast results into the database
10. Finding pairs
11. Dumping pairs files
12. MCL
13. Saving orthogroups
14. Blasting singletones
 
Fine tuning
--prot-id-field     Fields are separated by either a bar or a space. For example, with --prot-id-field 1 fasta ids like >NC_005816.1|NP_995567.1 will lead to the protein id NP_995567.1
 
--min-length        Minimum allowed length of proteins (default: 10)
--max-percent_stop  Maximum percent stop codons (default: 20)
--evalue            Blast e-value (default: 1e-5)

-t                  Threads number (default: 30)

-w                  Overwrite output directory if it exists.
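To make the --prot-id-field behaviour concrete, here is a minimal sketch of the field selection. The function is hypothetical and only mirrors the description above; the zero-based indexing is inferred from the example, where field 1 selects the second part of the header.

```python
import re


def protein_id(fasta_header, field):
    """Pick one field from a FASTA header split on bars or spaces.

    field is zero-based, so 1 selects the second field.
    """
    header = fasta_header.lstrip(">").strip()
    fields = [part for part in re.split(r"[| ]+", header) if part]
    return fields[field]


print(protein_id(">NC_005816.1|NP_995567.1", 1))  # NP_995567.1
print(protein_id(">NC_005816.1|NP_995567.1", 0))  # NC_005816.1
```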
 
 

System Requirements

The tool needs the following software installed on your system:
  • python 2.7
  • blast
  • mysql
  • mysql perl modules (note that src/mysql.cnf is a path from the root of the tool (az_orthofinder/src/mysql.cnf)): 

$ perl -MCPAN -e shell

cpan> o conf makepl_arg "mysql_config=src/mysql.cnf"

cpan> install Data::Dumper

cpan> install DBI

cpan> force install DBD::mysql

 

The tool also requires a MySQL user orthomcl with password 1234, and a database orthomcl with all privileges granted to that user. It can be achieved in the following way:
$ mysqld --port=3307 &   (in case the mysql server is not running)
$ mysql -u root -p
mysql> CREATE DATABASE orthomcl;
mysql> GRANT SELECT, INSERT, UPDATE, DELETE, CREATE VIEW, CREATE, INDEX, DROP on orthomcl.* TO orthomcl@localhost;
mysql> set password for orthomcl@localhost = password("1234");
 
If you have any problems setting up and running the mysql server, please let us know immediately.
 
 

 

Galaxy wrapper for SPAdes 2.5.1 is released

Thanks to our user Lionel Guy, it is now possible to integrate SPAdes into Galaxy pipelines seamlessly. The wrapper is available on the Galaxy Tool Shed at http://toolshed.g2.bx.psu.edu/view/lionelguy/spades

SPAdes 2.5.1 is released

We are happy to announce that version 2.5.1 of the SPAdes single-cell assembler has been released. 

This version contains mostly minor improvements and fixes:

  • more user-friendly error reports,
  • fewer misassemblies on single-cell data sets with low covered genome fraction,
  • decreased memory consumption during the error correction stage.

The release also contains new features:

  • running SPAdes from checkpoints: one can restart the assembler after a crash without rerunning finished steps,
  • automatic k-mer size selection for standard data sets using the maximal read length.

You can download SPAdes 2.5.1 here.

Pavel Pevzner finished the first lecture on Bioinformatics Algorithms for Coursera

In St. Petersburg, Pavel Pevzner finished his first lecture for Coursera's Bioinformatics Algorithms course.

The course will cover common algorithms underlying the fundamental topics in bioinformatics: genome assembly, comparing DNA and protein sequences, finding genes and regulatory motifs, analyzing gene expression, constructing evolutionary trees, analyzing genome rearrangements, and identifying proteins.

The instructors are the Lab's leading scientist Pavel Pevzner, Phillip E. C. Compeau, and the creator of Rosalind Nikolay Vyahhi.

 

via @bioinforussia.

Final summer internship presentation

The final presentation of the summer internship projects took place on 6 September in the Academic University.

 

Artem Tarasov: Utilizing reference genomes for assembly refinement.

 

 

Petar Ivanov: Applying chimeric read information for genome assembly.

 

 

Vitaliy Demyanuk: Antibody sequencing from mass spectra.

 
