SPAdes Manual

SPAdes stands for St. Petersburg genome assembler. It is intended for both single cell and standard (multicell) assemblies. This manual will help you to install and run SPAdes.

We recommend running SPAdes with pre-processing (error correction) and post-processing (contig refinement) steps.

In our experience, the error correction tools BayesHammer and Quake work well for multicell datasets. However, for single cell datasets, we recommend BayesHammer rather than Quake; Quake was not designed for single-cell datasets, and produces inferior results. The performance of SPAdes on single-cell datasets deteriorates significantly without running BayesHammer. 

While SPAdes produces accurate assemblies, we recommend running NGS-Refine after SPAdes to further reduce the number of small errors (single nucleotide substitutions and small indels).

BayesHammer and NGS-Refine will be released after the papers describing these tools are accepted. Meanwhile, if you need to run these tools please contact SPAdes support (spades_support at spbau.ru).

 

Getting SPAdes
The latest version of the source code can be downloaded from here. The following commands show how to download and unpack the archive file directly from the command line.
wget http://bioinf.spbau.ru/sites/default/files/spades.zip
unzip spades.zip
 
Requirements: packages
The list of packages required for using SPAdes is given below. The command
sudo ./install_prerequirements
installs all these packages automatically if you are using apt (Advanced Package Tool). If your operating system does not support the apt-get command, you need to install the following packages manually.
 
package               description                            recommended version
gcc++-4.4             GNU C Compiler                         4.4
python2.6-dev         Python                                 2.6
cmake                 make system                            2.6
cmake-curses-gui      curses based user interface for cmake  2.6
liblog4cxx10-dev      logging library for C++
libboost1.42-all-dev  Boost C++ libraries                    1.42
zlib-bin              compression library
 
Requirements: RAM
It is recommended to run SPAdes on a 64-bit Linux system. For example, on a multicell E. coli dataset SPAdes uses about 700 MB of RAM, while on a single-cell E. coli dataset it needs about 6 GB of RAM.
 
Compiling
When all the required packages are installed, just run
./prepare_cfg
in the root directory. This collects all dependencies and runs cmake.
 
Preparing input data
SPAdes requires paired-end reads to be in separate files. Additionally, SPAdes can use unpaired reads, which normally appear when one read of a pair is discarded during the error correction step. Thus input reads should be arranged into four files: left parts of paired reads, right parts of paired reads, unpaired reads that were originally left parts, and unpaired reads that were originally right parts. The first two files must contain the same number of reads, while there are no requirements on the number of reads in the last two files (either of them can even be empty). Files are expected to be in FASTA or FASTQ format and can be compressed.
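The file layout described above can be checked before an assembly is started. The following Python sketch is not part of SPAdes. It counts reads in FASTA or FASTQ files (optionally gzip-compressed) and verifies that the two paired files hold the same number of reads. The function names count_reads and check_paired are hypothetical, and the sketch assumes four-line FASTQ records and that compressed files end in .gz.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records in a FASTA or FASTQ file.

    The format is guessed from the first character of the file.
    Assumption: FASTQ records occupy exactly four lines each.
    """
    with open_maybe_gzip(path) as handle:
        first = handle.readline()
        if not first:
            return 0  # an empty file is allowed for the unpaired reads
        if first.startswith(">"):
            # FASTA: one header line per record
            return 1 + sum(1 for line in handle if line.startswith(">"))
        if first.startswith("@"):
            # FASTQ: total number of lines divided by four
            remaining = sum(1 for _ in handle)
            return (remaining + 1) // 4
        raise ValueError("unrecognized read format in %s" % path)


def check_paired(left_path, right_path):
    """Return the common read count, or raise if the two files differ."""
    left = count_reads(left_path)
    right = count_reads(right_path)
    if left != right:
        raise ValueError(
            "paired files differ: %d reads in %s, %d reads in %s"
            % (left, left_path, right, right_path)
        )
    return left
```

Calling check_paired on the two paired files raises an error on a mismatch; the two unpaired files need no such check because their sizes are unconstrained.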
 
In the file configs/debruijn/datasets.info, add a new entry according to the following self-explanatory pattern (recall that the part of a line starting from a semicolon is a comment). Note that this file may contain any number of such entries.
 

ECOLI_IS220_QUAKE
{
first             E.coli/s_6_1.fastq.gz ; paired left
second            E.coli/s_6_2.fastq.gz ; paired right
single_first      E.coli/s_6_1.single.fastq.gz ; unpaired left (optional)
single_second     E.coli/s_6_2.single.fastq.gz ; unpaired right (optional)
RL                100 ; read length
single_cell       false ; true if input data was obtained
                        ; with mda (single cell) technology
reference_genome  E.coli/MG1655-K12.fasta.gz ; optional
}

 

Note that you do not need to specify the insert size and its deviation as SPAdes computes them itself.
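When many datasets have to be registered, the entries can be generated rather than typed. The following Python sketch is not part of SPAdes; it only formats text in the pattern shown above. The field names are taken from the example entry, while the function name dataset_entry, the dataset name MY_DATASET, and the file paths are hypothetical.

```python
def dataset_entry(name, first, second, read_length, single_cell=False,
                  single_first=None, single_second=None,
                  reference_genome=None):
    """Format one datasets.info entry following the pattern in this manual.

    Optional fields are omitted when they are not given.
    """
    fields = [("first", first), ("second", second)]
    if single_first:
        fields.append(("single_first", single_first))
    if single_second:
        fields.append(("single_second", single_second))
    fields.append(("RL", str(read_length)))
    fields.append(("single_cell", "true" if single_cell else "false"))
    if reference_genome:
        fields.append(("reference_genome", reference_genome))

    lines = [name, "{"]
    # Pad keys so that the values line up, as in the example entry.
    lines += ["%-17s %s" % (key, value) for key, value in fields]
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical dataset name and paths, for illustration only.
    print(dataset_entry("MY_DATASET",
                        "my_data/reads_1.fastq.gz",
                        "my_data/reads_2.fastq.gz",
                        read_length=100))
```

The returned text can be appended to configs/debruijn/datasets.info; the insert size is deliberately absent because SPAdes computes it itself.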

 

Running SPAdes

To run SPAdes, type
./spades.py config.info
 
By default (i.e., if no config file is given) SPAdes uses the file spades_config.info. Running ./spades.py just after downloading and compiling it runs SPAdes on the test dataset (the first 1Kb of E. coli) that is provided together with the source code of SPAdes.
Below we first give an example of a config file and then explain its contents in detail.
 
iterative_K       21 33 55
paired_mode       true
dataset           ECOLI_IS220_QUAKE_1K
input_dir         ./data/input/
output_dir        ./data/debruijn/
measure_quality   true
output_to_console true
 
  • iterative_K sets several k-mer sizes. Informally, smaller values of k make the graph more connected but at the same time more tangled, while higher values of k may fragment the graph but allow short repeats to be resolved. See the paper for more details.
  • paired_mode turns the repeat resolver on/off.
  • dataset is the name of the dataset as it is given in configs/debruijn/datasets.info (see the section "Preparing input data").
  • input_dir is the directory where the corresponding dataset is stored.
  • output_dir is the output directory.
  • measure_quality flag controls whether the quality estimation tool is called after the assembly is performed (the tool computes the usual metrics such as N50, genome coverage, number of misassemblies, etc.).
  • output_to_console flag controls outputting log messages to the console.
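To inspect a config file from a script, the key/value layout shown above can be read with a few lines of Python. This is only a sketch and not how SPAdes itself reads the file. It assumes one "key value" pair per line and that semicolon comments work as in datasets.info; the function names parse_config and interpret are hypothetical.

```python
EXAMPLE_CONFIG = """\
iterative_K       21 33 55
paired_mode       true
dataset           ECOLI_IS220_QUAKE_1K
input_dir         ./data/input/
output_dir        ./data/debruijn/
measure_quality   true
output_to_console true
"""


def parse_config(text):
    """Split each non-empty line into a key and the rest of the line."""
    config = {}
    for raw in text.splitlines():
        # Assumption: a semicolon starts a comment, as in datasets.info.
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        config[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return config


def interpret(config):
    """Convert the k-mer list to integers and true/false flags to booleans."""
    result = {}
    for key, value in config.items():
        if key == "iterative_K":
            result[key] = [int(k) for k in value.split()]
        elif value in ("true", "false"):
            result[key] = (value == "true")
        else:
            result[key] = value
    return result


if __name__ == "__main__":
    for key, value in interpret(parse_config(EXAMPLE_CONFIG)).items():
        print(key, "=", value)
```

Keeping parsing and interpretation separate means the raw strings stay available if a key needs different handling than the sketch assumes.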

 

Understanding the output

 

Results can be found in data/debruijn/DATASET_NAME/DATE_TIME. The specific folder is given at the end of the log. Also, there is a folder containing statistics on different metrics (like N50) of the resulting contigs.
 
All the resulting information can be found here: ./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/
* Resulting contigs are called final_contigs.fasta
* Assessment of their quality is in quality_results
Thank you for using SPAdes!
== Assembling finished. Log can be found here:
./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/spades.log
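For a quick look at the result without opening the statistics folder, the contigs file can be examined directly. The Python sketch below is not the quality estimation tool shipped with SPAdes, and its N50 may differ from the reported one if that tool applies extra conventions (for example, a minimum contig length). The function names are hypothetical, and picking the most recently modified folder as the latest run is an assumption.

```python
from pathlib import Path


def latest_run_dir(dataset_dir):
    """Return the most recently modified subfolder, or None if there is none."""
    runs = [p for p in Path(dataset_dir).iterdir() if p.is_dir()]
    if not runs:
        return None
    return max(runs, key=lambda p: p.stat().st_mtime)


def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest length L such that contigs of length >= L
    cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    # Dataset folder taken from the log excerpt above.
    run = latest_run_dir("./data/debruijn/SAUREUS_JCVI_BH")
    if run is not None:
        lengths = read_contig_lengths(run / "final_contigs.fasta")
        print("contigs:", len(lengths), "N50:", n50(lengths))
```

The log remains the authoritative source for the run folder; the helper is only a convenience when the log is not at hand.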