SPAdes stands for St. Petersburg genome assembler. It is intended for both single-cell and standard (multicell) assemblies. This manual will help you to install and run SPAdes.
We recommend running SPAdes with pre-processing (error correction) and post-processing (contig refinement) steps.
In our experience, the error correction tools BayesHammer and Quake work well for multicell datasets. However, for single cell datasets, we recommend BayesHammer rather than Quake; Quake was not designed for single-cell datasets, and produces inferior results. The performance of SPAdes on single-cell datasets deteriorates significantly without running BayesHammer.
While SPAdes produces accurate assemblies, we recommend running NGS-Refine after SPAdes to further reduce the number of small errors (single nucleotide substitutions and small indels).
BayesHammer and NGS-Refine will be released after the papers describing these tools are accepted. Meanwhile, if you need to run these tools please contact SPAdes support (spades_support at spbau.ru).
Getting SPAdes
The latest version of the source code can be downloaded as a zip archive. The following commands show how to download and unpack the archive directly from the command line.
wget http://bioinf.spbau.ru/sites/default/files/spades.zip
unzip spades.zip
Requirements: packages
The list of packages required for using SPAdes is given below. The command
sudo ./install_prerequirements
installs all these packages automatically if you are using apt (Advanced Package Tool). If your operating system does not support the apt-get command, you need to install the following packages manually.
package | description | recommended version
gcc++-4.4 | GNU C Compiler | 4.4
python2.6-dev | Python | 2.6
cmake | make system | 2.6
cmake-curses-gui | curses based user interface for cmake | 2.6
liblog4cxx10-dev | logging library for C++ |
libboost1.42-all-dev | Boost C++ libraries | 1.42
zlib-bin | compression library |
Requirements: RAM
It is recommended to run SPAdes on a 64-bit Linux system. For example, on a multicell E. coli dataset SPAdes uses about 700 MB of RAM, while on a single-cell E. coli dataset it needs about 6 GB of RAM.
Compiling
When all the required packages are installed, just run
./prepare_cfg
in the root directory. This collects all dependencies and runs cmake.
Preparing input data
SPAdes requires paired-end reads to be in separate files. Additionally, SPAdes can use unpaired reads, which normally appear when one read of a pair is discarded during the error correction step. Thus input reads should be arranged into four files: left reads of pairs, right reads of pairs, unpaired reads that were originally left reads, and unpaired reads that were originally right reads. The first two files must contain the same number of reads, while there are no requirements on the number of reads in the last two files (either of them can even be empty). Files are expected to be in FASTA or FASTQ format and can be compressed.
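Before adding a dataset, it can be useful to confirm that the two paired files really contain the same number of reads. The following Python sketch is not part of SPAdes; the function names are hypothetical. It assumes gzip compression (a .gz suffix), FASTQ records of exactly four lines each, and FASTA records that each start with a '>' header line.

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count reads in a FASTA or FASTQ file (format guessed from line 1)."""
    with open_maybe_gzip(path) as handle:
        first = handle.readline()
        if not first:
            return 0  # empty file: no reads
        if first.startswith(">"):
            # FASTA: one header line per read
            return 1 + sum(1 for line in handle if line.startswith(">"))
        # FASTQ (assumed unwrapped): four lines per read
        return (1 + sum(1 for _ in handle)) // 4


def check_paired(left_path, right_path):
    """Return the shared read count, or raise if the files disagree."""
    left = count_reads(left_path)
    right = count_reads(right_path)
    if left != right:
        raise ValueError(
            "paired files differ: %d reads vs %d reads" % (left, right)
        )
    return left
```

Calling check_paired on the two paired files either returns the common read count or raises an error naming both counts; the two unpaired files need no such check.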
In the file configs/debruijn/datasets.info, add a new entry following the self-explanatory pattern below (note that the part of a line starting from a semicolon is a comment). This file may contain any number of such entries.
ECOLI_IS220_QUAKE
{
first E.coli/s_6_1.fastq.gz ; paired left
second E.coli/s_6_2.fastq.gz ; paired right
single_first E.coli/s_6_1.single.fastq.gz ; unpaired left (optional)
single_second E.coli/s_6_2.single.fastq.gz ; unpaired right (optional)
RL 100 ; read length
single_cell false ; true if input data was obtained
; with mda (single cell) technology
reference_genome E.coli/MG1655-K12.fasta.gz ; optional
}
Note that you do not need to specify the insert size and its deviation as SPAdes computes them itself.
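When many datasets have to be registered, entries in the pattern above can be generated rather than typed by hand. The sketch below is only an illustration, not a SPAdes utility: the function name, the dataset name MY_DATASET and the file paths are hypothetical, and the output simply mirrors the layout of the example entry.

```python
def format_dataset_entry(name, fields):
    """Render one datasets.info entry in the pattern shown above.

    `fields` is a list of (key, value, comment) tuples; the comment may be
    empty. Booleans are written as lowercase true/false.
    """
    lines = [name, "{"]
    for key, value, comment in fields:
        if isinstance(value, bool):
            value = "true" if value else "false"
        line = "%s %s" % (key, value)
        if comment:
            line += " ; " + comment
        lines.append(line)
    lines.append("}")
    return "\n".join(lines) + "\n"


entry = format_dataset_entry(
    "MY_DATASET",  # hypothetical dataset name
    [
        ("first", "reads/left.fastq.gz", "paired left"),    # hypothetical path
        ("second", "reads/right.fastq.gz", "paired right"),  # hypothetical path
        ("RL", 100, "read length"),
        ("single_cell", False, ""),
    ],
)
print(entry)
```

The resulting text can then be appended to configs/debruijn/datasets.info.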
Running SPAdes
To run SPAdes, type
./spades.py config.info
By default (i.e., if no config file is given) SPAdes uses the file spades_config.info. Running ./spades.py right after downloading and compiling runs SPAdes on the test dataset (the first 1 Kb of E. coli) that is provided together with the source code of SPAdes.
Below we first give an example of a config file and then explain its contents in detail.
iterative_K 21 33 55
paired_mode true
dataset ECOLI_IS220_QUAKE_1K
input_dir ./data/input/
output_dir ./data/debruijn/
measure_quality true
output_to_console true
- iterative_K sets one or more k-mer sizes. Informally, smaller values of k make the graph more connected, but at the same time more tangled, while higher values of k may fragment the graph, but allow short repeats to be resolved. See the paper for more details.
- paired_mode turns the repeat resolver on/off.
- dataset is the name of the dataset as it is given in configs/debruijn/datasets.info (see Preparing input data).
- input_dir is the directory where the corresponding dataset is stored.
- output_dir is the output directory.
- measure_quality flag runs the quality estimation tool after the assembly is finished (the tool computes the usual metrics such as N50, genome coverage, number of misassemblies, etc).
- output_to_console flag controls outputting log messages to the console.
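To inspect or validate a config file from a script, the key/value layout shown above is easy to read. The following Python sketch is not the parser SPAdes itself uses; parse_config is a hypothetical helper. It assumes one setting per line, whitespace-separated values, and (as in datasets.info) that a semicolon starts a comment.

```python
def parse_config(text):
    """Parse 'key value [value ...]' lines into a dict."""
    config = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop comments (assumed syntax)
        if not line:
            continue
        key, *values = line.split()
        if key == "iterative_K":
            # several k-mer sizes on one line
            config[key] = [int(v) for v in values]
        elif len(values) == 1 and values[0] in ("true", "false"):
            config[key] = values[0] == "true"
        else:
            config[key] = " ".join(values)
    return config


example = """\
iterative_K 21 33 55
paired_mode true
dataset ECOLI_IS220_QUAKE_1K
input_dir ./data/input/
output_dir ./data/debruijn/
measure_quality true
output_to_console true
"""

config = parse_config(example)
print(config["iterative_K"], config["dataset"])
```

Such a check can catch, for instance, a non-numeric k value before a long assembly run is started.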
Understanding the output
Results can be found in data/debruijn/DATASET_NAME/DATE_TIME. The specific folder is given at the end of the log, as in the example below. There is also a folder containing statistics on different metrics (like N50) of the resulting contigs.
All the resulting information can be found here: ./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/
* Resulting contigs are called final_contigs.fasta
* Assessment of their quality is in quality_results
Thank you for using SPAdes!
== Assembling finished. Log can be found here:
./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/spades.log
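For a quick independent look at the resulting contigs, N50 can be computed directly from final_contigs.fasta. The sketch below is not the quality estimation tool shipped with SPAdes; the function names are hypothetical. It uses the common definition of N50: the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length.

```python
def read_contig_lengths(path):
    """Return the length of every sequence in an uncompressed FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return N50 of the given contig lengths, or 0 for an empty list."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, n50(read_contig_lengths(path)), with path pointing at final_contigs.fasta in the folder named at the end of the log, gives a value to compare with the statistics in quality_results.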