SPAdes Manual

SPAdes stands for St. Petersburg genome assembler. It is intended for both single-cell and standard (multicell) assemblies. This manual will help you install and run SPAdes.

We recommend running SPAdes with preprocessing (error correction) and postprocessing (contig refinement) steps.

In our experience, the error correction tools BayesHammer and Quake work well for multicell datasets. For single-cell datasets, however, we recommend BayesHammer rather than Quake: Quake was not designed for single-cell datasets and produces inferior results on them. The performance of SPAdes on single-cell datasets deteriorates significantly if BayesHammer is not run.

While SPAdes produces accurate assemblies, we recommend running NGS-Refine after SPAdes to further reduce the number of small errors (single nucleotide substitutions and small indels).

BayesHammer and NGS-Refine will be released after the papers describing these tools are accepted. Meanwhile, if you need to run these tools, please contact SPAdes support (spades_support at spbau.ru).

Getting SPAdes

The latest version of the source code can be downloaded from here. The following commands show how to download and unpack the archived file directly from the command line.

wget http://bioinf.spbau.ru/sites/default/files/spades.zip

unzip spades.zip

Requirements: packages

The list of packages required for using SPAdes is given below. The command

sudo ./install_prerequirements

installs all these packages automatically if you are using apt (Advanced Package Tool). If your operating system does not support the apt-get command, you need to install the following packages manually.

package	description	recommended version
gcc++-4.4	GNU C Compiler	4.4
python2.6-dev	Python	2.6
cmake	make system	2.6
cmake-curses-gui	curses based user interface for cmake	2.6
liblog4cxx10-dev	logging library for C++
libboost1.42-all-dev	Boost C++ libraries	1.42
zlib-bin	compression library

Requirements: RAM

We recommend running SPAdes on a 64-bit Linux system. For example, on a multicell E. coli dataset SPAdes uses about 700 MB of RAM, while on a single-cell E. coli dataset it needs about 6 GB of RAM, which is more than a 32-bit process can address.

Compiling

Once all the required packages are installed, run

./prepare_cfg

in the root directory. This collects all dependencies and runs cmake.

Preparing input data

SPAdes requires paired-end reads to be in separate files. Additionally, SPAdes can use unpaired reads, which normally appear when one read of a pair is discarded during the error correction step. Thus input reads should be arranged into four files: left parts of paired reads, right parts of paired reads, unpaired reads that were originally left parts, and unpaired reads that were originally right parts. The first two files must contain the same number of reads, while there are no requirements on the number of reads in the last two files (either of them can even be empty). Files are expected to be in FASTA or FASTQ format and can be compressed.
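The requirement that the two paired files contain the same number of reads can be checked before running the assembler. Below is a minimal sketch of such a check, not part of SPAdes itself. The function names (open_reads, count_fastq_reads, check_paired_files) are hypothetical, and the sketch assumes FASTQ input with exactly four lines per record, either plain or gzip-compressed.

```python
import gzip
from pathlib import Path


def open_reads(path):
    """Open a plain or gzip-compressed reads file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_fastq_reads(path):
    """Count FASTQ records, assuming four non-empty lines per record."""
    with open_reads(path) as handle:
        n_lines = sum(1 for line in handle if line.strip())
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4


def check_paired_files(left, right):
    """Return the shared read count, or raise if the two files disagree."""
    n_left = count_fastq_reads(left)
    n_right = count_fastq_reads(right)
    if n_left != n_right:
        raise ValueError(f"paired files differ: {n_left} vs {n_right} reads")
    return n_left
```

For instance, check_paired_files("E.coli/s_6_1.fastq.gz", "E.coli/s_6_2.fastq.gz") would raise an error if the left and right files got out of sync. The two unpaired files need no such check.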

In the file configs/debruijn/datasets.info, add a new entry following the self-explanatory pattern below (recall that the part of a line starting from a semicolon is a comment). Note that this file may contain any number of such entries.

ECOLI_IS220_QUAKE
{
    first E.coli/s_6_1.fastq.gz ; paired left
    second E.coli/s_6_2.fastq.gz ; paired right
    single_first E.coli/s_6_1.single.fastq.gz ; unpaired left (optional)
    single_second E.coli/s_6_2.single.fastq.gz ; unpaired right (optional)
    RL 100 ; read length
    single_cell false ; true if input data was obtained with MDA (single-cell) technology
    reference_genome E.coli/MG1655-K12.fasta.gz ; optional
}

Note that you do not need to specify the insert size and its deviation as SPAdes computes them itself.
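When many datasets have to be registered, such entries can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the helper format_dataset_entry is hypothetical (it is not shipped with SPAdes), it uses only the keys shown in the pattern above, and the indentation it emits is cosmetic.

```python
def format_dataset_entry(name, first, second, read_length, single_cell=False,
                         single_first=None, single_second=None,
                         reference_genome=None):
    """Render one datasets.info entry using the keys from the pattern above."""
    fields = [("first", first), ("second", second)]
    # Unpaired files and the reference genome are optional in the pattern.
    if single_first:
        fields.append(("single_first", single_first))
    if single_second:
        fields.append(("single_second", single_second))
    fields.append(("RL", str(read_length)))
    fields.append(("single_cell", "true" if single_cell else "false"))
    if reference_genome:
        fields.append(("reference_genome", reference_genome))
    lines = [name, "{"]
    lines += [f"    {key} {value}" for key, value in fields]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The returned text can then be appended to configs/debruijn/datasets.info. Insert size is deliberately absent from the helper, since SPAdes computes it itself.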

Running SPAdes

To run SPAdes, type

./spades.py config.info

By default (i.e., if no config file is given) SPAdes uses the file spades_config.info. Running ./spades.py right after downloading and compiling SPAdes runs it on the test dataset (the first 1 Kb of E. coli) that is provided together with the source code.

Below we first give an example of a config file and then explain its contents in detail.

iterative_K 21 33 55
paired_mode true
dataset ECOLI_IS220_QUAKE_1K
input_dir ./data/input/
output_dir ./data/debruijn/
measure_quality true
output_to_console true

iterative_K sets one or more k-mer sizes. Informally, smaller values of k make the graph more connected but also more tangled, while larger values of k may fragment the graph but make it possible to resolve short repeats. See the paper for more details.
paired_mode turns the repeat resolver on or off.
dataset is the name of the dataset as it is given in configs/debruijn/datasets.info (see Preparing input data).
input_dir is the directory where the corresponding dataset is stored.
output_dir is the output directory.
measure_quality controls whether the quality estimation tool is called after the assembly is performed (the tool computes the usual metrics such as N50, genome coverage, number of misassemblies, etc.).
output_to_console controls whether log messages are written to the console.
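The options above can also be written to a config file from a script, for example when launching several assemblies with different k-mer sizes. The sketch below is an illustration only: render_config and run_spades are hypothetical helpers, and the only facts taken from this manual are the option names and the ./spades.py config.info invocation.

```python
import subprocess


def render_config(iterative_k, dataset, input_dir, output_dir,
                  paired_mode=True, measure_quality=True,
                  output_to_console=True):
    """Render a config file with the seven options described above."""
    def flag(value):
        return "true" if value else "false"

    lines = [
        "iterative_K " + " ".join(str(k) for k in iterative_k),
        f"paired_mode {flag(paired_mode)}",
        f"dataset {dataset}",
        f"input_dir {input_dir}",
        f"output_dir {output_dir}",
        f"measure_quality {flag(measure_quality)}",
        f"output_to_console {flag(output_to_console)}",
    ]
    return "\n".join(lines) + "\n"


def run_spades(config_path, spades_script="./spades.py"):
    """Call spades.py on a config file; raises if the run fails."""
    return subprocess.run([spades_script, config_path], check=True)
```

A typical use would be to write render_config([21, 33, 55], "ECOLI_IS220_QUAKE_1K", "./data/input/", "./data/debruijn/") to a file such as config.info and then pass that path to run_spades from the SPAdes root directory.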

Understanding the output

Results can be found in data/debruijn/DATASET_NAME/DATE_TIME. The specific folder is given at the end of the log. Also, there is a folder containing statistics on different metrics (like N50) of the resulting contigs.

All the resulting information can be found here: ./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/
* Resulting contigs are called final_contigs.fasta
* Assessment of their quality is in quality_results
Thank you for using SPAdes!
== Assembling finished. Log can be found here:
./data/debruijn/SAUREUS_JCVI_BH/build_02.07_19.05.56/spades.log
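As a quick sanity check of the resulting contigs, N50 can be recomputed directly from final_contigs.fasta. The sketch below uses the common definition of N50 (the largest length L such that contigs of length at least L cover at least half of the total assembly length). The functions read_fasta_lengths and n50 are hypothetical helpers, and the value reported in quality_results may differ if the quality tool applies additional filtering, for example a minimum contig length.

```python
def read_fasta_lengths(path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    current = None  # length of the record being read, None before first header
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Largest L such that contigs of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs at all
```

For the run shown above, the input would be the final_contigs.fasta file inside the build folder named at the end of the log.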