Assembling Long Illumina Paired-End Reads (2x150 and 2x250) with SPAdes

Submitted by akorobeynikov on 6 May 2013, Mon, 17:46

Introduction

Recent advances in DNA sequencing technology have led to a rapid increase in read length. It is now common to have a dataset consisting of 2x150 or 2x250 paired-end reads produced by Illumina MiSeq or HiSeq 2500. However, longer reads alone will not automatically improve assembly quality. An assembler that can exploit their advantages is needed.

Because SPAdes iterates over several k-mer lengths, it can exploit the full potential of long paired-end reads. Currently the assembler options have to be set manually, but we plan to add automatic calculation of the necessary options soon.

Please note that insert length matters a lot, not only read length. Sequencing a 300 bp fragment as a pair of 250 bp reads is suboptimal, because the two reads overlap over most of their length. We suggest using 350-500 bp fragments with 2x150 reads and 550-700 bp fragments with 2x250 reads.
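To see why a short fragment wastes long reads, it helps to compute how many bases are covered by both mates of a pair. The following is a minimal sketch, not part of SPAdes. The function name mate_overlap is hypothetical, and the calculation assumes that both reads have the same length and are sequenced inward from opposite ends of the fragment.

```python
def mate_overlap(fragment_len: int, read_len: int) -> int:
    """Bases covered by both mates of a pair (0 if the mates do not overlap)."""
    # A mate cannot be longer than the fragment it is read from.
    effective_read = min(read_len, fragment_len)
    # Two mates read inward from both ends overlap by whatever exceeds the fragment.
    return max(0, 2 * effective_read - fragment_len)


if __name__ == "__main__":
    for fragment_len, read_len in [(300, 250), (550, 250), (350, 150)]:
        overlap = mate_overlap(fragment_len, read_len)
        print(f"{fragment_len} bp fragment, 2x{read_len}: {overlap} bp overlap")
```

With a 300 bp fragment and 2x250 reads, 200 of the 300 bases are sequenced twice, whereas fragments in the suggested ranges leave no overlap between the mates.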

Multi-cell dataset with read length 2 x 150

General rules

Make sure your reads are error-corrected prior to assembly, either with Quake (recommended) or with BayesHammer (integrated into the SPAdes pipeline).
The default k-mer lengths are 21, 33, 55, which might work well. If you have enough coverage (50x or more), you may want to try k-mer lengths of 21, 33, 55, 77.
Run the assembler in "careful" mode (--careful) to minimize the number of mismatches in the final contigs. You can try non-careful mode as well: since SPAdes 2.5 it might work well with respect to mismatch rate.
We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.

spades.py command line

For reads corrected prior to assembly run: spades.py -k 21,33,55,77 --careful --only-assembler <your reads>
For non-corrected reads run: spades.py -k 21,33,55,77 --careful <your reads>
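The <your reads> placeholder above has to be replaced with the actual input options. As a rough sketch, the helper below assembles a full argument list for a single paired-end library. The function name build_spades_command and the file names are hypothetical. It assumes the -1, -2 and -o options (forward reads, reverse reads, output directory), so please check spades.py --help for your SPAdes version before relying on it.

```python
import shlex


def build_spades_command(k_values, left_reads, right_reads, output_dir,
                         careful=True, only_assembler=False):
    """Return the spades.py argument list for one paired-end library."""
    cmd = ["spades.py", "-k", ",".join(str(k) for k in k_values)]
    if careful:
        cmd.append("--careful")
    if only_assembler:
        # Skip read error correction, for reads that were already corrected.
        cmd.append("--only-assembler")
    cmd += ["-1", left_reads, "-2", right_reads, "-o", output_dir]
    return cmd


if __name__ == "__main__":
    # Hypothetical file and directory names; replace them with your own.
    cmd = build_spades_command([21, 33, 55, 77], "reads_1.fastq",
                               "reads_2.fastq", "spades_out",
                               only_assembler=True)
    # Print a shell-safe command line that can be copied into a terminal.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can also be passed directly to subprocess.run if you prefer to launch the assembly from a script.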

Multi-cell dataset with read length 2 x 250

General rules

Make sure your reads are error-corrected prior to assembly, either with Quake (recommended) or with BayesHammer (integrated into the SPAdes pipeline).
By default we suggest increasing k-mer lengths in steps of about 22 (for example 33, 55, 77, 99) until the k-mer length reaches 127. The largest k-mer length worth using depends on the coverage: a k-mer length of 127 is appropriate when the k-mer coverage is 50x or higher.
Run the assembler in "careful" mode (--careful) to minimize the number of mismatches in the final contigs. You can try non-careful mode as well: since SPAdes 2.5 it might work well with respect to mismatch rate.
We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
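The rules above refer to k-mer coverage without defining it. The sketch below assumes the commonly used relation C_k = C * (L - k + 1) / L, where C is the base coverage and L is the read length. This relation, the function name kmer_coverage and the example coverage value are assumptions made for illustration, not something SPAdes reports in this form.

```python
def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """Expected k-mer coverage, assuming C_k = C * (L - k + 1) / L."""
    if k > read_len:
        return 0.0  # a read shorter than k contains no k-mers
    return base_coverage * (read_len - k + 1) / read_len


if __name__ == "__main__":
    read_len = 250
    base_coverage = 100.0  # hypothetical value; use your own estimate
    for k in (21, 33, 55, 77, 99, 127):
        cov = kmer_coverage(base_coverage, read_len, k)
        print(f"k={k}: about {cov:.1f}x k-mer coverage")
```

Under this assumption, 100x base coverage with 250 bp reads corresponds to roughly 50x k-mer coverage at a k-mer length of 127, so noticeably lower base coverage may call for a smaller maximum k-mer length.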

spades.py command line

For reads corrected prior to assembly run: spades.py -k 21,33,55,77,99,127 --careful --only-assembler <your reads>
For non-corrected reads run: spades.py -k 21,33,55,77,99,127 --careful <your reads>

Single-cell dataset with read lengths 2 x 150 or 2 x 250

The default options are recommended.
However, fully utilizing the advantages of long reads can be tricky. Consider contacting us for more information and a discussion of assembly strategy.