Algorithmic Biology Lab

Pavel Pevzner finished the first lecture on Bioinformatics Algorithms for Coursera

In St. Petersburg, Pavel Pevzner finished his first lecture for Coursera's Bioinformatics Algorithms course.

The course will cover common algorithms underlying the fundamental topics in bioinformatics: genome assembly, comparing DNA and protein sequences, finding genes and regulatory motifs, analyzing gene expression, constructing evolutionary trees, analyzing genome rearrangements, and identifying proteins.

The instructors are the Lab's leading scientist Pavel Pevzner, Phillip E. C. Compeau, and Nikolay Vyahhi, the creator of Rosalind.

via @bioinforussia.

Final summer internship presentation

The final presentation of the summer internship projects took place on 6 September at the Academic University.

Artem Tarasov: Utilizing reference genomes for assembly refinement.

Petar Ivanov: Applying chimeric read information for genome assembly.

Vitaliy Demyanuk: Antibody sequencing from mass spectra.

Summer School for Bioinformatics

The Summer School of the Bioinformatics Institute took place in Moscow at the beginning of August.

The school was dedicated to genome and transcriptome analysis, NGS, epigenetics, comparative genomics and molecular evolution. In addition to general lectures on bioinformatics, there were special courses aimed at improving the programming skills of biologists and, conversely, at teaching computer scientists molecular biology and biotechnology.

Biologists and computer scientists

Lab members Alla Lapidus, Andrey Prjibelsky and Pavel Pevzner took part in preparing the school curriculum. Lectures were given by Sergey Nurk, Alexey Gurevich, Anton Korobeynikov, Alla Lapidus and Andrey Prjibelsky.

Conference and a School in Novosibirsk

The international conference HSG-2013 on high-throughput sequencing was held in Novosibirsk on July 21 through 25, 2013. Lab members Alla Lapidus and Andrey Prjibelsky were invited speakers at the conference:

Alla Lapidus: «Genome assembly and finishing–why high quality references are needed»,
Andrey Prjibelsky: «Genome draft assembly algorithms: from the very beginning till present-day problems».

Right before the conference, a scientific and practical school for young researchers on genomic sequencing and data analysis took place. Alla Lapidus gave a lecture on the practical application of new sequencing technologies; Andrey Prjibelsky explained an easy way to assemble a genome from NGS data in 30 minutes.

SPAdes 2.5 released: now supports multiple read-pair libraries

Version 2.5.0 of the SPAdes single-cell assembler has just been released.

The main feature of the new version is support for multiple paired-end and mate-pair libraries. Using the command line interface, you can specify up to 5 libraries of each type and, as before, an unlimited number of single-read libraries. If you need to assemble more libraries simultaneously, write to our support.
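As a rough illustration of the multi-library interface, here is a minimal Python sketch that builds a SPAdes argument list for several paired-end and mate-pair libraries. The flag pattern (`--pe<N>-1`, `--pe<N>-2`, `--mp<N>-1`, `--mp<N>-2`), the `-o` output option and all file names are assumptions for illustration; please check the SPAdes manual for the exact syntax of your version.

```python
import shlex

MAX_LIBS_PER_TYPE = 5  # limit stated in the release note above


def spades_library_args(paired_end=(), mate_pair=()):
    """Build SPAdes arguments for several paired read libraries.

    paired_end, mate_pair: sequences of (left_file, right_file) tuples.
    The flag names used here are assumed, not taken from the release note.
    """
    args = []
    for prefix, libs in (("pe", paired_end), ("mp", mate_pair)):
        if len(libs) > MAX_LIBS_PER_TYPE:
            raise ValueError(
                f"at most {MAX_LIBS_PER_TYPE} '{prefix}' libraries are supported"
            )
        # Libraries are numbered from 1 within each type.
        for number, (left, right) in enumerate(libs, start=1):
            args += [f"--{prefix}{number}-1", left, f"--{prefix}{number}-2", right]
    return args


# Placeholder file names; replace with your own read files.
cmd = ["spades.py", "-o", "assembly_out"] + spades_library_args(
    paired_end=[("pe1_R1.fastq", "pe1_R2.fastq"), ("pe2_R1.fastq", "pe2_R2.fastq")],
    mate_pair=[("mp1_R1.fastq", "mp1_R2.fastq")],
)
print(shlex.join(cmd))
```

The sketch only prints the command; passing `cmd` to `subprocess.run` would launch the assembler.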

We also changed the repeat resolution strategy, which decreased mismatch and indel rates compared to previous versions.

You can download SPAdes 2.5 here.

QUAST 2.2 released

QUAST now supports evaluation of metagenomic assemblies. The tool accepts multiple references, and produces several reports:
— for all contigs and all input genomes merged into one,
— separate reports for only contigs aligned to a particular genome,
— for the contigs not aligned to any reference provided.

Usage:

metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...

All other options for metaquast.py are the same as for quast.py.
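For scripted runs, the usage line above can be assembled programmatically. The following is a minimal Python sketch that builds the metaquast.py argument list from lists of files; the helper name `metaquast_command` and all file names are hypothetical.

```python
import shlex


def metaquast_command(contigs, references):
    """Return the argument list for a metaquast.py run.

    contigs: assembly files to evaluate; references: reference genome files.
    """
    if not contigs or not references:
        raise ValueError("need at least one assembly and one reference")
    # -R takes all references as a single comma-separated value.
    return ["metaquast.py", *contigs, "-R", ",".join(references)]


# Placeholder file names for illustration only.
cmd = metaquast_command(
    ["contigs_1.fasta", "contigs_2.fasta"],
    ["ref_1.fasta", "ref_2.fasta", "ref_3.fasta"],
)
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) avoids quoting problems when it is later handed to `subprocess.run`.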

In addition, MetaGeneMark is used to find genes in metagenomic assemblies: by default in metaquast.py, and with the --meta option in quast.py.

Other changes include a new option --labels (or -l) for providing human-readable assembly names. Those names will be used in reports, plots and logs, instead of file names. For example:

-l SPAdes,IDBA-UD

If your labels include spaces, use quotes:

-l SPAdes,"Assembly 2",Assembly3

-l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"
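To see what the quoting in these examples does, here is a small Python sketch that tokenizes them the way a POSIX shell would, using the standard `shlex` module. The helper `labels_args` is hypothetical; it shows how the same value could be passed from a script without any shell quoting. How QUAST itself treats spaces after the commas is not covered here.

```python
import shlex

# How a POSIX shell splits the two quoted examples above into arguments.
# In both cases the quotes disappear and -l receives a single value.
print(shlex.split('-l SPAdes,"Assembly 2",Assembly3'))
print(shlex.split('-l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"'))


def labels_args(labels):
    """Return ['-l', value] for use in a subprocess argument list.

    No quoting is needed here because no shell is involved.
    """
    for label in labels:
        if "," in label:
            # A comma inside a label would be read as a separator.
            raise ValueError(f"label {label!r} contains a comma")
    return ["-l", ",".join(labels)]


print(labels_args(["SPAdes", "Assembly 2", "Assembly3"]))
```

The first `shlex.split` call and `labels_args` produce the same two-element argument list, which is the point of the quotes: the whole label list must reach the tool as one argument.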

One more important change: the --allow-ambiguity option has been replaced by a new option, --ambiguity-usage (-a), which lets you specify how ambiguous regions are processed: -a one, -a all or -a none.

We also fixed some bugs in the misassembly detection algorithm.

You can download QUAST 2.2 here.

Seminar at Repino

At the beginning of June, the Lab participated in a joint bioinformatics seminar at Repino, together with the Dobzhansky Center for Genome Bioinformatics (St. Petersburg State University). Guided by Dr. Stephen O'Brien, the researchers discussed their projects, shared knowledge and experience, and planned collaboration.

The paper by McLean JS et al. on P. gingivalis assembly using SPAdes was recognized as the top research paper by F1000

Faculty 1000 recognized “Genome of the pathogen Porphyromonas gingivalis recovered from a biofilm in a hospital sink using a high-throughput single-cell genomics platform” as the top research paper. The authors used SPAdes to perform the single-cell assembly.

The paper was recommended as being of special significance in its field by F1000 Faculty Member Edward Feil. You can read Dr Feil's recommendation at http://f1000.com/prime/718011804?subscriptioncode=c3554e8b-dea8-45c0-bc5.... It requires a subscription to F1000Prime, but it's possible to activate a 3-month subscription to the site via the link.

Assembling Long Illumina Paired-End Reads (2x150 and 2x250) with SPAdes

Submitted by akorobeynikov on 6 May 2013, Mon, 17:46

Introduction

Recent advances in DNA sequencing technology have led to a rapid increase in read length. It is now common to have a dataset consisting of 2x150 or 2x250 paired-end reads produced by Illumina MiSeq or HiSeq2500. However, longer reads alone will not automatically improve assembly quality: you need an assembler that can make use of their advantages.

Because SPAdes iterates over several k-mer lengths, it can benefit from the full potential of long paired-end reads. Currently the assembler options have to be set manually, but we plan to add automatic calculation of the necessary options soon.

Please note that not only the read length but also the insert length matters a lot. It is suboptimal to sequence a 300 bp fragment as a pair of 250 bp reads. We suggest using 350-500 bp fragments with 2x150 reads and 550-700 bp fragments with 2x250 reads.
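To make the fragment-length advice concrete, here is a minimal Python sketch that computes how many bases of a fragment are sequenced by both mates and checks a fragment length against the ranges suggested above. The function names are hypothetical, and the overlap calculation assumes that reads are trimmed to the fragment (no adapter read-through).

```python
# Fragment-length ranges (bp) suggested in the text, keyed by read length.
SUGGESTED_FRAGMENT_BP = {150: (350, 500), 250: (550, 700)}


def read_overlap(read_length, fragment_length):
    """Number of fragment bases covered by both mates of a pair."""
    overlap = 2 * read_length - fragment_length
    # No overlap if the mates do not meet; never more than the fragment itself.
    return min(fragment_length, max(0, overlap))


def fragment_in_suggested_range(read_length, fragment_length):
    """True if the fragment length falls within the suggested range."""
    low, high = SUGGESTED_FRAGMENT_BP[read_length]
    return low <= fragment_length <= high


# The suboptimal case from the text: a 300 bp fragment with 2x250 reads.
print(read_overlap(250, 300))                 # 200 of 300 bases are read twice
print(fragment_in_suggested_range(250, 300))  # False
print(fragment_in_suggested_range(250, 600))  # True
```

In the 300 bp case, two thirds of each fragment is sequenced twice, so the pair spans barely more than a single read would.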

Multi-cell dataset with read length 2 x 150

General rules

Make sure your reads are corrected prior to assembly with Quake (recommended), or BayesHammer (integrated into SPAdes pipeline).
The default selection of k-mer lengths is 21, 33, 55, which might work well. If you have enough coverage (50x+), you may want to try k-mer lengths of 21, 33, 55, 77.
Make sure you run the assembler in ‘Careful’ mode to minimize the number of mismatches in the final contigs (you can try non-careful mode as well; since SPAdes 2.5 it might give an acceptable mismatch rate).
We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.

spades.py command line

For reads corrected prior to assembly, run: spades.py -k 21,33,55,77 --careful --only-assembler <your reads>
For non-corrected reads, run: spades.py -k 21,33,55,77 --careful <your reads>
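The rules and command lines above for 2x150 data can be combined into a small helper. This is a minimal Python sketch, not part of SPAdes: the function names are hypothetical, the 50x threshold is taken from the rules above, and the `-1`/`-2` read arguments and file names in the example are placeholders for whatever you would put in place of <your reads>.

```python
def kmers_for_2x150(coverage):
    """Pick k-mer lengths for 2x150 reads following the rules above."""
    ks = [21, 33, 55]      # default selection
    if coverage >= 50:     # enough coverage to try a larger k
        ks.append(77)
    return ks


def spades_command_2x150(read_args, coverage, reads_corrected):
    """Build the spades.py argument list for a multi-cell 2x150 dataset."""
    k_value = ",".join(str(k) for k in kmers_for_2x150(coverage))
    cmd = ["spades.py", "-k", k_value, "--careful"]
    if reads_corrected:
        # Reads were already corrected (e.g. with Quake), so skip correction.
        cmd.append("--only-assembler")
    return cmd + list(read_args)


# Placeholder read arguments for illustration only.
reads = ["-1", "reads_R1.fastq", "-2", "reads_R2.fastq"]
print(spades_command_2x150(reads, coverage=80, reads_corrected=True))
print(spades_command_2x150(reads, coverage=30, reads_corrected=False))
```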

Multi-cell dataset with read lengths 2 x 250

General rules

Make sure your reads are corrected prior to assembly with Quake (recommended), or BayesHammer (integrated into SPAdes pipeline).

By default we suggest increasing k-mer lengths in increments of 22 until the k-mer length reaches 127. The exact k-mer length depends on the coverage: a k-mer length of 127 is appropriate for k-mer coverage of 50x and higher.
Make sure you run the assembler in ‘Careful’ mode to minimize the number of mismatches in the final contigs (you can try non-careful mode as well; since SPAdes 2.5 it might give an acceptable mismatch rate).

We recommend checking the SPAdes log file at the end of each iteration to monitor the average coverage of the contigs.
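The rules above refer to k-mer coverage without defining it. A commonly used relation between base coverage C, read length L and k-mer length k is C_k = C * (L - k + 1) / L. The Python sketch below applies it to 2x250 reads; whether the coverage reported in the SPAdes log follows exactly this definition is an assumption, so treat the numbers as a rough guide.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Estimate k-mer coverage from base coverage: C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


def base_coverage_needed(target_kmer_coverage, read_length, k):
    """Base coverage required to reach a given k-mer coverage."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return target_kmer_coverage * read_length / (read_length - k + 1)


# With 250 bp reads and k = 127, each read contains 124 k-mers,
# so k-mer coverage is roughly half of the base coverage.
print(kmer_coverage(100, 250, 127))
print(base_coverage_needed(50, 250, 127))
```

Under this relation, reaching 50x k-mer coverage at k = 127 with 250 bp reads takes about 100x base coverage.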

spades.py command line

For reads corrected prior to assembly, run: spades.py -k 21,33,55,77,99,127 --careful --only-assembler <your reads>
For non-corrected reads, run: spades.py -k 21,33,55,77,99,127 --careful <your reads>

Single-cell dataset with read lengths 2 x 150 or 2 x 250

The default options are recommended.
However, it might be tricky to take full advantage of your long reads. Consider contacting us for more information and to discuss an assembly strategy.

Union-Tribune spotlights the P. gingivalis paper

Submitted by kira on 14 Apr 2013, Sun, 23:04

The San Diego Union-Tribune published an article on the P. gingivalis paper (McLean et al., 2013). Congratulations to Sergey Nurk, who contributed significantly to and co-authored this work!