Summer School for Bioinformatics

The Summer School of the Bioinformatics Institute took place in Moscow at the beginning of August.

The school was dedicated to genome and transcriptome analysis, NGS, epigenetics, comparative genomics and molecular evolution. In addition to general lectures on bioinformatics, there were special courses that aimed to improve the programming skills of biologists and, conversely, to teach computer scientists molecular biology and biotechnology.

Biologists and computer scientists

Lab members Alla Lapidus, Andrey Prjibelsky and Pavel Pevzner took part in preparing the school curriculum. Moreover, lectures were given by Sergey Nurk, Alexey Gurevich, Anton Korobeynikov, Alla Lapidus and Andrey Prjibelsky.

Conference and a School in Novosibirsk

The international conference HSG-2013 on high-throughput sequencing was held in Novosibirsk from July 21 to 25, 2013. Lab members Alla Lapidus and Andrey Prjibelsky were invited speakers at the conference:

  • Alla Lapidus: «Genome assembly and finishing–why high quality references are needed»,
  • Andrey Prjibelsky: «Genome draft assembly algorithms: from the very beginning till present-day problems».

Right before the conference, a youth scientific and practical school on genomic sequencing and data analysis took place. Alla Lapidus gave a lecture on the practical application of new sequencing technologies; Andrey Prjibelsky explained an easy way to assemble a genome from NGS data in 30 minutes.

SPAdes 2.5 released: now supports multiple read-pair libraries

Version 2.5.0 of the SPAdes single-cell assembler has just been released.

The main feature of the new version is support for multiple paired-end and mate-pair libraries. Using the command line interface, you can specify up to 5 libraries of each type and, as before, an unlimited number of single-read libraries. If you need to assemble more libraries simultaneously, write to our support.
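As a rough illustration of the library limit, the sketch below builds the library part of a SPAdes command line in Python and rejects more than 5 libraries of one type. The helper name `spades_library_args` is hypothetical, and the `--pe<N>-1`/`--pe<N>-2` and `--mp<N>-1`/`--mp<N>-2` flag spellings are assumptions that are not stated in this post; check them against the SPAdes manual for your version.

```python
MAX_LIBRARIES_PER_TYPE = 5  # limit stated in the release note above


def spades_library_args(paired_end=(), mate_pair=()):
    """Build library arguments from lists of (left_reads, right_reads) pairs.

    Hypothetical helper; the flag spellings are an assumption, not taken
    from this post.
    """
    args = []
    for prefix, libraries in (("pe", paired_end), ("mp", mate_pair)):
        if len(libraries) > MAX_LIBRARIES_PER_TYPE:
            raise ValueError(
                f"at most {MAX_LIBRARIES_PER_TYPE} '{prefix}' libraries are supported"
            )
        # Libraries are numbered from 1 within each type.
        for number, (left, right) in enumerate(libraries, start=1):
            args += [f"--{prefix}{number}-1", left, f"--{prefix}{number}-2", right]
    return args


library_args = spades_library_args(
    paired_end=[("a_1.fastq", "a_2.fastq"), ("b_1.fastq", "b_2.fastq")],
    mate_pair=[("m_1.fastq", "m_2.fastq")],
)
print(" ".join(library_args))
```

The file names are placeholders. The returned list would be appended to the rest of the assembler invocation.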

We also changed the repeat resolution strategy, which decreased mismatch and indel rates compared to previous versions.
 
You can download SPAdes 2.5 here.

QUAST 2.2 released

QUAST now supports evaluation of metagenomic assemblies. The tool accepts multiple references, and produces several reports:
  — for all contigs and all input genomes merged into one,
  — separate reports for only contigs aligned to a particular genome,
  — for the contigs not aligned to any reference provided.

Usage:
       metaquast.py contigs_1 contigs_2 ... -R reference_1,reference_2,reference_3,...
 
All other options for metaquast.py are the same as for quast.py.
In addition, MetaGeneMark is used for finding genes in metagenomic assemblies: by default in metaquast.py, and with the --meta option in quast.py.
 
Other changes include a new option --labels (or -l) for providing human-readable assembly names. Those names will be used in reports, plots and logs, instead of file names. For example:

   -l SPAdes,IDBA-UD

If your labels include spaces, use quotes:  

   -l SPAdes,"Assembly 2",Assembly3

   -l "SPAdes 2.5, SPAdes 2.4, IDBA-UD"

One more important change: the --allow-ambiguity option has been replaced by a new option, --ambiguity-usage (-a), which lets you specify how ambiguous regions are processed: -a one, -a all or -a none.
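For scripted runs, the options described above can be assembled into an argument list. This is only a sketch: `metaquast_command` is a hypothetical helper, the file names are placeholders, and only the `-R`, `-l` and `-a` options mentioned in this post are used.

```python
import shlex

AMBIGUITY_MODES = ("one", "all", "none")  # values listed in the release note


def metaquast_command(contigs, references, labels=None, ambiguity_usage=None):
    """Return the argument list for a metaquast.py run (hypothetical helper)."""
    if not contigs or not references:
        raise ValueError("need at least one assembly and one reference")
    command = ["metaquast.py", *contigs, "-R", ",".join(references)]
    if labels is not None:
        if len(labels) != len(contigs):
            raise ValueError("one label per assembly is expected")
        # Labels are comma-separated, so a label must not contain a comma.
        command += ["-l", ",".join(labels)]
    if ambiguity_usage is not None:
        if ambiguity_usage not in AMBIGUITY_MODES:
            raise ValueError(f"ambiguity usage must be one of {AMBIGUITY_MODES}")
        command += ["-a", ambiguity_usage]
    return command


command = metaquast_command(
    ["spades.fasta", "idba.fasta"],
    ["ref_1.fasta", "ref_2.fasta"],
    labels=["SPAdes 2.5", "IDBA-UD"],
    ambiguity_usage="one",
)
# shlex.join (Python 3.8+) adds the quotes a shell needs for "SPAdes 2.5".
print(shlex.join(command))
```

Passing the list directly to `subprocess.run` avoids shell quoting issues for labels that contain spaces.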

We also fixed some bugs in the misassembly detection algorithm.

You can download QUAST 2.2 here.

SPAdes 2.5 on GAGE-B data sets

The recently published GAGE-B paper (Magoc, et al., 2013) presents an evaluation of several popular assemblers, including SPAdes 2.3.

Since SPAdes 2.5 is out, we evaluated it on the data sets from the GAGE-B study. Four MiSeq data sets (B. cereus, R. sphaeroides, M. abscessus, and V. cholerae) were selected for the assessment. Since these reads are 250 bp long, we applied our recommendations for assembling long Illumina paired-end reads with SPAdes.

The original coverage of those data sets is about 500x. However, in the GAGE-B experiments, all data was down-sampled to 100x coverage, because higher coverage barely affected contig size. Meanwhile, SPAdes benefits from high coverage, so we decided to assemble the data sets with the original ~500x coverage. Our tables contain the GAGE-B assemblies of 100x-coverage data, and the SPAdes 2.5 assembly of 500x-coverage data.
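The arithmetic behind down-sampling can be sketched as follows. The genome size and read count are hypothetical round numbers chosen to give 500x coverage; they are not the actual GAGE-B data set sizes.

```python
def coverage(read_count, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return read_count * read_length / genome_size


def keep_fraction(original_coverage, target_coverage):
    """Fraction of reads to keep to reach the target coverage."""
    if target_coverage >= original_coverage:
        return 1.0  # nothing to discard
    return target_coverage / original_coverage


# Hypothetical example: 10 million 250 bp reads on a 5 Mb genome.
depth = coverage(read_count=10_000_000, read_length=250, genome_size=5_000_000)
fraction = keep_fraction(depth, 100)
print(depth, fraction)  # 500.0 0.2
```

Under these assumptions, down-sampling from 500x to 100x keeps one read in five.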

The B. cereus data was downloaded from the official Illumina website. The other three data sets were obtained from the Sequence Read Archive at NIH’s National Center for Biotechnology Information (NCBI): SRR522246 (R. sphaeroides), SRR768269 (M. abscessus), SRR769320 (V. cholerae). Genome references and contigs produced by other assemblers mentioned in the GAGE-B study were downloaded from the GAGE-B website.

Four tables of results are presented below. For our tables, we used the format that was presented in the Supplementary Material of Magoc, et al., 2013. We used the QUality ASsessment Tool (QUAST) to calculate the same metrics used in the GAGE-B paper. The GAGE-B paper used slightly different names than QUAST for some metrics; below, for each metric, we list the GAGE-B name, and indicate the QUAST name in brackets.
  1. Num, the number of contigs (or scaffolds) at least 200bp long (500bp for scaffolds). [# contigs]

  2. N50 size, which is the size of the smallest contig such that 50% of the genome is contained in contigs of size N50 or larger. [NG50]

  3. Errors, determined by comparison to the reference genome. We defined this as the sum of the number of relocations, translocations, and inversions affecting at least 1000bp. A relocation is defined as a misjoin in a contig/scaffold such that if the contig/scaffold is split into two pieces at the misjoin, then the left and right pieces map to distinct locations on the reference genome that are separated by at least 1000bp, or that overlap by at least 1000bp. A translocation is defined as a misjoin where the left and the right pieces map to different chromosomes or plasmids. An inversion is defined as a misjoin such that the left and the right pieces map to opposite strands on the same chromosome. [# misassemblies]

  4. Errors-L, local errors, defined as misjoins where the left and right pieces map onto the reference genome to distinct locations that are less than 1000bp apart, or that overlap by less than 1000bp. [# local misassemblies]

  5. N50Corr,  corrected N50 size, defined as the N50 size obtained after splitting contigs/scaffolds at each error. Note that local errors were not used for the purpose of calculating corrected N50 values. [NGA50]

  6. GenFrac, the fraction of the reference genome covered by contigs/scaffolds. [Genome Fraction]

  7. Unaligned, the number of unaligned contigs, computed as the number of contigs that MUMmer (Delcher, et al., 1999; Delcher, et al., 2002; Kurtz, et al., 2004) was not able to align, even partially, to the reference genome. [# unaligned]

  8. Duplication, duplication ratio, an approximation of the amount of overlaps among contigs/scaffolds that should have been merged. Failure to merge overlaps leads to overestimation of the genome size and creates two copies of sequences that exist in just one copy. [Duplication ratio]
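To make the N50, N50Corr and error definitions above concrete, here is a minimal Python sketch that follows the wording of the list. It is an illustration only, not the code used by QUAST or GAGE-B; the `Alignment` tuple and all function names are hypothetical, and the classifier assumes the misjoin has already been detected and that the left piece precedes the right piece on the reference.

```python
from collections import namedtuple

# Reference placement of one piece of a contig (hypothetical structure).
Alignment = namedtuple("Alignment", "chrom strand ref_start ref_end")


def ng50(contig_lengths, genome_length):
    """Smallest contig such that contigs of that size or larger cover 50% of the genome."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= genome_length / 2:
            return length
    return None  # the assembly covers less than half of the genome


def split_at_errors(contig_length, breakpoints):
    """Lengths of the pieces left after cutting a contig at each error position."""
    cuts = [0] + sorted(breakpoints) + [contig_length]
    return [end - start for start, end in zip(cuts, cuts[1:]) if end > start]


def classify_misjoin(left, right, threshold=1000):
    """Name the error type for the two pieces flanking a misjoin."""
    if left.chrom != right.chrom:
        return "translocation"
    if left.strand != right.strand:
        return "inversion"
    # Size of the gap (or overlap) between the two placements on the reference.
    distance = abs(right.ref_start - left.ref_end)
    return "relocation" if distance >= threshold else "local"


lengths = [80, 70, 50]
n50 = ng50(lengths, genome_length=300)
# N50Corr: cut the 80 bp contig at an error at position 30, then recompute.
corrected = ng50(split_at_errors(80, [30]) + [70, 50], genome_length=300)
print(n50, corrected)  # 70 50
```

With toy contigs of 80, 70 and 50 bp and a 300 bp genome, splitting the longest contig at one error lowers the value from 70 to 50, which is the effect N50Corr is meant to capture.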

It is important to note that Magoc, et al., 2013 used QUAST 1.3 to assess the quality of the assemblies. We used the latest version of QUAST, 2.2, so some statistics in the tables may differ slightly from those in the GAGE-B Supplementary Material. The main difference between these versions is in how Genome Fraction is computed. QUAST 2.* filters MUMmer's alignments to keep only the best ones. Roughly speaking, it skips ambiguous and redundant alignments and keeps one alignment (or, in case of a misassembly, one set of non-overlapping or slightly overlapping alignments) per contig. QUAST 1.* uses all of MUMmer's alignments to compute Genome Fraction. The Duplication ratio metric is also affected by this change. In addition, several bugs in QUAST were fixed, which affects the detection of misassemblies (and thus the Errors, Errors-L, and N50Corr statistics). See the QUAST changelog for more details.
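The effect of alignment filtering on Genome Fraction and Duplication ratio can be illustrated with a toy calculation. This is a simplified sketch, not QUAST's implementation: alignments are reduced to half-open reference intervals, Genome Fraction is taken as covered reference bases over genome length, and Duplication ratio as total aligned bases over covered reference bases.

```python
def covered_bases(intervals):
    """Count reference positions covered by at least one half-open interval."""
    covered = 0
    block_start = block_end = None
    for start, end in sorted(intervals):
        if block_end is None or start > block_end:
            # The previous merged block is finished; add its length.
            if block_end is not None:
                covered += block_end - block_start
            block_start, block_end = start, end
        else:
            block_end = max(block_end, end)
    if block_end is not None:
        covered += block_end - block_start
    return covered


def genome_fraction(intervals, genome_length):
    return covered_bases(intervals) / genome_length


def duplication_ratio(intervals):
    aligned = sum(end - start for start, end in intervals)
    return aligned / covered_bases(intervals)


# Toy 1000 bp genome. The last interval is a second, ambiguous placement
# of part of the first contig (for example, a repeat copy).
all_alignments = [(0, 400), (300, 700), (800, 1000)]
best_alignments = [(0, 400), (300, 700)]  # ambiguous placement dropped

for name, intervals in (("all", all_alignments), ("best", best_alignments)):
    print(name, genome_fraction(intervals, 1000), round(duplication_ratio(intervals), 3))
```

In this toy case, counting every alignment reports 90% of the genome as covered, while keeping one placement per contig reports 70%, so both metrics shift when the filtering changes.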

The finished references and MiSeq reads used to assemble B. cereus and R. sphaeroides (Magoc, et al., 2013) correspond to exactly the same strain of each microorganism. The references used for M. abscessus and V. cholerae, however, belong to similar but distinct strains. It is therefore possible that some of the differences between the de novo assembled contigs of M. abscessus and V. cholerae and the corresponding genome references represent true differences rather than errors.

Click on the "Contigs" or "Scaffolds" links on the left side of each table to see the QUAST-generated web report.

 

Table 1. Assemblies of B. cereus (download contigs, scaffolds)

                         ABySS  CABOG  MaSuRCA  MIRA   SGA    SOAPdenovo  SPAdes 2.5  Velvet
Contigs    Num           115    78     90       153    3335   105         513         404
           N50 (kb)      130.6  155.4  246.7    116.5  25.5   246.3       485.5       24.5
           Errors        2      5      9        9      17     0           0           3
           Errors-L      25     6      11       14     9      21          7           11
           N50Corr (kb)  130.6  150.5  246.7    100.0  25.5   246.3       485.5       24.5
           GenFrac (%)   98.6   99.3   99.2     99.2   98.9   98.3        98.8        97.8
           Unaligned     1      0      0        4      4      1           335         1
           Duplication   1.0    1.0    1.0      1.0    1.1    1.0         1.0         1.0

Scaffolds  Num           74     33     61       n/a    341    56          53          78
           N50 (kb)      135.6  431.5  337.9    n/a    25.5   456.6       485.5       247.7
           Errors        3      10     12       n/a    1      0           0           7
           Errors-L      29     12     13       n/a    1      41          9           263
           N50Corr (kb)  135.3  288.2  337.9    n/a    25.5   456.0       485.5       208.4
           GenFrac (%)   98.4   99.3   99.2     n/a    97.6   98.3        98.8        97.8
           Unaligned     0      0      0        n/a    0      1           12          1
           Duplication   1.0    1.0    1.0      n/a    1.0    1.0         1.0         1.0

 

 

Table 2. Assemblies of R. sphaeroides (download contigs, scaffolds)

                         ABySS  CABOG  MaSuRCA  MIRA   SGA    SOAPdenovo  SPAdes 2.5  Velvet
Contigs    Num           486    146    63       867    986    437         95          416
           N50 (kb)      21.4   31.5   130.7    15.8   9.1    33.5        518.1       24.0
           Errors        2      6      5        20     5      1           7           3
           Errors-L      3      3      5        7      3      11          7           9
           N50Corr (kb)  21.4   30.4   130.7    15.4   9.1    33.5        518.1       24.0
           GenFrac (%)   98.4   85.6   92.0     99.3   98.9   98.3        99.5        97.9
           Unaligned     0      0      1        0      3      19          50          1
           Duplication   1.0    1.0    1.0      1.0    1.0    1.0         1.0         1.0

Scaffolds  Num           382    131    52       n/a    733    185         40          143
           N50 (kb)      21.4   40.3   144.8    n/a    8.0    45.1        518.1       85.3
           Errors        2      6      5        n/a    0      2           9           16
           Errors-L      3      7      8        n/a    2      216         8           189
           N50Corr (kb)  21.4   36.1   144.8    n/a    8.0    45.0        518.1       85.0
           GenFrac (%)   97.8   85.6   91.9     n/a    88.2   98.2        99.5        97.6
           Unaligned     0      0      0        n/a    0      0           1           0
           Duplication   1.0    1.0    1.0      n/a    1.0    1.0         1.0         1.0

 

 

Table 3. Assemblies of M. abscessus (download contigs, scaffolds)

                         ABySS  CABOG  MaSuRCA  MIRA   SGA    SOAPdenovo  SPAdes 2.5  Velvet
Contigs    Num           210    857    326      1760   1117   113         898         279
           N50 (kb)      70.4   8.7    38.2     114.1  13.3   131.6       313.5       48.2
           Errors        2      122    70       2356   180    5           13          76
           Errors-L      2      5      2        37     4      19          7           3
           N50Corr (kb)  68.5   8.3    37.2     75.0   12.8   113.3       280.1       41.5
           GenFrac (%)   99.2   96.2   98.4     99.4   99.4   99.2        99.4        99.1
           Unaligned     11     5      1        78     8      2           850         52
           Duplication   1.0    1.0    1.1      1.2    1.0    1.0         1.0         1.0

Scaffolds  Num           147    847    324      n/a    664    79          410         154
           N50 (kb)      73.2   9.1    38.2     n/a    13.3   152.6       313.5       71.0
           Errors        2      123    70       n/a    6      5           12          103
           Errors-L      3      13     2        n/a    1      31          7           36
           N50Corr (kb)  70.1   8.5    37.2     n/a    12.8   147.2       280.1       51.7
           GenFrac (%)   98.9   96.2   98.4     n/a    99.1   99.1        99.4        99.0
           Unaligned     0      5      1        n/a    4      1           367         1
           Duplication   1.0    1.0    1.1      n/a    1.0    1.0         1.0         1.0

 

 

Table 4. Assemblies of V. cholerae (download contigs, scaffolds)

                         ABySS  CABOG  MaSuRCA  MIRA   SGA    SOAPdenovo  SPAdes 2.5  Velvet
Contigs    Num           267    241    173      431    1726   244         1800        201
           N50 (kb)      60.5   32.8   76.1     112.9  27.3   71.4        356.1       92.0
           Errors        2      17     19       109    77     12          5           14
           Errors-L      0      7      3        12     3      44          7           2
           N50Corr (kb)  60.3   32.8   76.1     108.7  27.3   68.2        356.1       63.6
           GenFrac (%)   97.2   97.0   97.7     98.5   98.3   97.4        98.0        97.8
           Unaligned     2      1      0        21     6      5           1712        1
           Duplication   1.0    1.0    1.0      1.0    1.1    1.0         1.0         1.0

Scaffolds  Num           196    241    163      n/a    309    165         937         138
           N50 (kb)      60.5   32.8   76.1     n/a    27.3   91.9        356.1       110.0
           Errors        2      17     19       n/a    2      14          5           22
           Errors-L      0      7      3        n/a    1      77          6           13
           N50Corr (kb)  60.3   32.8   76.1     n/a    27.3   89.8        356.1       67.1
           GenFrac (%)   96.7   97.0   97.7     n/a    95.7   97.1        97.9        97.6
           Unaligned     1      1      0        n/a    0      2           880         1
           Duplication   1.0    1.0    1.0      n/a    1.0    1.0         1.0         1.0

 

Seminar at Repino

At the beginning of June, the Lab participated in a joint bioinformatics seminar at Repino, together with the Dobzhansky Center for Genome Bioinformatics (St. Petersburg State University). Guided by Dr. Stephen O'Brien, the researchers discussed their projects, shared knowledge and experience, and planned collaboration.

Paper by McLean JS et al. on P. gingivalis assembly using SPAdes recognized as the top research paper by F1000

Faculty 1000 (F1000) recognized “Genome of the pathogen Porphyromonas gingivalis recovered from a biofilm in a hospital sink using a high-throughput single-cell genomics platform” as the top research paper. The authors used SPAdes to perform single-cell assembly.

The paper was recommended as being of special significance in its field by F1000 Faculty Member Edward Feil. You can read Dr Feil's recommendation at http://f1000.com/prime/718011804?subscriptioncode=c3554e8b-dea8-45c0-bc5.... It requires a subscription to F1000Prime, but it's possible to activate a 3-month subscription to the site via the link.

Vladislav Saveliev

vladsaveliev@me.com
 
Education:
2011 — present: MS student of Software Engineering department, St. Petersburg State University of Academy of Sciences, Russia
2010 — 2011: MS student of Geoinformatics department, St. Petersburg State University of Information Technologies, Mechanics and Optics, Russia
2006 — 2010: Bachelor's degree in Computer Science and Computing Hardware, St. Petersburg State University of Information Technologies, Mechanics and Optics, Russia
 
Projects:
1. QUAST
2. Metagenomic assembly

Ksenia Krasheninnikova

Personal: krasheninnikova@gmail.com

 

Education:

2006 - 2011 Togliatti State University, Mathematics and Informatics Faculty, Department of Applied Mathematics and Informatics, Diploma in Applied Informatics

2011 - 2013 St.Petersburg Academic University, Department of Applied Mathematics and Informational Technologies, MS in Bioinformatics

 

Current project:

Use of uneven read coverage depth in bacterial single-cell repeat resolution

 

 


 

Andrey Balandin

Education:

  • 1998 — 2004: South Ural State University, Chelyabinsk, Russia.

Projects:
