
Logistics

About St. Petersburg and venues.

Computational Proteomics

Our research in computational proteomics lies mainly in the area of top-down mass spectrometry, a novel and highly promising technology for acquiring mass spectra. In contrast to the traditional bottom-up approach, it does not require protein digestion prior to the tandem mass spectrometry step. Analysis of intact proteins offers certain advantages, such as the ability to detect post-translational modifications in a coordinated fashion and to identify multiple protein species.

 

Researchers

Kira Vyatkina

Alumni 

Sonya Alexandrova

Mikhail Dvorkin

Yakov Sirotkin 

Interns (Summer 2011):

Maxim Gladkikh

Yuri Zemlyanskiy

Andrey Lushnikov

Ilya Makeev

Student (Fall 2011):

Ksenia Krasheninnikova

 

Current projects

Tag generation for top-down mass spectra

(joint project with Pavel Pevzner’s lab at UCSD)

A peptide sequence tag (PST) is a short sequence of amino acids. In bottom-up mass spectrometry, PSTs are successfully used for spectra interpretation; in the top-down case, however, their generation and usage have not yet been sufficiently explored. In this project, we propose and analyze methods of PST generation for top-down spectra and indicate their potential applications to spectra identification and mixed spectra interpretation.
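As a rough illustration of the general idea (not the method from the paper), tags can be read off a deconvoluted spectrum by chaining peaks whose mass differences match amino acid residue masses. The residue subset, the tolerance, and the function names below are assumptions made for this sketch.

```python
# Monoisotopic residue masses in Da for a small, illustrative subset of amino acids.
# Leucine and isoleucine have identical masses, so "L" stands for either.
RESIDUE_MASSES = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "V": 99.06841,
    "L": 113.08406,
}
MAX_RESIDUE_MASS = max(RESIDUE_MASSES.values())


def residues_for(delta, tol):
    """Return the residues whose mass equals delta within tol (Da)."""
    return [aa for aa, mass in RESIDUE_MASSES.items() if abs(mass - delta) <= tol]


def generate_tags(masses, length=3, tol=0.01):
    """Enumerate all tags of the given length supported by consecutive mass gaps."""
    peaks = sorted(masses)
    tags = set()

    def extend(i, tag):
        if len(tag) == length:
            tags.add(tag)
            return
        for j in range(i + 1, len(peaks)):
            delta = peaks[j] - peaks[i]
            if delta > MAX_RESIDUE_MASS + tol:
                break  # peaks are sorted, so later gaps are only larger
            for aa in residues_for(delta, tol):
                extend(j, tag + aa)

    for start in range(len(peaks)):
        extend(start, "")
    return sorted(tags)


# Hypothetical deconvoluted masses: the gaps spell G, A, V; 500.0 is a noise peak.
print(generate_tags([100.0, 157.02146, 228.05857, 327.12698, 500.0]))  # ['GAV']
```

A real tag generator would also have to score tags and cope with missing peaks and mass errors, which this sketch ignores.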

 

Paper:

Yakov Sirotkin, Xiaowen Liu, Maxim Gladkikh, Pavel Pevzner and Kira Vyatkina, “Peptide Sequence Tags for Top-Down Spectra”. (accepted to RECOMB CP 2012)

 

Software:

MS-Align+Tag (download)

 

Error correction for top-down mass spectra

(joint project with Pavel Pevzner’s lab at UCSD)

Spectrum interpretation starts with the retrieval of isotopomer envelopes from a given spectrum, followed by the derivation of monoisotopic masses from those envelopes. The result is a deconvoluted spectrum. However, ±1 Da errors are often observed in the masses composing deconvoluted spectra, which can cause serious problems in subsequent spectrum identification. The goal of this project is to eliminate errors of this kind.
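To make the ±1 Da problem concrete: neighbouring peaks in an isotopomer envelope are roughly 1.00335 Da apart (the 13C − 12C mass difference), so choosing the wrong peak as monoisotopic shifts the reported mass by about one dalton. The sketch below only shows how such a shift can be detected against a known theoretical mass; it is not the correction method developed in this project, and the function name and tolerance are assumptions.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da


def best_isotope_shift(observed, theoretical, tol=0.02):
    """Return k in {0, -1, +1} if observed - k * spacing matches theoretical, else None."""
    for k in (0, -1, 1):
        if abs((observed - k * ISOTOPE_SPACING) - theoretical) <= tol:
            return k
    return None


# Hypothetical masses: the observed value is one isotope peak too heavy.
print(best_isotope_shift(1001.00335, 1000.0))  # 1
```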

 

Interpretation of mass spectra of substances resulting from chemical experiments

(joint project with Laboratory of Nanobiotechnologies, Academic University, headed by Corr. Mem. of RAS M.V. Dubina)

The goal of this project is to interpret mass spectra of substances that are expected to contain peptides. Such a hypothesis can be confirmed by retrieving the alphabet of amino acids composing the peptides present in a substance and then explaining the given mass spectra.

 

Interpretation of multiplex mass spectra

Some mass spectra turn out to be produced from a mixture of proteins rather than from a single protein. They are usually referred to as mixed, or multiplex. This project aims to find a method for interpreting such spectra.

 

Completed projects

Protein identification using top-down spectra

(joint project with Pavel Pevzner’s lab at UCSD)

This project was devoted to the development of a fast method for top-down protein identification that allows searching for unexpected post-translational modifications. The proposed algorithm, MS-Align+, performs significantly better than previously existing approaches on two top-down datasets used for benchmarking such software tools.

 

Paper:

Xiaowen Liu, Yakov Sirotkin, Yufeng Shen, Gordon Anderson, Yihsuan S. Tsai, Ying S. Ting, David R. Goodlett, Richard D. Smith, Vineet Bafna and Pavel A. Pevzner, “Protein identification using top-down spectra”. Molecular & Cellular Proteomics, 2011 Oct 25. [Epub ahead of print]

 

Protein morphing

(joint project with Pavel Pevzner’s lab at UCSD, Burnham Institute for Medical Research, and Joint Center for Structural Genomics, Bioinformatics Core)

Within this project, we developed an efficient algorithm for protein morphing based on linear interpolation and implemented it as a web server. 
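For readers unfamiliar with the term, the core of linear-interpolation morphing can be sketched as follows. This is a minimal illustration, not the MORPH-PRO algorithm itself: it assumes the two conformations list the same atoms in the same order and are already superimposed, and the function name is hypothetical.

```python
def interpolate_frames(start, end, n_frames):
    """Return n_frames conformations going linearly from start to end (both included).

    start and end are lists of (x, y, z) tuples for the same atoms in the same order.
    """
    if len(start) != len(end):
        raise ValueError("conformations must contain the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames (start and end)")

    frames = []
    for f in range(n_frames):
        t = f / (n_frames - 1)  # interpolation parameter in [0, 1]
        frame = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Two hypothetical two-atom conformations and the midpoint between them.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 2.0, 0.0), (1.0, 0.0, 4.0)]
print(interpolate_frames(start, end, 3)[1])  # [(0.0, 1.0, 0.0), (1.0, 0.0, 2.0)]
```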

 

Paper:

Natalie E. Castellana, Andrey Lushnikov, Piotr Rotkiewicz, Natasha Sefcovic, Pavel A. Pevzner, Adam Godzik, and Kira Vyatkina, “MORPH-PRO: A Novel Algorithm and Web Server for Protein Morphing”. In Proc. of the 12th Workshop on Algorithms in Bioinformatics (WABI 2012), September 10-12, Ljubljana, Slovenia, LNCS 7534, Springer, 2012, 12pp. (to appear) (Appendix)

 

Collaboration

Our research in computational mass spectrometry is carried out in close collaboration with Pacific Northwest National Laboratory (PNNL).

 

 

 

 

 

BayesHammer

BayesHammer: Bayesian Clustering for Error Correction in Single-Cell Sequencing

Sergey I. Nikolenko, Anton I. Korobeynikov, Max A. Alekseyev
BMC Genomics 2013, 14(Suppl 1):S7

 

Error correction for sequenced reads remains difficult, especially for single-cell sequencing projects with extremely non-uniform coverage. We present the BayesHammer error correction tool that uses Bayesian subclustering to correct sequencing reads. While BayesHammer was designed for single-cell sequencing, we demonstrate that it also improves on state-of-the-art error correction tools for standard (multi-cell) sequencing data.

What is Single Cell Genomics?

Most bacteria in environments ranging from the human body to the ocean cannot be cloned in the laboratory and thus cannot be sequenced using existing Next Generation Sequencing (NGS) technologies. This represents the key bottleneck for various projects ranging from the Human Microbiome Project (HMP) [3, 6] to antibiotics discovery [9]. For example, the key question in the Human Microbiome Project is how bacteria interact with each other. These interactions are often conducted by various peptides that are produced either for communication with other bacteria or for killing them. However, peptidomics studies of the human microbiome are now limited since mass spectrometry (the key technology for such studies) requires knowledge of fairly complete proteomes. On the other hand, while studies of new peptide antibiotics would greatly benefit from DNA sequencing of genes coding for Non-Ribosomal Peptide Synthetases (NRPS) [11, 13], existing metagenomics approaches are unable to sequence these exceptionally long genes (over 60,000 nucleotides).

HMP and discovery of new antibiotics are just two examples of many projects that would be revolutionized by Single Cell Sequencing (SCS). Recent improvements in both experimental [4, 7, 8, 10] and computational [1] aspects of SCS have opened the possibility of sequencing bacterial genomes from single cells. In particular, [1] demonstrated that SCS can capture a large number of genes, sufficient for inferring the organism’s metabolism. In many applications (including proteomics and antibiotics discovery), having a great majority of genes captured is almost as useful as having complete genomes.

Currently, Multiple Displacement Amplification (MDA) is the dominant technology for single cell amplification [2]. However, MDA introduces extreme amplification bias (orders-of-magnitude difference in coverage between different regions) and gives rise to chimeric reads and read-pairs that complicate the ensuing assembly. Acknowledging the fact that existing assemblers were not designed to handle these complications, Rodrigue et al., 2009 [12] remarked that the challenges facing SCS are increasingly computational rather than experimental. A recent paper [5] illustrates that existing assemblers produce inferior results for single cell projects even when the goal is to assemble a single NRPS, let alone a complete genome.

Chitsaz et al., 2011 [1] introduced the E+V-SC assembler, combining parts of EULER-SR with a modified Velvet, and achieved a significant improvement in the quality of SCS. However, as the authors of E+V-SC realized, one needs to change the algorithmic design (rather than just modify existing tools like Velvet) to fully utilize the potential of SCS.

We present the SPAdes assembler, introducing a number of new algorithmic solutions and improving on state-of-the-art assemblers for both SCS and standard (multicell) bacterial datasets.

References:

  1. H. Chitsaz, J.L. Yee-Greenbaum, G. Tesler, M.J. Lombardo, C.L. Dupont, J.H. Badger, M. Novotny, D.B. Rusch, L.J. Fraser, N.A. Gormley, O. Schulz-Trieglaff, G.P. Smith, D.J. Evers, P.A. Pevzner, and R.S. Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol, 29(10):915–921, 2011.
  2. F.B. Dean, J.R. Nelson, T.L. Giesler, and R.S. Lasken. Rapid amplification of plasmid and phage DNA using phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res, 11(6):1095–1099, Jun 2001.
  3. S.R. Gill, M. Pop, R.T. Deboy, P.B. Eckburg, P.J. Turnbaugh, B.S. Samuel, J.I. Gordon, D.A. Relman, C.M. Fraser-Liggett, and K.E. Nelson. Metagenomic analysis of the human distal gut microbiome. Science, 312(5778):1355–1359, Jun 2006.
  4. J.P. Glotzbach, M. Januszyk, I.N. Vial, V.W. Wong, A. Gelbard, T. Kalisky, H. Thangarajah, M.T. Longaker, S.R. Quake, G. Chu, and G.C. Gurtner. An information theoretic, microfluidic-based single cell analysis permits identification of subpopulations among putatively homogeneous stem cells. PLoS One, 6(6):e21211, 2011.
  5. R.V. Grindberg, T. Ishoey, D. Brinza, E. Esquenazi, R.C. Coates, W.T. Liu, L. Gerwick, P.C. Dorrestein, P. Pevzner, R. Lasken, and W.H. Gerwick. Single cell genome amplification accelerates identification of the apratoxin biosynthetic pathway from a complex microbial assemblage. PLoS One, 6(4):e18565, 2011.
  6. M. Hamady and R. Knight. Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res, 19(7):1141–1152, Jul 2009.
  7. T. Ishoey, T. Woyke, R. Stepanauskas, M. Novotny, and R.S. Lasken. Genomic sequencing of single microbial cells from environmental samples. Current Opinion in Microbiology, 11(3):198–204, Jun 2008.
  8. S. Islam, U. Kjallquist, A. Moliner, P. Zajac, J.B. Fan, P. Lonnerberg, and S. Linnarsson. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res, 21(7):1160–1167, Jul 2011.
  9. J.W. Li and J.C. Vederas. Drug discovery and natural products: end of an era or an endless frontier? Science, 325(5937):161–165, Jul 2009.
  10. N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers, J. McIndoo, K. Cook, A. Stepansky, D. Levy, D. Esposito, L. Muthuswamy, A. Krasnitz, W.R. McCombie, J. Hicks, and M. Wigler. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94, Apr 2011.
  11. C. Rausch, T. Weber, O. Kohlbacher, W. Wohlleben, and D.H. Huson. Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res, 33(18):5799–5808, 2005.
  12. S. Rodrigue, R.R. Malmstrom, A.M. Berlin, B.W. Birren, M.R. Henn, and S.W. Chisholm. Whole genome amplification and de novo assembly of single bacterial cells. PLoS One, 4(9):e6864, 2009.
  13. S.A. Sieber and M.A. Marahiel. Molecular mechanisms underlying nonribosomal peptide synthesis: approaches to new antibiotics. Chem Rev, 105(2):715–738, Feb 2005.

SPAdes: Path-sets extension

PATH-SETS: A Novel Approach for Comprehensive Utilization of Mate-Pairs in Genome Assembly

Son Pham**, Dmitry Antipov**, Alexander Sirotkin, Glenn Tesler, Pavel Pevzner and Max Alekseyev.

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

Source code will be posted after the paper is accepted.

 

 



** - joint first authors

Sonya Alexandrova

 
Education:
  • St. Petersburg University of the Russian Academy of Sciences, MSc in Software Engineering,  2009-2011
  • St. Petersburg State Technical University, BSc in Technical Physics, 2005-2009
  • St. Petersburg Classical Gymnasium, 2001-2005
  • Highland View Middle School, OR, USA, 2000-2001
 
Work experience:
  • Academic University Algorithmic Biology Lab, research fellow (October 2011 - present)
  • Internship at the Abagyan Lab, UCSD (July - August 2011)
  • Internship at the Baker Lab, UW (August - September 2011)
  • Yandex Inc., software engineering intern (June 2010 - June 2011)
 
Scientific Interests:
Structural biology, Bioinformatics, Graph theory

 

SPAdes 2.4

 

SPAdes Assembler

SPAdes manual with installation guide (ver 2.4.0)

Download SPAdes.

Support e-mail: spades.support@bioinf.spbau.ru

 

 

 

SPAdes 2.4 is out! 

See all changes in changelog.

 

For the benchmark we used:

The E. coli K-12 MG1655 reference is 4639675 bp long, with 4324 annotated genes. Only contigs of 500 bp and longer were taken into consideration.

 

Assembly | NG50 | # contigs | Largest contig | Total length | # misassemblies | # mismatches per 100 kbp | # indels per 100 kbp | Genome fraction (%) | # genes

Single-cell E. coli
A5 | 14399 | 745 | 101584 | 4441145 | 8 | 11.68 | 0.17 | 89.681 | 3439
ABySS | 68534 | 179 | 178720 | 4345617 | 6 | 3.32 | 1.69 | 88.254 | 3703
CLC | 32506 | 503 | 113285 | 4656964 | 2 | 5.54 | 1.43 | 92.211 | 3766
EULER-SR | 26662 | 429 | 140518 | 4248713 | 17 | 10.85 | 35.69 | 84.856 | 3416
Ray | 55395 | 296 | 210612 | 4649552 | 14 | 6.08 | 0.61 | 91.771 | 3826
SOAPdenovo | 18468 | 569 | 87533 | 4098032 | 7 | 116.37 | 7.48 | 79.807 | 3037
Velvet | 22648 | 261 | 132865 | 3501984 | 2 | 1.93 | 1.23 | 73.574 | 3072
E+V-SC | 32051 | 344 | 132865 | 4540286 | 2 | 2.14 | 0.73 | 91.488 | 3759
IDBA1.1_contig | 98306 | 244 | 284464 | 4814043 | 8 | 5.06 | 0.27 | 94.896 | 4035
IDBA1.1_scaffold | 109057 | 229 | 284464 | 4813610 | 8 | 4.97 | 0.89 | 94.923 | 4040
SPAdes2.4_contigs | 110539 | 277 | 269177 | 4877521 | 2 | 5.27 | 0.79 | 95.622 | 4047
SPAdes2.4_scaffolds | 112120 | 250 | 269177 | 4910892 | 4 | 6.58 | 1.33 | 95.698 | 4055

Isolate E. coli
A5 | 43651 | 176 | 181690 | 4551797 | 0 | 0.26 | 0.11 | 97.787 | 4154
ABySS | 106155 | 96 | 221861 | 4619631 | 2 | 3.66 | 0.41 | 98.871 | 4239
CLC | 86964 | 112 | 221549 | 4550314 | 1 | 1.79 | 0.31 | 97.799 | 4186
EULER-SR | 110153 | 100 | 221409 | 4574240 | 8 | 2.49 | 10.15 | 97.846 | 4180
Ray | 83128 | 113 | 221942 | 4563341 | 2 | 2.18 | 0.18 | 97.937 | 4185
SOAPdenovo | 62512 | 141 | 172567 | 4519621 | 0 | 27.26 | 4.69 | 97.345 | 4134
Velvet | 82776 | 120 | 242032 | 4554702 | 3 | 2.36 | 0.37 | 97.864 | 4185
E+V-SC | 54856 | 171 | 166115 | 4539639 | 0 | 1.26 | 0.13 | 97.465 | 4124
IDBA1.1_contig | 106844 | 110 | 221687 | 4565529 | 3 | 2.99 | 0.31 | 97.992 | 4195
IDBA1.1_scaffold | 133098 | 93 | 284363 | 4565454 | 4 | 3.61 | 0.59 | 98.021 | 4204
SPAdes2.4_contigs | 134076 | 97 | 285228 | 4634583 | 2 | 2.99 | 0.57 | 98.916 | 4245
SPAdes2.4_scaffolds | 134076 | 97 | 285228 | 4635776 | 2 | 3.92 | 0.59 | 98.937 | 4245
 
ABySS 1.3.4, EULER-SR 2.0.1, Ray 2.0.0, Velvet, and E+V-SC were run with vertex size 55. A5 and CLC 3.22.55708 were run with default parameters. SOAPdenovo 1.0.4 was run with vertex size 27–31. IDBA 1.1.0 was run in its default iterative mode. The total assembly size may increase (and in some cases exceed the genome size) due to contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The percentage of the E. coli genome covered (the Genome fraction (%) column) filters out these issues. The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size. Misassemblies are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference. Mismatch (substitution) error rate and number of indels are measured in aligned regions of the contigs. In each column, the best assembler by that criterion is indicated in bold.
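The difference between N50 and NG50 described above can be sketched in a few lines. This is a minimal illustration with hypothetical function names and made-up contig lengths, not the code used to produce the table.

```python
def nx50(contig_lengths, target_length):
    """Length of the contig at which the running total, taken from the longest
    contig downwards, first reaches half of target_length; None if it never does."""
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= target_length:
            return length
    return None


def n50(contig_lengths):
    """N50: the target is the total assembly size."""
    return nx50(contig_lengths, sum(contig_lengths))


def ng50(contig_lengths, genome_size):
    """NG50: the target is the reference genome size."""
    return nx50(contig_lengths, genome_size)


contigs = [80, 70, 50, 40, 30, 20]  # made-up lengths, total 290
print(n50(contigs))        # 70
print(ng50(contigs, 400))  # 50
```

Because NG50 is tied to the reference length, it stays comparable across assemblers whose total assembly sizes differ.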
 
 

Related publications

  • Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021

  • Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

  • Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20
  • Sergey I. Nikolenko, Anton I. Korobeynikov and Max A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

 

Acknowledgements

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

Bringing Modern Science to Russian High Schools

Our lab is taking on the challenge of organizing a series of lectures by the world's leading scientists (in all areas of science) for Russian high school students.

Since Soviet times, Russian high schools for gifted students have been outstanding educational institutions, where the brightest students have achieved top results in mathematics, physics, and other disciplines. Many modern Russian scientists are alumni of such schools, among them Fields medalists Grigory Perelman and Stanislav Smirnov. Kolmogorov himself founded, and taught mathematics at, Boarding School No. 18 in Moscow (now named after him).

We aim to bring together the educational talent of distinguished scientists and the passion of the most talented high school students in Russia!

Our laboratory has now set up contacts with four top Russian schools for gifted children (two in Moscow and two in St. Petersburg), and there is room to extend this list further.

In St. Petersburg we're collaborating with:

In Moscow we're happy to work with:

To the scientists who are willing to give a high school lecture:

Our goal is to facilitate interactions between leading scientists and directors of elite Russian high schools in order to organize lectures for high school students.

We're inviting scientists in various areas of research to take part in this initiative. 

The topics of the lectures are not limited to mathematics and physics but also include biology, chemistry, engineering and many other disciplines.

We also aim to bring the winners of the Russian government's "megagrants" program into this project.

If you're in Moscow or St. Petersburg and you have a few spare hours, consider giving a lecture to top high school students. They will appreciate it.

Write an e-mail to Mikhail Dvorkin at the Algorithmic Biology Lab (mikhail.dvorkin@gmail.com) and simply specify the time slots when you are available; we will handle the scheduling and logistics for you.

Yuri Zemlyanskiy

Personal

VK: http://www.vk.com/urikz

E-mail: yuri.zemlyanskiy@gmail.com

Career

  • Academic University Algorithmic Biology Lab, intern (2011 - present)
  • Yandex, Java developer (2009-2011)
  • Academic Gymnasium of Saint-Petersburg State University, computer science and math teacher (2010-2011)

Experience

Education

  • Department of Mathematics and Mechanics, St. Petersburg State University, St. Petersburg, Russia (2007 - present)
  • Academic Gymnasium of Saint-Petersburg State University (2005-2007)

 
