Our research in computational proteomics lies mainly in the area of top-down mass spectrometry, a novel and highly promising technology for acquiring mass spectra. In contrast to the traditional bottom-up approach, it does not require protein digestion prior to the tandem mass spectrometry step. Analysis of intact proteins offers certain advantages, such as the ability to detect post-translational modifications in a coordinated fashion and to identify multiple protein species.

Researchers

Kira Vyatkina

Alumni

Sonya Alexandrova

Mikhail Dvorkin

Yakov Sirotkin

Interns (Summer 2011):

Maxim Gladkikh

Yuri Zemlyanskiy

Andrey Lushnikov

Ilya Makeev

Student (Fall 2011):

Ksenia Krasheninnikova

Current projects

Tag generation for top-down mass spectra

(joint project with Pavel Pevzner’s lab at UCSD)

A peptide sequence tag (PST) is a short sequence of amino acids. In bottom-up mass spectrometry, PSTs are successfully used for spectra interpretation; in the top-down case, however, their generation and use have not yet been sufficiently explored. In this project, we propose and analyze methods of PST generation for top-down spectra and indicate their potential applications to spectra identification and mixed spectra interpretation.
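To make the notion of a tag concrete, here is a minimal sketch of one common way to read tags off a deconvoluted spectrum: chain peaks whose mass differences match amino acid residue masses. It is an illustration only and not the method proposed in the project; the function names, the tolerance, and the reduced residue table (Q and K are left out, and I is reported as L because the two have the same mass) are assumptions made for this example.

```python
# Monoisotopic residue masses in Da (reduced table, assumed for illustration).
RESIDUE_MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "E": 129.04259, "F": 147.06841,
}
MAX_RESIDUE = max(RESIDUE_MASSES.values())


def residue_for(delta, tol):
    """Return the residue whose mass is within tol of delta, or None."""
    for aa, mass in RESIDUE_MASSES.items():
        if abs(delta - mass) <= tol:
            return aa
    return None


def generate_tags(masses, tag_len=3, tol=0.01):
    """Enumerate tags of length tag_len spelled by chains of peaks."""
    masses = sorted(masses)
    tags = set()

    def extend(i, tag):
        if len(tag) == tag_len:
            tags.add(tag)
            return
        for j in range(i + 1, len(masses)):
            delta = masses[j] - masses[i]
            if delta > MAX_RESIDUE + tol:
                break  # masses are sorted, so later peaks are even farther
            aa = residue_for(delta, tol)
            if aa is not None:
                extend(j, tag + aa)

    for start in range(len(masses)):
        extend(start, "")
    return sorted(tags)


# Hypothetical peak list: consecutive gaps correspond to G, A, and S.
peaks = [500.0, 557.02146, 628.05857, 715.09060, 900.0]
print(generate_tags(peaks))
```

A tag read this way is direction-ambiguous: depending on whether the peaks come from prefix or suffix fragments, it spells the protein subsequence forwards or backwards, so both orientations would need to be checked against a database.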

Paper:

Yakov Sirotkin, Xiaowen Liu, Maxim Gladkikh, Pavel Pevzner and Kira Vyatkina, “Peptide Sequence Tags for Top-Down Spectra”. (accepted to RECOMB CP 2012)

Software:

MS-Align+Tag (download)

Error correction for top-down mass spectra

(joint project with Pavel Pevzner’s lab at UCSD)

Spectrum interpretation starts with retrieval of isotopomer envelopes from a given spectrum, followed by derivation of monoisotopic masses from those envelopes. The result is a deconvoluted spectrum. However, ±1 Da errors are often observed in the masses composing deconvoluted spectra, which can cause serious problems in subsequent spectrum identification. The goal of this project is to eliminate errors of this kind.
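The effect of a ±1 Da error can be illustrated with a small sketch that compares observed deconvoluted masses against a list of expected masses and reports which isotope offset, if any, brings each observed mass into agreement. This is not the correction method developed in the project (which is not described here); the function name, the tolerance, and the example masses are assumptions, and the spacing constant is the approximate 13C-12C mass difference.

```python
ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference, in Da


def classify_offsets(observed, expected, tol=0.02):
    """Label each observed mass with the isotope offset (-1, 0, or +1)
    that aligns it with some expected mass, or None if none does."""
    labels = []
    for m in observed:
        label = None
        for k in (0, -1, 1):  # try an exact match first
            shifted = m - k * ISOTOPE_SPACING
            if any(abs(shifted - t) <= tol for t in expected):
                label = k
                break
        labels.append(label)
    return labels


# Hypothetical masses: the second is one isotope too heavy, the third one too light.
expected = [1000.50, 2000.75, 3000.25]
observed = [1000.50, 2001.75335, 2999.24665, 1500.0]
print(classify_offsets(observed, expected))
```

An identification tool that only accepts offset 0 would lose the second and third masses here, which is why such errors hurt downstream matching.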

Interpretation of mass spectra of substances resulting from chemical experiments

(joint project with Laboratory of Nanobiotechnologies, Academic University, headed by Corr. Mem. of RAS M.V. Dubina)

The goal of this project is to interpret mass spectra of substances that are expected to contain peptides. Such a hypothesis can be confirmed by retrieving the alphabet of amino acids composing the peptides present in a substance and then explaining the given mass spectra.

Interpretation of multiplex mass spectra

Some mass spectra turn out to be produced from a mixture of proteins rather than from a single protein. They are usually referred to as mixed, or multiplex. This project aims to find a method for interpreting such spectra.

Completed projects

Protein identification using top-down spectra

(joint project with Pavel Pevzner’s lab at UCSD)

This project was devoted to the development of a fast method for top-down protein identification that allows searching for unexpected post-translational modifications. The proposed algorithm, MS-Align+, performs significantly better than previously existing approaches on two top-down datasets used for benchmarking such software tools.

Paper:

Xiaowen Liu, Yakov Sirotkin, Yufeng Shen, Gordon Anderson, Yihsuan S. Tsai, Ying S. Ting, David R. Goodlett, Richard D. Smith, Vineet Bafna and Pavel A. Pevzner, “Protein identification using top-down spectra”. “Molecular and Cellular Proteomics”. 2011 Oct 25. [Epub ahead of print]

Protein morphing

(joint project with Pavel Pevzner’s lab at UCSD, Burnham Institute for Medical Research, and Joint Center for Structural Genomics, Bioinformatics Core)

Within this project, we developed an efficient algorithm for protein morphing based on linear interpolation and implemented it as a web server.
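As a rough illustration of morphing by linear interpolation, the sketch below generates intermediate frames between two conformations. It assumes that the two structures are already superimposed, have the same atoms in the same order, and are given as Cartesian coordinates; the actual MORPH-PRO algorithm may interpolate a different representation, and the function name is hypothetical.

```python
def interpolate_structures(start, end, n_frames):
    """Return n_frames coordinate sets going linearly from start to end.

    start and end are lists of (x, y, z) tuples with matching atom order.
    """
    if len(start) != len(end):
        raise ValueError("structures must have the same number of atoms")
    if n_frames < 2:
        raise ValueError("need at least two frames")
    frames = []
    for f in range(n_frames):
        t = f / (n_frames - 1)  # t runs from 0.0 (start) to 1.0 (end)
        frame = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(start, end)
        ]
        frames.append(frame)
    return frames


# Hypothetical two-atom structures.
start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
end = [(0.0, 2.0, 0.0), (1.0, 2.0, 4.0)]
frames = interpolate_structures(start, end, 3)
print(frames[1])
```

Plain Cartesian interpolation can distort bond lengths and angles in intermediate frames when the motion involves large rotations, which is one reason practical morphing methods need more care than this sketch shows.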

Paper:

Natalie E. Castellana, Andrey Lushnikov, Piotr Rotkiewicz, Natasha Sefcovic, Pavel A. Pevzner, Adam Godzik, and Kira Vyatkina, “MORPH-PRO: A Novel Algorithm and Web Server for Protein Morphing”. In Proc. of the 12th Workshop on Algorithms in Bioinformatics (WABI 2012), September 10-12, Ljubljana, Slovenia, LNCS 7534, Springer, 2012, 12pp. (to appear) (Appendix)

Collaboration

Our research in computational mass spectrometry is carried out in close collaboration with Pacific Northwest National Laboratory (PNNL).

BayesHammer

BayesHammer: Bayesian Clustering for Error Correction in Single-Cell Sequencing

Sergey I. Nikolenko, Anton I. Korobeynikov, Max A. Alekseyev

BMC Genomics 2013, 14(Suppl 1):S7

Error correction for sequenced reads remains difficult, especially for single-cell sequencing projects with extremely non-uniform coverage. We present the BayesHammer error correction tool that uses Bayesian subclustering to correct sequencing reads. While BayesHammer was designed for single-cell sequencing, we demonstrate that it also improves on state-of-the-art error correction tools for standard (multi-cell) sequencing data.

What is Single Cell Genomics?

Most bacteria in environments ranging from the human body to the ocean cannot be cloned in the laboratory and thus cannot be sequenced using existing Next Generation Sequencing (NGS) technologies. This represents the key bottleneck for various projects ranging from the Human Microbiome Project (HMP) [3, 6] to antibiotics discovery [9]. For example, the key question in the Human Microbiome Project is how bacteria interact with each other. These interactions are often conducted by various peptides that are produced either for communication with other bacteria or for killing them. However, peptidomics studies of the human microbiome are now limited since mass spectrometry (the key technology for such studies) requires knowledge of fairly complete proteomes. On the other hand, while studies of new peptide antibiotics would greatly benefit from DNA sequencing of genes coding for Non-Ribosomal Peptide Synthetases (NRPS) [11, 13], existing metagenomics approaches are unable to sequence these exceptionally long genes (over 60,000 nucleotides).

HMP and discovery of new antibiotics are just two examples of many projects that would be revolutionized by Single Cell Sequencing (SCS). Recent improvements in both experimental [4, 7, 8, 10] and computational [1] aspects of SCS have opened the possibility of sequencing bacterial genomes from single cells. In particular, [1] demonstrated that SCS can capture a large number of genes, sufficient for inferring the organism’s metabolism. In many applications (including proteomics and antibiotics discovery), having a great majority of genes captured is almost as useful as having complete genomes.

Currently, Multiple Displacement Amplification (MDA) is the dominant technology for single cell amplification [2]. However, MDA introduces extreme amplification bias (orders-of-magnitude difference in coverage between different regions) and gives rise to chimeric reads and read-pairs that complicate the ensuing assembly. Acknowledging the fact that existing assemblers were not designed to handle these complications, Rodrigue et al., 2009 [12] remarked that the challenges facing SCS are increasingly computational rather than experimental. A recent paper [5] illustrates that existing assemblers produce inferior results for single cell projects even when the goal is to assemble a single NRPS, let alone a complete genome.

Chitsaz et al., 2011 [1] introduced the E+V-SC assembler, combining parts of EULER-SR with a modified Velvet, and achieved a significant improvement in the quality of SCS assemblies. However, as the authors of E+V-SC realized, one needs to change the algorithmic design (rather than just modify existing tools like Velvet) to fully utilize the potential of SCS.

We present the SPAdes assembler, introducing a number of new algorithmic solutions and improving on state-of-the-art assemblers for both SCS and standard (multicell) bacterial datasets.

References:

[1] H. Chitsaz, J.L. Yee-Greenbaum, G. Tesler, M.J. Lombardo, C.L. Dupont, J.H. Badger, M. Novotny, D.B. Rusch, L.J. Fraser, N.A. Gormley, O. Schulz-Trieglaff, G.P. Smith, D.J. Evers, P.A. Pevzner, and R.S. Lasken. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol, 29(10):915–921, 2011.
[2] F.B. Dean, J.R. Nelson, T.L. Giesler, and R.S. Lasken. Rapid amplification of plasmid and phage DNA using phi 29 DNA polymerase and multiply-primed rolling circle amplification. Genome Res, 11(6):1095–1099, Jun 2001.
[3] S.R. Gill, M. Pop, R.T. Deboy, P.B. Eckburg, P.J. Turnbaugh, B.S. Samuel, J.I. Gordon, D.A. Relman, C.M. Fraser-Liggett, and K.E. Nelson. Metagenomic analysis of the human distal gut microbiome. Science, 312(5778):1355–1359, Jun 2006.
[4] J.P. Glotzbach, M. Januszyk, I.N. Vial, V.W. Wong, A. Gelbard, T. Kalisky, H. Thangarajah, M.T. Longaker, S.R. Quake, G. Chu, and G.C. Gurtner. An information theoretic, microfluidic-based single cell analysis permits identification of subpopulations among putatively homogeneous stem cells. PLoS One, 6(6):e21211, 2011.
[5] R.V. Grindberg, T. Ishoey, D. Brinza, E. Esquenazi, R.C. Coates, W.T. Liu, L. Gerwick, P.C. Dorrestein, P. Pevzner, R. Lasken, and W.H. Gerwick. Single cell genome amplification accelerates identification of the apratoxin biosynthetic pathway from a complex microbial assemblage. PLoS One, 6(4):e18565, 2011.
[6] M. Hamady and R. Knight. Microbial community profiling for human microbiome projects: Tools, techniques, and challenges. Genome Res, 19(7):1141–1152, Jul 2009.
[7] T. Ishoey, T. Woyke, R. Stepanauskas, M. Novotny, and R.S. Lasken. Genomic sequencing of single microbial cells from environmental samples. Current Opinion in Microbiology, 11(3):198–204, Jun 2008.
[8] S. Islam, U. Kjallquist, A. Moliner, P. Zajac, J.B. Fan, P. Lonnerberg, and S. Linnarsson. Characterization of the single-cell transcriptional landscape by highly multiplex RNA-seq. Genome Res, 21(7):1160–1167, Jul 2011.
[9] J.W. Li and J.C. Vederas. Drug discovery and natural products: end of an era or an endless frontier? Science, 325(5937):161–165, Jul 2009.
[10] N. Navin, J. Kendall, J. Troge, P. Andrews, L. Rodgers, J. McIndoo, K. Cook, A. Stepansky, D. Levy, D. Esposito, L. Muthuswamy, A. Krasnitz, W.R. McCombie, J. Hicks, and M. Wigler. Tumour evolution inferred by single-cell sequencing. Nature, 472(7341):90–94, Apr 2011.
[11] C. Rausch, T. Weber, O. Kohlbacher, W. Wohlleben, and D.H. Huson. Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res, 33(18):5799–5808, 2005.
[12] S. Rodrigue, R.R. Malmstrom, A.M. Berlin, B.W. Birren, M.R. Henn, and S.W. Chisholm. Whole genome amplification and de novo assembly of single bacterial cells. PLoS One, 4(9):e6864, 2009.
[13] S.A. Sieber and M.A. Marahiel. Molecular mechanisms underlying nonribosomal peptide synthesis: approaches to new antibiotics. Chem Rev, 105(2):715–738, Feb 2005.

SPAdes: Path-sets extension

PATH-SETS: A Novel Approach for Comprehensive Utilization of Mate-Pairs in Genome Assembly

Son Pham**, Dmitry Antipov**, Alexander Sirotkin, Glenn Tesler, Pavel Pevzner and Max Alekseyev.

This work was supported by the Government of the Russian Federation (grant 11.G34.31.0018) and by the National Institutes of Health, USA (NIH grant 3P41RR024851-02S1). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

Source code will be posted after the paper is accepted.

** - joint first authors

Sonya Alexandrova

e-mail: sonya.alexandrova@gmail.com

Education:

St. Petersburg University of the Russian Academy of Sciences, MSc in Software Engineering, 2009-2011
St. Petersburg State Technical University, BSc in Technical Physics, 2005-2009
St. Petersburg Classical Gymnasium, 2001-2005
Highland View Middle School, OR, USA, 2000-2001

Work experience:

Academic University Algorithmic Biology Lab, research fellow (October 2011 - present)
Internship at the Abagyan Lab, UCSD (July - August 2011)
Internship at the Baker Lab, UW (August - September 2011)
Yandex Inc., software engineering intern (June 2010 - June 2011)

Scientific Interests:

Structural biology, Bioinformatics, Graph theory

SPAdes 2.4

SPAdes Assembler SPAdes manual with installation guide (ver 2.4.0)

Download SPAdes.

Support e-mail: spades.support@bioinf.spbau.ru

SPAdes 2.4 is out!

See all changes in changelog.

For the benchmark we used:

MDA single-cell E. coli; 6.3 Gb, 29M reads, 2x100 bp, insert size ~ 270 bp (Illumina Genome Analyzer IIx)
Standard isolate E. coli; 6.2 Gb, 28M reads, 2x100 bp, insert size ~ 215 bp (Illumina Genome Analyzer IIx)

The E. coli K-12 MG1655 reference is 4639675 bp long and has 4324 annotated genes. Only contigs of 500 bp and longer were taken into consideration.

Assembly	NG50	# contigs	Largest contig	Total length	# misassemblies	# mismatches per 100 kbp	# indels per 100 kbp	Genome fraction (%)	# genes
Single-cell E. coli
A5	14399	745	101584	4441145	8	11.68	0.17	89.681	3439
ABySS	68534	179	178720	4345617	6	3.32	1.69	88.254	3703
CLC	32506	503	113285	4656964	2	5.54	1.43	92.211	3766
EULER-SR	26662	429	140518	4248713	17	10.85	35.69	84.856	3416
Ray	55395	296	210612	4649552	14	6.08	0.61	91.771	3826
SOAPdenovo	18468	569	87533	4098032	7	116.37	7.48	79.807	3037
Velvet	22648	261	132865	3501984	2	1.93	1.23	73.574	3072
E+V-SC	32051	344	132865	4540286	2	2.14	0.73	91.488	3759
IDBA1.1_contig	98306	244	284464	4814043	8	5.06	0.27	94.896	4035
IDBA1.1_scaffold	109057	229	284464	4813610	8	4.97	0.89	94.923	4040
SPAdes2.4_contigs	110539	277	269177	4877521	2	5.27	0.79	95.622	4047
SPAdes2.4_scaffolds	112120	250	269177	4910892	4	6.58	1.33	95.698	4055

Isolate E. coli
A5	43651	176	181690	4551797	0	0.26	0.11	97.787	4154
ABySS	106155	96	221861	4619631	2	3.66	0.41	98.871	4239
CLC	86964	112	221549	4550314	1	1.79	0.31	97.799	4186
EULER-SR	110153	100	221409	4574240	8	2.49	10.15	97.846	4180
Ray	83128	113	221942	4563341	2	2.18	0.18	97.937	4185
SOAPdenovo	62512	141	172567	4519621	0	27.26	4.69	97.345	4134
Velvet	82776	120	242032	4554702	3	2.36	0.37	97.864	4185
E+V-SC	54856	171	166115	4539639	0	1.26	0.13	97.465	4124
IDBA1.1_contig	106844	110	221687	4565529	3	2.99	0.31	97.992	4195
IDBA1.1_scaffold	133098	93	284363	4565454	4	3.61	0.59	98.021	4204
SPAdes2.4_contigs	134076	97	285228	4634583	2	2.99	0.57	98.916	4245
SPAdes2.4_scaffolds	134076	97	285228	4635776	2	3.92	0.59	98.937	4245

ABySS 1.3.4, EULER-SR 2.0.1, Ray 2.0.0, Velvet, and E+V-SC were run with vertex size 55. A5 and CLC 3.22.55708 were run with default parameters. SOAPdenovo 1.0.4 was run with vertex size 27–31. IDBA 1.1.0 was run in its default iterative mode. The total assembly size may increase (and in some cases exceed the genome size) due to contaminants (see Chitsaz et al. (2011)), misassembled contigs, repeats, and hubs that contribute to multiple contigs. The percentage of the E. coli genome covered (Genome fraction (%) column) filters out these issues. The NG50 statistic is the same as the N50 except that the genome size is used rather than the assembly size. Misassemblies are locations on an assembled contig where the left flanking sequence aligns over 1 kb away from the right flanking sequence on the reference. Mismatch (substitution) error rate and number of indels are measured in aligned regions of the contigs. In each column, the best assembler by that criterion is indicated in bold.
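The difference between N50 and NG50 described above can be sketched in a few lines. The helper names and the toy contig lengths below are assumptions made for illustration; this is not the code used to produce the table.

```python
def nx50(lengths, total):
    """Largest length L such that contigs of length >= L together
    cover at least half of total; None if they never reach it."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return None


def n50(lengths):
    """N50: the threshold is half of the total assembly size."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the threshold is half of the reference genome size."""
    return nx50(lengths, genome_size)


# Hypothetical contig lengths and genome size.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs), ng50(contigs, 400))
```

Because NG50 uses the genome size, an assembly that covers less than half of the genome has no NG50 at all, and assemblies of different total size become directly comparable.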

Related publications

Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A. Gurevich, Mikhail Dvorkin, Alexander S. Kulikov, Valery M. Lesin, Sergey I. Nikolenko, Son Pham, Andrey D. Prjibelski, Alexey V. Pyshkin, Alexander V. Sirotkin, Nikolay Vyahhi, Glenn Tesler, Max A. Alekseyev, and Pavel A. Pevzner. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology 19(5) (2012), 455-477. doi:10.1089/cmb.2012.0021
Son K. Pham, Dmitry Antipov, Alexander Sirotkin, Glenn Tesler, Pavel A. Pevzner, and Max A. Alekseyev. Pathset Graphs: A Novel Approach for Comprehensive Utilization of Paired Reads in Genome Assembly. Journal of Computational Biology (2012). doi:10.1089/cmb.2012.0098

Nikolay Vyahhi, Son K. Pham, and Pavel A. Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly. Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261. doi:10.1007/978-3-642-33122-0_20
Sergey I. Nikolenko, Anton I. Korobeynikov, and Max A. Alekseyev. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics (2013) 14(S1):S7. doi:10.1186/1471-2164-14-S1-S7

Acknowledgements

Bringing Modern Science to Russian High Schools

Our lab is taking on the challenge of organizing a series of lectures by the world's leading scientists (in all areas of science) for Russian high school students.

Since Soviet times, Russian high schools for gifted students have been outstanding educational organizations, where the brightest kids have achieved the highest possible results in mathematics, physics, and other disciplines. Many modern Russian scientists are alumni of such schools (consider Fields medalists Grigory Perelman and Stanislav Smirnov). Think of Kolmogorov, who founded boarding school No. 18 in Moscow (now named after him) and worked there as a math teacher.

We aim to bring together the educational talent of distinguished scientists and the passion of the most talented high school students in Russia!

Our laboratory has now set up contacts with four top Russian schools for gifted children (two in Moscow and two in St. Petersburg), and there is room to extend this list further.

In St. Petersburg we're collaborating with:

Lyceum "Physical-Technical High School" (director Mikhail Georgievich Ivanov)
Lyceum 239 (director Maksim Yakovlevich Pratusevich)

In Moscow we're happy to work with:

Kolmogorov's lyceum (director Anatoly Aleksandrovich Chasovskih)
High school 57 (vice director Boris Mikhailovich Davidovich)

To the scientists who are willing to give a high school lecture:

We aim to facilitate interactions between leading scientists and directors of elite Russian high schools in order to organize lectures for high school students.

We're inviting scientists in various areas of research to take part in this initiative.

The topics of the lectures are not limited to mathematics and physics but also include biology, chemistry, engineering and many other disciplines.

We also hope to bring the winners of the Russian government's "megagrants" program into this project.

If you're in Moscow or St. Petersburg and you have a few spare hours, consider giving a lecture to top high school students. They will appreciate it.

Write an e-mail to Mikhail Dvorkin at the Algorithmic Biology Lab (mikhail.dvorkin@gmail.com) and simply specify the time slots when you are available: we will take care of the scheduling and logistics for you.

Yuri Zemlyanskiy

Personal

VK: http://www.vk.com/urikz

E-mail: yuri.zemlyanskiy@gmail.com

Career

Academic University Algorithmic Biology Lab, intern (2011 - present)
Yandex, Java developer (2009-2011)
Academic Gymnasium of Saint-Petersburg State University, computer science and math teacher (2010-2011)

Experience

Microsoft Data Structures And Algorithms School (2010).

Education

Department of Mathematics and Mechanics, St. Petersburg State University, St. Petersburg, Russia (2007 - present)
Academic Gymnasium of Saint-Petersburg State University (2005-2007)
