|Sunday August 26|
|Rosalind (Chair: Pavel Pevzner)|
Nikolay Vyahhi, Phillip Compeau
I propose an open source consortium for bioinformatics teaching materials, including textbook chapters, slides, concept tests, homework and exam questions and answers, programming problems and data analysis projects, and software tools for using these materials in class and out.
To seed this effort I am contributing materials from two courses: a bioinformatics theory course (for Computer Science students) emphasizing probabilistic models and methods; and a genomics and computational biology course (for Life Science students).
This effort is based on several principles. First, bioinformatics is highly interdisciplinary, yet each bioinformatics textbook tends to reflect only one of its constituent disciplines. Furthermore, both the available textbooks and the traditional lecture method fall far short of giving students adequate exercises to truly learn the concepts and skills. In effect, the job of writing all these teaching materials is too big for any one person. Instead, every teacher should be enabled to focus on writing materials in areas where they are expert, while drawing whatever materials they want from everyone else, via an open source consortium for sharing teaching materials.
Second, bioinformatics teaching should draw lessons from other fields such as physics teaching, where it has been shown that traditional lecturing (passive learning) is far less effective than active learning, in which students answer and discuss problems in class. Specifically, I have developed teaching materials and software tools for in-class concept tests, each defined as a question that challenges the students' understanding of a specific concept.
Whereas ROSALIND computational problems may be viewed as empirical (implicit) tests of mastery of a concept or skill, in-class concept testing explicitly teaches such mastery by challenging students to think about how to use a concept, and rapidly exposing the most common errors for all to see and understand. I illustrate with examples from the approximately 300 bioinformatics concept tests I have written for this effort.
I also present software tools for in-class concept testing, and for selecting and "re-compiling" content in flexible ways. Finally, I will discuss critical issues for such a consortium, such as automatic authorship tracking, sharing, and security.
For more details, see http://thinking.bioinformatics.ucla.edu/teaching.
|Rosalind Problem Presentations 1|
A genomic or proteomic sequence can be seen as composed of a number of possibly overlapping words of a certain length, and the composition of a sequence is given by the frequency with which each possible word occurs within the sequence. In this talk, we review the biological significance of sequence composition and discuss efficient methods to obtain the word composition of a sequence, along with their implementation in the framework of the ROSALIND programming and testing environment for bioinformatics problems.
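As an illustration of the word composition described above, here is a minimal sketch in Python. The function name `kmer_composition`, the default DNA alphabet, and the choice of lexicographic output order are assumptions made for this example, not details taken from the talk.

```python
from collections import Counter
from itertools import product


def kmer_composition(seq, k, alphabet="ACGT"):
    """Return the count of every possible word of length k, in lexicographic order.

    Overlapping occurrences are counted, so a sequence of length n
    contributes n - k + 1 words in total.
    """
    # Slide a window of width k over the sequence and tally each word.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Enumerate all |alphabet|^k words so that absent words appear with count 0.
    return [counts["".join(word)] for word in product(alphabet, repeat=k)]


composition = kmer_composition("ACGTACGT", 2)
print(composition)
```

A single pass over the sequence is enough for the tally; the output vector has 4^k entries for DNA, so this direct enumeration is only practical for small k.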
Tomas Vinar, Brona Brejova
Here, we describe three problems that we have previously used in the context of a bioinformatics class taught at the Comenius University in Bratislava. The class is targeted at both computer science and biology students. Students with both backgrounds attend the same lectures, while tutorials and assignments are provided separately for biologists and computer scientists.
One particular challenge in teaching this course is to design assignments for biology students that illustrate basic algorithmic and mathematical concepts used in bioinformatics without requiring prior programming experience. The class does not require any previous programming courses, nor is it the goal of the class to teach programming. We have found that many concepts can be illustrated in a standard spreadsheet (MS Excel or one of its open-source equivalents), to which most of the students have been exposed previously.
|Rosalind Problem Presentations 2|
Recent advances in sequencing technology have enabled scientists to gather large amounts of DNA and RNA sequence data. One of the bioinformatics challenges is extracting new insights about the structure and function of biomolecules from the wealth of sequence data. In this talk, we look at two problems in the field of computational molecular biology designed to stimulate and challenge students. The first problem relates to understanding the secondary structure of an RNA molecule based on its primary sequence. The second problem relates to processing large amounts of DNA sequence data so as to capture the internal structure in the data and support a range of queries on the data efficiently. Applications will be discussed for aligning high-throughput sequencing reads to a genome and for screening a genome for interesting genetic elements such as CRISPRs.
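The abstract does not say which formulation of the RNA secondary structure problem is used. One common classroom version is base-pair maximization by dynamic programming (the Nussinov recurrence), sketched below under that assumption. The function name `max_base_pairs`, the `min_loop` parameter, and the restriction to Watson-Crick pairs are choices made for this example.

```python
def max_base_pairs(rna, min_loop=0):
    """Maximum number of non-crossing Watson-Crick pairs in an RNA string.

    min_loop is the minimum number of unpaired bases required between
    the two partners of a pair (0 allows adjacent bases to pair).
    """
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C")}
    n = len(rna)
    if n == 0:
        return 0
    # dp[i][j] holds the best score for the substring rna[i..j], inclusive.
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some base k to its left
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    inner = dp[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    best = max(best, left + 1 + inner)
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
print(max_base_pairs("GGGAAAUCC", min_loop=3))
```

The triple loop runs in cubic time, which is fine for teaching-sized inputs. The sketch returns only the score; recovering the pairs themselves would need a traceback over the same table.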
The basic task in molecular evolution studies is to calculate the frequency of a particular event in evolutionary history. A reversing substitution is an example of such a molecular event: at some moment in the past a direct amino acid substitution A → B occurred, and after a certain period of time we observe the reversing substitution B → A. Unfortunately, in most cases, with the possible exception of experimental evolution in bacteria, we do not know the intermediate (ancestral) state of a protein. We can observe proteins in human, mouse, dog, elephant and other species only in their current state, in the form of a multiple alignment of orthologous protein-coding genes. However, we can reconstruct the ancestral states at the internal nodes of the phylogenetic tree using the amino acids observed at the terminal nodes and the tree topology itself. There are a variety of methods (maximum parsimony, maximum likelihood, Bayesian methods) and programs (PAML, Phylip, PAUP) to do so. Using the ancestral and terminal amino acids at a site, we can infer the substitutions.
Problem. Given a multiple alignment with the internal states reconstructed and the phylogenetic tree, calculate the number of reversing substitutions for different distances between the direct and the reversing substitution.
Solving this problem does not require a sophisticated algorithm, but it is a (simplified) example of a real-world problem in molecular evolution. It covers the basic concepts: the site, the phylogenetic tree, the multiple alignment, the correspondence between the two, and the inference of substitution events.
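A minimal sketch of one way to count reversing substitutions at a single alignment site follows. The abstract does not define the distance measure or the input format, so several things here are assumptions: the tree is given as a hypothetical `children` dictionary, `states` maps every node (internal and terminal) to its amino acid at the site, distance is measured in branches (1 means the reversal happens on the branch immediately below the direct substitution), and a lineage stops being followed once it changes to a third amino acid.

```python
from collections import Counter


def reversing_substitutions(children, states, root):
    """Count A->B followed by B->A along lineages, keyed by branch distance."""
    counts = Counter()

    def descend(node, a, b, dist):
        # The lineage at `node` holds state b, reached by a direct a->b change.
        for child in children.get(node, []):
            if states[child] == b:
                descend(child, a, b, dist + 1)  # still b: keep following
            elif states[child] == a:
                counts[dist + 1] += 1  # reversing substitution b->a found
            # any other state ends the search along this lineage

    # Visit every branch; each substitution found is a candidate direct change.
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in children.get(parent, []):
            if states[child] != states[parent]:
                descend(child, states[parent], states[child], 0)
            stack.append(child)
    return counts


# Toy tree for one site: A -> B on root->n1, then B -> A on n1->human.
children = {"root": ["n1", "n2"], "n1": ["human", "mouse"], "n2": ["dog", "elephant"]}
states = {"root": "A", "n1": "B", "human": "A", "mouse": "B",
          "n2": "A", "dog": "A", "elephant": "A"}
print(reversing_substitutions(children, states, "root"))
```

Running the function once per alignment column and summing the counters gives the distribution over the whole alignment. With branch lengths available, the same traversal could accumulate evolutionary distance instead of branch counts.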
The shape of education is changing from strictly classroom-based learning to encompassing online learning, either as auxiliary learning tools or as a complete learning environment in its own right. Finding that hands-on training, while very useful, does not meet the demand for courses, the European Bioinformatics Institute has developed a Train-on-line site to provide a series of bioinformatics courses to a wider audience. Online learning can be particularly useful for bioinformatics courses, where students often have diverse backgrounds, as it permits students with similar learning needs to link up. The EBI plans to promote this through the use of subject-focused online Forums, where experts will be able to link directly with groups of students.
To be successful, online learning needs good visibility. One approach is to connect with the efforts of Wikipedia, Wikiversity and Wikibooks. EBI online courses link glossary terms to Wikipedia, and plan to link terms back from Wikipedia to online courses, such as to modules covering EBI databases that have entries in Wikipedia. Courses can also be placed on Wikiversity for greater accessibility to the public.
A second major change in the education system is an online environment for teachers, where they can share materials, thereby improving the quality of classroom-based learning and helping to establish education standards. The Bioinformatics Training Network is one such site, a community-based project that aims to provide a centralised facility to share materials, to list training events (including course content) and to discuss training experiences. The site was developed and is maintained by people active in bioinformatics education in any country worldwide.
Discussion Panel 1: How do we teach bioinformatics to 10,000 students at the same time?
The scalability of bioinformatics education is a question of the utmost importance for the next decade. Everyone interested in taking part will be formed into small groups to discuss a number of questions related to this central theme.
|Bioinformatics for Biologists (Chair: Ron Shamir)|
The identification of transcription factor binding sites is an important step in understanding the regulation of gene expression. To address this need, many motif-finding tools have been described that can find short sequence motifs given only an input set of sequences.
Somewhat surprisingly, development of the significance analysis of the motifs reported by those motif finders has lagged considerably behind the extensive development of the finders themselves. Nevertheless, this analysis is often crucial in helping scientists decide whether or not to carry the predicted motifs to the next stage of their analysis. We will discuss the problem of evaluating the statistical significance of sequence motifs in the general context of evaluating the statistical significance of an observed result.
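The general idea of judging an observed result against a null distribution can be sketched with an empirical (Monte Carlo) p-value. This is a generic illustration, not necessarily the method presented in the talk; the function name `empirical_p_value` and the example scores are made up, and in practice the null scores would come from running the same motif finder on shuffled or randomly generated sequences.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as extreme as the observed score.

    The +1 in the numerator and denominator counts the observed score
    itself, so the estimate is never exactly zero.
    """
    at_least_as_good = sum(1 for score in null_scores if score >= observed)
    return (1 + at_least_as_good) / (1 + len(null_scores))


# Hypothetical motif scores obtained from nine null (shuffled) input sets.
null_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(empirical_p_value(5, null_scores))
print(empirical_p_value(10, null_scores))
```

The resolution of such an estimate is limited by the number of null samples: with N samples the smallest reportable value is 1 / (N + 1), which is one reason analytical or importance-sampling approaches are of interest for very significant motifs.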
|B4B Chapter Proposals|
Prediction of "protein sorting", i.e. the subcellular location of proteins, has become a major task in bioinformatics. The problem is easy to formulate and understand from a biological point of view, yet the computational solutions are often complex and involve several machine learning methods. Thus, protein sorting is a well suited case for introducing sequence-based machine learning methods for biologists.
Methods for predicting protein sorting from the amino acid sequence can roughly be divided into three types: Homology-based methods that rely on alignment to proteins with known location; signal-based methods that attempt to recognize the actual sorting signals; and global property methods that utilize the fact that proteins from different subcellular compartments differ in amino acid composition or other global properties of the sequence.
In my presentation, I will focus on two very important sorting signals, the signal peptide and the transmembrane helix, and show how two machine learning methods, artificial neural networks and hidden Markov models, have been successfully applied in their recognition. In addition, I will briefly mention issues of training set / test set division and overfitting, which apply to all types of machine learning and are important to understand even for the casual user of such methods.
Mobile elements constitute large portions of eukaryotic genomes. They are sequences that are often replicated, and replicated copies then undergo evolution separately; thus, by considering a family of mobile (repeat) elements one can deduce evolutionary history, leading to important insights on species relations, population structure etc. We consider probabilistic modeling of mobile elements phylogeny, starting from the simplest statistical considerations and then proceeding to more complicated models.
Human genetic and metabolic diversity is heavily influenced by complex microbial communities that inhabit the human body. The microbiota is highly variable both within and between people in body habitats such as the gut, skin, and oral cavity, and changes in the microbiota can cause or prevent disease. In this talk, we discuss the biological problem of comparing microbial communities across people, body habitats, health conditions, and time, along with the related computational problems of designing taxonomically universal PCR primers, determining and quantifying the composition of environmental samples, and comparing abundance profiles across microbial communities.
In this lecture I will present the algorithmic challenges posed by two novel types of sequencing technologies: the SOLiD system, which generates color-space reads, and Single-Molecule Sequencing systems, which have an extremely high indel error rate but can read each piece of DNA two or more times. I will then explain how classical string alignment algorithms must be adapted to deal with this type of data, in particular explaining the generalization of sequence alignment to the Weighted Sequence Graph abstraction, and showing how this can be further adapted to work with color-space data.
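To make color-space reads concrete, the sketch below encodes and decodes a read using the standard two-base scheme, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names `to_color_space` and `from_color_space` and the text format of a read (first base followed by color digits) are assumptions for this example; the Weighted Sequence Graph approach from the talk is not shown.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def to_color_space(seq):
    """Encode a DNA string as its first base followed by one color per base pair."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(read):
    """Decode a color-space read; each base depends on all colors before it."""
    bases = [read[0]]
    current = CODE[read[0]]
    for digit in read[1:]:
        current ^= int(digit)  # applying the color gives the next base
        bases.append(BASE[current])
    return "".join(bases)


read = to_color_space("ATGGC")
print(read)
print(from_color_space(read))
# A single miscalled color changes every base after it when decoded naively.
print(from_color_space("A3003"))
```

The last line shows why decoding first and aligning afterwards is fragile: one color error turns the tail of the read into a different sequence, which is the kind of behavior that motivates alignment algorithms working in color space directly.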
Discussion Panel 2: How do we teach bioinformatics to 10,000 students at the same time?
Discussion participants will reconvene to gather the best ideas from each of the morning discussion groups.