Antibodies are proteins produced by the body’s immune system in response to antigens, which are potentially harmful substances. They are formed of four polypeptide chains: two identical heavy chains and two identical light chains. A heavy chain is encoded by four gene segments: V (variable), D (diversity), J (joining), and C (constant); similarly, a light chain is encoded by three gene segments: V, J, and C. Antibodies are not encoded directly in the genome, but are assembled from these gene segments, each chosen from a pool of up to hundreds of candidates. Moreover, nucleotides may be inserted or deleted at the junctions between segments, which increases antibody diversity, and somatic hypermutation diversifies the antibody repertoire further.
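The assembly process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the segment strings are short made-up nucleotide strings rather than real germline genes, the names `trim_and_insert` and `recombine_heavy` are hypothetical, and the deletion and insertion limits are arbitrary. Somatic hypermutation is not modeled.

```python
import random

# Hypothetical toy "germline" segments (NOT real gene sequences).
V_SEGMENTS = ["CAGGTG", "GAGGTC", "CAGCTG"]
D_SEGMENTS = ["GGTAC", "TACTA"]
J_SEGMENTS = ["TGGGGC", "TTTGAC"]
C_SEGMENT = "GCCTCC"


def trim_and_insert(left, right, rng, max_delete=2, max_insert=3):
    """Join two segments with junctional diversity.

    Deletes up to max_delete nucleotides from each side of the junction
    and inserts up to max_insert random nucleotides between them.
    The limits are arbitrary illustration values.
    """
    del_left = rng.randint(0, max_delete)
    del_right = rng.randint(0, max_delete)
    n_insert = rng.randint(0, max_insert)
    inserted = "".join(rng.choice("ACGT") for _ in range(n_insert))
    return left[:len(left) - del_left] + inserted + right[del_right:]


def recombine_heavy(rng):
    """Pick one V, D, and J segment, join them, and append the C segment."""
    v = rng.choice(V_SEGMENTS)
    d = rng.choice(D_SEGMENTS)
    j = rng.choice(J_SEGMENTS)
    vd = trim_and_insert(v, d, rng)
    vdj = trim_and_insert(vd, j, rng)
    return vdj + C_SEGMENT


def count_segment_combinations():
    """Number of V-D-J choices before any junctional changes."""
    return len(V_SEGMENTS) * len(D_SEGMENTS) * len(J_SEGMENTS)


if __name__ == "__main__":
    rng = random.Random(0)
    print(count_segment_combinations())
    print(recombine_heavy(rng))
```

Even in this toy setting, the junctional insertions and deletions yield far more distinct sequences than the number of segment combinations alone, which is the reason a complete antibody database is out of reach.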
The effectiveness of an antibody in blocking a particular antigen depends strongly on its amino acid sequence, as well as on the presence or absence of certain modifications. This makes antibody sequencing highly important. However, because antibodies are so diverse, no complete antibody database exists. Consequently, tandem mass spectrometry (MS/MS) database search approaches to protein sequencing are inapplicable here, leaving de novo sequencing as the most attractive alternative.
Just a few years ago, sequencing a single antibody represented a heroic effort. “Digitizing” the $25 billion antibody industry is an important goal because antibodies are key diagnostic and therapeutic agents. Our experimental collaborators anticipate that once the cost of sequencing an antibody drops below $1,000, most diagnostic and therapeutic antibodies will be routinely sequenced. Sequencing thousands of antibodies will in turn drive digitization throughout the industry, a task that requires advanced software tools. Future applications will also focus on previously unsequenced polyclonal antibodies. This research, if successful, would lead to disruptive computational technology for the antibody industry.
In the first stage of this project, we developed a de Bruijn graph approach for the de novo assembly of thousands of top-down spectra into a protein sequence.
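To give a sense of the general idea, the following is a minimal sketch of de Bruijn graph assembly. It is a simplified stand-in, not the method developed in this project: it assembles plain amino acid strings rather than top-down spectra, and it assumes error-free fragments in which every (k-1)-mer of the target occurs only once. The function names `build_de_bruijn`, `eulerian_path`, and `assemble`, the toy fragments, and the choice k = 4 are all hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(fragments, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each distinct
    k-mer contributes one edge from its prefix to its suffix."""
    kmers = set()
    for frag in fragments:
        for i in range(len(frag) - k + 1):
            kmers.add(frag[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm. Assumes the graph has an Eulerian path."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes)
         if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(fragments, k):
    """Spell the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(build_de_bruijn(fragments, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    # Toy overlapping fragments of a short amino acid string.
    fragments = ["EVQLVES", "LVESGGG", "SGGGLVQ"]
    print(assemble(fragments, k=4))  # prints "EVQLVESGGGLVQ"
```

With real spectra, fragments carry mass measurements instead of letters and contain errors, so the graph construction and path selection are considerably more involved than in this string-based sketch.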