Immunoproteogenomics: analysis of antibody repertoire

What is an antibody repertoire?

An antibody repertoire is the set of circulating antibodies. Reconstructing the antibody repertoire is an important step in antibody drug development. We present a collection of tools for investigating antibody repertoires based on immunosequencing data:

IgRepertoireConstructor: an algorithm for construction of antibody repertoire and immunoproteogenomics analysis

IgSimulator: tool for simulation of antibody repertoire

IgQUAST: quality assessment tool for antibody repertoires (coming soon)

Antibody repertoire representation

We represent an antibody repertoire as a set of clusters that correspond to antibody clones. Each clone is a group of identical antibodies described by its nucleotide sequence, its frequency, and the set of Ig-Seq reads that make up the group. We use two files to describe an antibody repertoire: CLUSTERS.FA (a FASTA file containing antibody sequences) and RCM (Read-Cluster Map). Examples of CLUSTERS.FA and RCM files for a toy repertoire are listed below.

CLUSTERS.FA is a FASTA file in which each sequence corresponds to an antibody clone.

The header of each sequence contains the id and size of the corresponding cluster.

The example shows a repertoire containing 3 clusters of sizes 3, 2, and 1.

Every line of the RCM file contains a read name and the id of the cluster that the read belongs to.

For example, cluster 1 consists of the reads MISEQ@:53:000000000-A2BMW:1:2114:14345:28882, MISEQ@:53:000000000-A2BMW:1:2114:14345:28882 and MISEQ@:53:000000000-A2BMW:1:2114:14393:28886.
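The read-to-cluster mapping described above can be sketched in Python as follows. This is only an illustration: the tab separator, the function names parse_rcm and cluster_sizes, and the toy read names are assumptions, since the exact RCM layout is not specified here.

```python
from collections import Counter
import io


def parse_rcm(handle, sep="\t"):
    """Return {read_name: cluster_id} from an RCM-like text stream.

    Assumes one 'read_name<sep>cluster_id' pair per line (hypothetical layout).
    """
    read_to_cluster = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last separator so that separators inside read names survive.
        read_name, cluster_id = line.rsplit(sep, 1)
        read_to_cluster[read_name] = cluster_id
    return read_to_cluster


def cluster_sizes(read_to_cluster):
    """Count how many reads are assigned to each cluster id."""
    return Counter(read_to_cluster.values())


# Toy repertoire with 3 clusters of sizes 3, 2 and 1 (read names are made up).
toy_rcm = io.StringIO(
    "read_a\t1\nread_b\t1\nread_c\t1\nread_d\t2\nread_e\t2\nread_f\t3\n"
)
mapping = parse_rcm(toy_rcm)
sizes = cluster_sizes(mapping)
```

Cluster ids are kept as strings here, so the sketch does not depend on whether real ids are numeric.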

IgRepertoireConstructor

IgRepertoireConstructor is a tool for constructing an antibody repertoire from an Illumina Ig-Seq library. It takes as input immunosequencing reads that cover the variable regions of antibodies and returns the antibody repertoire constructed from those reads.

Visit the official IgRepertoireConstructor page on GitHub for more details and to download the latest version.

IgSimulator

IgSimulator is a tool for simulating an antibody repertoire and an Ig-Seq library. It is designed for testing and benchmarking tools that reconstruct Ig repertoires.

Visit the official IgSimulator page on GitHub for more details and to download the latest version.

IgQUAST

IgQUAST (Immunoglobulin QUality ASsessment Tool) is a tool for assessing the quality of antibody repertoires. IgQUAST takes one or more antibody repertoires as input and evaluates them in the following ways:

Single repertoire evaluation
Multiple repertoire comparison
Quality assessment against an ideal repertoire

Single repertoire evaluation

IgQUAST computes basic metrics such as # clusters, # singletons (clusters consisting of a single read), size of the largest cluster, average cluster size, and a set of metrics showing the number of clusters whose size is at or above given thresholds (# clusters >= 10, # clusters >= 50, # clusters >= 100, etc.). It also draws plots, such as histograms of the cluster size and cluster length distributions.

[Figures: histogram of cluster size distribution; histogram of cluster length distribution]
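The basic metrics listed above are simple functions of the cluster sizes. A minimal Python sketch, assuming the sizes are already available as a list of integers (the function name basic_metrics and the dictionary keys are illustrative, not IgQUAST's actual output format):

```python
def basic_metrics(sizes, thresholds=(10, 50, 100)):
    """Compute simple single-repertoire metrics from a list of cluster sizes."""
    sizes = list(sizes)
    if not sizes:
        raise ValueError("repertoire contains no clusters")
    metrics = {
        "# clusters": len(sizes),
        # A singleton is a cluster consisting of a single read.
        "# singletons": sum(1 for s in sizes if s == 1),
        "max cluster size": max(sizes),
        "avg cluster size": sum(sizes) / len(sizes),
    }
    # Number of clusters whose size is at or above each threshold.
    for t in thresholds:
        metrics[f"# clusters >= {t}"] = sum(1 for s in sizes if s >= t)
    return metrics


# Toy repertoire with clusters of sizes 3, 2 and 1.
toy_metrics = basic_metrics([3, 2, 1])
```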

IgQUAST additionally performs advanced analysis of mutated groups (groups of antibodies that possibly developed from the same antibody). An example of this advanced analysis is shown below:

(a) Visualization of an alignment of two clusters. Peaks correspond to positions of polymorphisms in the alignment. Red bars correspond to positions of CDRs computed by IgBlast.

(b) Visualization of a summarized alignment of a cluster against similar clusters.

Multiple repertoire comparison

IgQUAST compares two or more repertoires constructed from the same Ig-Seq library and computes a set of metrics showing the similarity of the input repertoires.

General metrics for all compared repertoires

Metric name	Description
# ideal groups	Number of clusters that are identical in all input repertoires, i.e. have similar sequences and are composed of the same set of reads.
# trusted groups	Number of groups where clusters from different repertoires have similar sequences and share >90% of reads. Such groups occur when a cluster from one repertoire is represented by one big and several small clusters in other repertoires. These groups can result from inaccurate error correction in one of the input repertoires.
# untrusted groups	Number of groups where clusters from different repertoires have non-similar sequences and share >90% of reads. The existence of such groups indicates that at least one cluster sequence in the untrusted group is erroneous and should be reconstructed.
# non-trivial ideal/trusted/untrusted groups	Number of ideal/trusted/untrusted groups where at least one cluster is not a singleton.
# big untrusted groups	Number of untrusted groups formed only by big clusters (clusters of size at least the value specified with the option --isol-min-size), i.e. big clusters from different repertoires that have non-similar sequences and share >90% of reads.
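The distinction between ideal, trusted and untrusted groups can be sketched for the simplest case of one cluster from each of two repertoires. This is a hedged illustration, not IgQUAST's algorithm: it assumes that "share >90% of reads" is measured as the intersection of the two read sets divided by their union, and that sequence similarity has already been decided elsewhere and is passed in as a boolean. The function names are hypothetical.

```python
def shared_read_fraction(reads_a, reads_b):
    """Fraction of reads shared by two clusters (intersection over union; an assumption)."""
    a, b = set(reads_a), set(reads_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_pair(reads_a, reads_b, sequences_similar, min_shared=0.9):
    """Label a pair of clusters taken from two different repertoires."""
    a, b = set(reads_a), set(reads_b)
    # Ideal: similar sequences built from exactly the same reads.
    if sequences_similar and a and a == b:
        return "ideal"
    # Trusted / untrusted: more than 90% of reads shared, split by sequence similarity.
    if shared_read_fraction(a, b) > min_shared:
        return "trusted" if sequences_similar else "untrusted"
    return "unclassified"


# Toy clusters: the second one misses a single read out of 20.
cluster_x = {f"r{i}" for i in range(20)}
cluster_y = {f"r{i}" for i in range(19)}
label_similar = classify_pair(cluster_x, cluster_y, sequences_similar=True)
label_different = classify_pair(cluster_x, cluster_y, sequences_similar=False)
```

Real groups may contain more than two clusters (one big cluster matched by one big and several small ones), which this pairwise sketch does not cover.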

Individual metrics for each repertoire

Metric name	Description
# isolated clusters	Number of clusters that are present in only one input repertoire and have no similar clusters in other repertoires.
# short clusters	Number of clusters with sequence length <300 nt.
# short isolated clusters	Number of isolated clusters with sequence length <300 nt.
min/avg/max cluster size	Minimal/average/maximal size of isolated clusters.
# trivial isolated clusters	Number of isolated singletons.

IgQUAST also reports plots showing comparative histograms of the cluster size and antibody length distributions for the input repertoires.

Quality assessment against an ideal repertoire

IgQUAST evaluates a repertoire with respect to an ideal repertoire (e.g., in the case of a simulated repertoire) in terms of sensitivity (a measure of how well the ideal clusters are represented by the constructed clusters) and specificity (the error rate of incorrectly merged clusters of the ideal repertoire):

Metric name	Description
# original clusters	Number of clusters in ideal repertoire.
# not merged	Number of non-trivial clusters in the original repertoire that are split into multiple clusters in the constructed repertoire. For a correctly constructed repertoire, the value of this metric is 0.
# not merged (not trivial + singletons)	Number of not merged clusters that are formed by a single non-trivial cluster and a number of singletons in the constructed repertoire.
# original singletons	Number of singletons in ideal repertoire.
max original cluster	Size of maximal cluster from ideal repertoire.
# constructed clusters	Number of constructed clusters.
# errors	Number of constructed clusters that contain reads from more than one original cluster. For a correctly constructed repertoire, the value of this metric is 0.
# constructed singletons	Number of constructed singleton clusters.
max constructed cluster	Size of maximal constructed cluster.
avg fill-in	The fill-in of an original cluster C is the ratio of the size of its largest non-erroneous subcluster in the constructed repertoire to the size of C. This metric is the average fill-in over the original clusters.
fill-in of max cluster	Maximal cluster of the original repertoire corresponds to the most frequent monoclonal antibodies. This metric is equal to the fill-in of the maximal original cluster.
correct singletons (%)	Some singletons in the constructed repertoire can be false due to insufficient error correction. This metric shows percentage of true singletons in the constructed repertoire.
used reads (%)	Percentage of reads used in the repertoire reconstruction. This metric shows how well the reads have been utilized for reconstructing repertoires.
# lost clusters	Number of original clusters that were completely lost in the constructed repertoire.
lost clusters size (%)	Total size of the lost clusters as a percentage of the full size of the original repertoire.
min/avg/max percentage of identity (%)	Minimal/average/maximal percentage of identity between sequences of clusters from the original repertoire and the corresponding clusters in the constructed repertoire. The corresponding constructed cluster is the one that shares the most reads with the original cluster.
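Several of the metrics in the table above can be derived from two read-to-cluster maps, one for the ideal (original) repertoire and one for the constructed repertoire. The Python sketch below follows the definitions in the table, but it is an illustration under assumptions rather than IgQUAST's implementation: both repertoires are given as {read name: cluster id} dictionaries, a "non-erroneous subcluster" is taken to be a constructed cluster whose reads all come from one original cluster, and the function names are hypothetical.

```python
from collections import defaultdict


def group_reads(read_to_cluster):
    """Invert {read: cluster_id} into {cluster_id: set_of_reads}."""
    clusters = defaultdict(set)
    for read, cid in read_to_cluster.items():
        clusters[cid].add(read)
    return dict(clusters)


def ideal_metrics(original, constructed):
    """Compare a constructed repertoire with an ideal (original) one.

    Both arguments are {read_name: cluster_id} maps (hypothetical in-memory RCM form).
    """
    orig_clusters = group_reads(original)
    cons_clusters = group_reads(constructed)

    # A constructed cluster is erroneous if its reads come from >1 original cluster.
    erroneous = {
        cid for cid, reads in cons_clusters.items()
        if len({original[r] for r in reads if r in original}) > 1
    }

    lost = lost_size = not_merged = 0
    fill_ins = []
    for reads in orig_clusters.values():
        # Constructed clusters that received at least one read of this original cluster.
        hits = {constructed[r] for r in reads if r in constructed}
        if not hits:
            lost += 1
            lost_size += len(reads)
        if len(reads) > 1 and len(hits) > 1:
            not_merged += 1  # non-trivial original cluster split into several clusters
        best = max(
            (len(cons_clusters[c] & reads) for c in hits if c not in erroneous),
            default=0,
        )
        fill_ins.append(best / len(reads))

    used = sum(1 for r in original if r in constructed)
    return {
        "# errors": len(erroneous),
        "# not merged": not_merged,
        "# lost clusters": lost,
        "lost clusters size (%)": 100.0 * lost_size / len(original),
        "used reads (%)": 100.0 * used / len(original),
        "avg fill-in": sum(fill_ins) / len(fill_ins),
    }


# Toy ideal repertoire: clusters A (3 reads), B (2 reads), C (1 read).
toy_original = {"r1": "A", "r2": "A", "r3": "A", "r4": "B", "r5": "B", "r6": "C"}
# Toy constructed repertoire: A is split in two, B is intact, r6 is unused.
toy_constructed = {"r1": "1", "r2": "1", "r3": "2", "r4": "3", "r5": "3"}
toy_result = ideal_metrics(toy_original, toy_constructed)
```

In this toy case cluster A is counted as not merged, cluster C is lost, and no constructed cluster mixes reads from different original clusters.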