Calculating Genetic Distances between Protein Sequences

Sequence alignment is one of the most active areas in bioinformatics. An important step in the analysis of a nucleotide or amino acid sequence alignment is the calculation of the matrix of genetic (evolutionary) distances between pairs of sequences. Evolutionary changes act on DNA sequences, and these changes accumulate over time. Two sequences derived from a common ancestor evolve independently and diverge; the measure of this divergence is termed the genetic distance, and it plays many roles in sequence analysis in bioinformatics and molecular biology. Under a roughly constant rate of change (the molecular clock assumption), genetic distance is approximately proportional to the elapsed time, and it measures how dissimilar two sequences are. Estimating evolutionary distances between biological sequences is important for constructing phylogenetic trees and for understanding gene and protein evolution. In an evolutionary tree, the sequences are the nodes and the branch lengths denote the distances between them, so the genetic divergence among a set of sequences determines the tree that relates them.

Various computational approaches are available to predict the structural and functional properties of biological sequences. Two types of sequence alignment are generally performed: pairwise sequence alignment (PSA) and multiple sequence alignment (MSA). Pairwise alignment compares two sequences at a time, while multiple sequence alignment aligns more than two related sequences. Multiple sequence alignment is generally considered more useful than pairwise alignment because it aligns multiple members of a sequence family and gives access to more biological information. This makes multiple sequence alignment a prerequisite for genomic analyses such as the identification of conserved regions and functional motifs and ancestral sequence profiling. Amino acid alignments are often prioritized over nucleotide-level alignments, because proteins are the functional biological molecules, carry structural and functional information, and connect directly to structural biology. From a multidisciplinary perspective, the literature describes multiple sequence alignment as an open window for studying the evolutionary relationships, functional properties, and structure of biological macromolecules in a concise manner. Scoring matrices are used in sequence alignment to quantify similarities and dissimilarities between the biological sequences of interest. In amino acid sequence alignment, a similarity score is reported along with the identity score: the identity score counts positions with the same amino acid, while the similarity score also counts positions whose amino acids share similar physicochemical properties. PAM and BLOSUM are the substitution matrices commonly employed for protein sequence alignment.
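The difference between identity and similarity can be sketched in a few lines of Python. This is a minimal illustration, not how PAM or BLOSUM scores are computed: the physicochemical grouping `GROUPS` is a hypothetical simplification chosen for the example, the function name is invented here, and columns containing a gap are simply skipped (an assumption).

```python
# Hypothetical physicochemical grouping, chosen only for illustration;
# real substitution matrices such as PAM or BLOSUM use log-odds scores instead.
GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(seq_a, seq_b):
    """Return (identity, similarity) fractions for two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped (assumption).
    Identical residues also count as similar.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = similar = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1
            similar += 1
        elif a in GROUP_OF and GROUP_OF[a] == GROUP_OF.get(b):
            similar += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return identical / compared, similar / compared


# Toy aligned pair (made-up sequences): K/R and E/D are similar, not identical.
identity, similarity = identity_and_similarity("MKDE-L", "MRDDAL")
print(identity, similarity)
```

In this toy pair, three of the five ungapped columns are identical and the remaining two fall in the same group, which is why the similarity fraction is higher than the identity fraction.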

In a multiple sequence alignment, a set of homologous amino acid or nucleotide sequences is arranged in a matrix in which each column represents homologous characters and, for proteins, functionally related positions in the structure. The performance of a multiple sequence alignment method is assessed by how well its output reproduces a reference alignment that reflects biological information such as conserved structures. The output can be compared with the reference alignment by the Sum-of-Pairs (SP) score, the fraction of aligned residue pairs in the reference alignment that are reproduced, or by the Total Column (TC) score, the fraction of reference columns that are reproduced. Alignments can also be compared after recoding their gaps. The ‘Symmetrized SP’ (SSP) recoding ignores gaps and treats them as blanks in the alignment; its name reflects its similarity to the existing SP method for comparing alignments. The seq recoding provides a simple record of gap information and treats all gaps in a sequence equally. Consecutive gap positions are taken to be the product of a single insertion or deletion mutation. The pos recoding incorporates positional information about where a gap occurs in a sequence, but without recording where in the phylogeny the gap arose. The evol recoding includes the information of the pos recoding and also records the indel event, located on a phylogenetic tree, that produced the gap. Variation in alignment quality has a significant impact on the genetic distance calculated between the same pairs of sequences.
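The SP and TC scores described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: both alignments contain the same sequences in the same row order, gaps are written as '-', and every reference column counts toward the TC score. The function names are invented for this example and do not come from any particular benchmarking tool.

```python
from itertools import combinations


def columns(alignment):
    """Recode each column as a tuple of residue indices (None marks a gap)."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("rows must be non-empty and of equal length")
    counters = [0] * len(alignment)  # next residue index for each sequence
    recoded = []
    for c in range(lengths.pop()):
        column = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                column.append(None)
            else:
                column.append(counters[s])
                counters[s] += 1
        recoded.append(tuple(column))
    return recoded


def residue_pairs(recoded):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for column in recoded:
        for (i, a), (j, b) in combinations(enumerate(column), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_score(test, reference):
    """Fraction of aligned residue pairs in the reference found in the test."""
    ref_pairs = residue_pairs(columns(reference))
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & residue_pairs(columns(test))) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of reference columns reproduced exactly in the test."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


# Toy alignments of the same two made-up sequences, differing in gap placement.
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
print(sp_score(test, reference), tc_score(test, reference))
```

Moving a single gap by one column changes one of the four reference residue pairs and two of the five reference columns, so the TC score drops faster than the SP score.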

 

Genetic Distance Estimation Using R  

Phylogenetic analyses on a genomic scale address issues ranging from the prediction of gene and protein function to organismal relationships. More broadly, computing the relatedness of organisms, either by gene-by-gene phylogenetic analyses or by phylogenomic whole-genome comparison, yields high-quality phylogenies. In bioinformatics, PHYLIP (Phylogeny Inference Package) is a free package of programs for inferring the phylogenies of living species and organisms.

  Calculating genetic distances between protein sequences  

In R, the genetic distances between protein sequences can be calculated with the “dist.alignment()” function in the SeqinR package. The dist.alignment() function takes a multiple alignment as input and calculates the genetic distance between each pair of proteins in that alignment.
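To make the idea of a pairwise distance matrix concrete, the following Python sketch computes an identity-based distance for every pair of sequences in a small alignment. It is an illustration, not the SeqinR implementation. It assumes a distance of the form sqrt(1 - identity), which is how the SeqinR documentation describes its identity option, and it assumes that columns containing a gap are skipped. The sequences and names in `toy_alignment` are made up for the example.

```python
import math
from itertools import combinations


def identity_distance(seq_a, seq_b):
    """sqrt(1 - fraction of identical ungapped columns) for two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":  # assumption: gapped columns are skipped
            continue
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return math.sqrt(1.0 - identical / compared)


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its distance."""
    return {
        (p, q): identity_distance(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }


# Hypothetical toy alignment (not real phosphoprotein sequences).
toy_alignment = {
    "seqA": "ACDEFGHIKL",
    "seqB": "ACDEFGHIMM",
    "seqC": "ACDQFGH-MM",
}
matrix = distance_matrix(toy_alignment)
closest = min(matrix, key=matrix.get)   # pair with the smallest distance
farthest = max(matrix, key=matrix.get)  # pair with the largest distance
print(closest, farthest)
```

The smallest and largest entries of such a matrix identify the most closely and most distantly related pairs, which is how a distance matrix for real proteins is usually read.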

  Consider, for example, four phosphoproteins identified by their UniProt accession numbers: P06747 (rabies virus), P0C569 (Mokola virus), O56773 (Lagos bat virus), and Q5VKP1 (Western Caucasian bat virus). Based on the resulting genetic distance matrix, the genetic distance between Lagos bat virus phosphoprotein (O56773) and Mokola virus phosphoprotein (P0C569) is about 0.414, the smallest value. Similarly, the genetic distance between Western Caucasian bat virus phosphoprotein (Q5VKP1) and Lagos bat virus phosphoprotein (O56773) is about 0.507, the largest value. The larger the genetic distance between two sequences, the more amino acid changes or indels have occurred since they shared a common ancestor, and the longer ago that common ancestor probably lived (Fig. 1).

Fig. 1. Multiple sequence alignment for finding genetic distance between proteins of multiple organisms