Computational-based Approaches to Protein Function Prediction

The most effective way of inferring protein structure and function from sequence alone is to use computational methods. These rely on direct comparison of the uncharacterized protein sequence with the sequence of a protein whose structure is known.


Over the past 40 years, the number of gene and protein sequences held in public databases has expanded rapidly, and this growth is expected to continue as the field advances. Knowledge of protein sequences enables the study of the evolutionary history and relationships among organisms, called phylogenetics, and this contributes significantly to the field of evolutionary biology.

Knowledge of protein function also helps to explain the biological behavior of organisms. However, protein function is difficult to determine experimentally and requires both in vitro and in vivo approaches. Currently, far more protein sequences are known than have had a function assigned to them. The process of assigning function to sequence is called annotation.

Can existing proteins be used to determine the function of an unknown protein?

In some cases, function can be inferred from sequence. This is generally possible if two proteins, one of unknown function and one of characterized function, share a sequence identity of approximately 40% or more.

If the sequence identity includes residues that are directly implicated in biochemical function (for example, those in the active site of an enzyme) then a highly probable prediction of function can be made.
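As a rough illustration of how percentage identity is calculated, the sketch below counts identical residues across two sequences that have already been aligned, with "-" marking a gap. The function names and the choice to ignore gapped columns are assumptions made for this example; published tools differ in how they define the denominator, so values may not match theirs exactly.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gapped columns are skipped (an assumption of this sketch)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def meets_identity_threshold(aligned_a: str, aligned_b: str,
                             threshold: float = 40.0) -> bool:
    """True if the pair reaches the approximate 40% rule of thumb."""
    return percent_identity(aligned_a, aligned_b) >= threshold


if __name__ == "__main__":
    # Toy aligned pair, invented for illustration only.
    print(percent_identity("ACDE-G", "ACDQFG"))
```

A pair that passes the threshold is only a candidate for functional transfer; the residues involved in biochemical function still need to be checked.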

Sequence comparison cannot be relied upon when two proteins are distantly related in evolutionary terms, even if they are structurally and functionally identical. Moreover, sharing a biochemical function does not necessarily mean that the proteins play similar roles in the cell or organism.

How can homologous sequences be used to identify protein function?

Unknown genes from newly sequenced genomes can be identified by using computer programs to search databases of known gene and protein sequences for similar sequences, called homologous sequences. Two notable search programs are BLAST and FASTA.

Homologous sequences are identified through alignment, which is the process of lining up the residues of two sequences. Alignment involves searching the compared sequences for continuous stretches of identical or similar residues.

It is common for related sequences to differ at specific positions because of evolution. These differences arise from substitutions, insertions or deletions that could have occurred in either sequence and that caused the amino acid sequences to diverge.
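To make the idea of alignment concrete, the following is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach), in which insertions and deletions appear as "-" gap characters. It is not how BLAST or FASTA work internally, since those programs use faster heuristic searches. The scoring values (match, mismatch, gap) are arbitrary assumptions; real tools typically use substitution matrices and more elaborate gap penalties.

```python
def global_align(seq_a: str, seq_b: str,
                 match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (aligned_a, aligned_b, score) for a simple global alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Fill the table with the best score for each pair of prefixes.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in seq_b
                              score[i][j - 1] + gap)      # gap in seq_a

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    # Toy sequences, invented for illustration: one residue deleted in the second.
    for line in global_align("ACDEFG", "ACDFG")[:2]:
        print(line)
```

In the toy example the single missing residue is represented by one gap, which is how an insertion or deletion event shows up in an alignment.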

The similarity between the amino acids across a certain region of sequence can be used to determine the degree of conservation in that region. The search programs also report a quantitative measure of significance called the E-value, which is the number of matches with an equal or better score that would be expected to occur by chance in a database of that size. A smaller E-value indicates a more significant match. It suggests that the observed similarity is not due to chance but reflects descent from a common ancestor.
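The relationship between an alignment score and its E-value can be sketched with the standard bit-score formula from BLAST statistics, E = m × n × 2^(−S'), where S' is the bit score, m the query length and n the total database length. The function name is hypothetical, and the sketch assumes raw lengths; actual BLAST output uses adjusted "effective" lengths, so the numbers will differ slightly.

```python
def expected_chance_hits(bit_score: float, query_len: int,
                         database_len: int) -> float:
    """E-value estimate: E = m * n * 2 ** (-bit_score).

    query_len (m) and database_len (n) are raw residue counts here;
    this ignores the effective-length corrections real programs apply.
    """
    return query_len * database_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers: a 300-residue query against 10 million residues.
    for bits in (30, 50, 100):
        print(bits, expected_chance_hits(bits, 300, 10_000_000))
```

Because the score appears in the exponent, each additional bit halves the number of matches expected by chance, which is why high-scoring alignments have very small E-values.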

The alignment process can be extended to give a multiple sequence alignment, in which three or more sequences are aligned. This is often necessary because pairwise alignment, which compares only two sequences, is not always reliable.

Pairwise alignment depends on the scoring function used by the search program, and it does not allow the user to determine which sequence is the most similar when more than one homolog is identified with equivalent scores.
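Once several sequences have been aligned, the degree of conservation at each position can be estimated column by column. The sketch below assumes the multiple sequence alignment has already been produced by another program and is supplied as equal-length strings with "-" for gaps. The scoring rule (fraction of sequences sharing the most common residue) is one simple choice among many, and the function name is hypothetical.

```python
from collections import Counter
from typing import List


def column_conservation(aligned_seqs: List[str]) -> List[float]:
    """Fraction of sequences sharing the most common residue in each column."""
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r != "-"]  # gaps never count as conserved
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    msa = ["ACDE", "ACDQ", "AC-E", "GCDE"]
    print(column_conservation(msa))
```

Columns scoring close to 1.0 are candidates for functionally important positions, which a single pairwise comparison cannot reveal.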

Alternatively, local alignments can identify amino-acid sequence patterns, called motifs, that are responsible for the function of the proteins.
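Once a motif has been characterized, it is commonly written as a short pattern and searched for directly. The sketch below uses a regular expression rather than local alignment, so it illustrates motif scanning, not motif discovery. The pattern shown is the N-glycosylation site consensus N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline); the test sequence is invented.

```python
import re
from typing import List, Tuple

# N-{P}-[ST]-{P} written as a regular expression. The lookahead lets
# overlapping occurrences be reported as well.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_motif(sequence: str,
               pattern: "re.Pattern[str]" = N_GLYCOSYLATION) -> List[Tuple[int, str]]:
    """Return (0-based start position, matched residues) for each occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy sequence, invented for illustration only.
    print(find_motif("MKNATSNPSAQNGTW"))
```

A pattern match on its own is weak evidence, because short motifs occur frequently by chance; it is most informative alongside alignment results.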

Identifying function from higher-order folding

Structural comparison can identify conserved regions that are not apparent from the primary sequence. Because it compares secondary and tertiary structures, it also allows proteins with similar sequences but very different structures to be excluded as candidates.

Structural comparison also identifies stable, evolutionarily conserved parts of a protein, called domains. For example, small functional protein domains, such as SH2, SH3, PH, HTH and EF-hand, can be identified in this way.

Secondary structures, such as alpha helices and beta sheets, can be predicted because the amino acids that comprise them can adopt only a limited number of backbone angles, owing to the small space available to them in these structures.

The restriction imposed by the limited space available to each amino acid is referred to as steric constraint. It allows characteristic features of the amino acid sequences that form these structures to be identified. For example, the sequences that fold to produce an alpha helix rarely contain the amino acids glycine and proline, as these residues destabilise the structure. Proline is too rigid to adopt the correct angle, whilst glycine is highly flexible and is more energetically stable in an unstructured portion of a protein.

Tertiary folds can be recognized using algorithms similar to those employed in secondary structure prediction. These methods include:

Statistical methods, based on databases of known protein structures, from which structural propensities are calculated for all amino acids,
Physico-chemical methods, which use knowledge of amino acid characteristics such as their favorable angles, solubility and energies, or
Hybrid methods, which combine the two approaches above.

In the absence of appreciable sequence identity, profile-based threading algorithms can be used. In this method, the sequence of unknown structure and function is fitted to each known protein domain fold in turn and scored for its compatibility with that fold.
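The statistical approach can be illustrated with a small sketch that derives structural propensities from sequences whose secondary structure is already known. Each record pairs a sequence with a per-residue state string ("H" for helix, anything else otherwise); this input format, the function name and the toy data are all assumptions for illustration. The propensity of a residue is the fraction of its occurrences found in helix, divided by the fraction of all residues found in helix, so values above 1 indicate a preference for helices.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def structural_propensities(records: Iterable[Tuple[str, str]],
                            state: str = "H") -> Dict[str, float]:
    """Propensity of each amino acid for `state`, from labelled sequences."""
    total = Counter()      # occurrences of each residue overall
    in_state = Counter()   # occurrences of each residue within `state`
    for sequence, states in records:
        if len(sequence) != len(states):
            raise ValueError("sequence and state string must be the same length")
        for residue, label in zip(sequence, states):
            total[residue] += 1
            if label == state:
                in_state[residue] += 1

    n_total = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        return {}
    background = n_state / n_total  # fraction of all residues in `state`
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}


if __name__ == "__main__":
    # Toy, invented data: not derived from any real structure database.
    toy = [("AAGP", "HHCC"), ("ALEG", "HHHC")]
    print(structural_propensities(toy))
```

With a realistically large set of known structures, a table like this would be expected to show low helix propensities for glycine and proline, in line with the steric arguments above.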

Supplementary information can be used to assess the accuracy of alignment-based methods. This includes hydrophobicity profiles, which show which residues prefer to be buried away from the aqueous environment, and Ramachandran plots, which show the energetically allowed backbone angles for amino acids in a structure. Together, these are used to identify the most suitable protein of known function, from which the function of the unknown protein is inferred.
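A hydrophobicity profile is typically computed by averaging a per-residue hydropathy value over a sliding window. The sketch below uses the widely cited Kyte-Doolittle scale; the window size of 9 and the function name are assumptions chosen for illustration, and other scales or window sizes may suit a given protein better.

```python
from typing import List

# Kyte-Doolittle hydropathy values (positive = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> List[float]:
    """Mean hydropathy for every window of `window` consecutive residues."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


if __name__ == "__main__":
    # Toy sequence, invented: a hydrophobic stretch flanked by charged residues.
    profile = hydropathy_profile("KKDEELLIVVALLIAKKDEE", window=9)
    print([round(v, 2) for v in profile])
```

Peaks in the profile mark stretches likely to be buried or membrane-embedded, which can then be compared against the candidate fold.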

Sources

https://academic.oup.com/bib/article/7/3/225/326173
wp.nyu.edu/…/Leture-3-ch04-2014.ppt_.pdf
https://medcraveonline.com/MOJPB/MOJPB-07-00233.pdf


Hidaya Aliouche