Structure and Functional Analysis of Protein Sequences via InterNet

Hypertext Manual

In this review I summarise the freely accessible InterNet client-server tools for:

Functional pattern recognition
Secondary structure and transmembrane segment prediction
Analysis of evolutionary information contained in multiple sequence alignments
Identifying proteins based on a set of peptide fragment weights produced by a specific digestion

The principles of the methods implemented on these servers are compared.

Credits: This manual was written by V.M. Morozov, CHEMICAL ENZYMOLOGY DIVISION, Chemistry Department, Moscow State University. Your comments and suggestions for improvement will be appreciated.

Last modified: January 1996

FUNCTIONAL PATTERN RECOGNITION

Sometimes an uncharacterised protein translated from genomic sequence is too distantly related to any protein of known function to be detected by overall sequence alignment. Its function can still be identified when its sequence contains regions resembling a known functional site or a conserved fragment of a protein family. Several databases containing information derived from the amino acid sequences of conserved and functional sites are accessible via InterNet for similarity searching.

The PROSITE server (e-mail server: PROSITE@EMBL-HEIDELBERG.DE) scans a query sequence for matches with patterns from the PROSITE database. A pattern is a description of the conserved amino acid residues of a site that allows for acceptable substitutions. PROSITE patterns are usually related to protein function and are specific to a protein family. The result of the analysis is a set of matches with a pattern or with part of a pattern.
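How a pattern is matched against a sequence can be illustrated with a small sketch. It assumes the usual PROSITE pattern notation (elements separated by "-", "x" for any residue, "[...]" for allowed and "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats) and translates it into a regular expression. The function names are hypothetical, and this is not the server's own scanning program.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)

def scan(sequence: str, pattern: str):
    """Return (1-based position, matched fragment) for every match, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]

# Example with the N-glycosylation site pattern.
hits = scan("MKNGSAANPTQNVTL", "N-{P}-[ST]-{P}.")
```

The lookahead in `scan` is a design choice that lets overlapping matches be reported, which a plain left-to-right search would skip.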

There are many functional or structural domains that cannot be detected using patterns because of their extreme sequence divergence, so other techniques are used to characterise and detect such domains. The BLOCKS server (e-mail server: BLOCKS@HOWARD.FHCRC.ORG) compares a query sequence with BLOCKS, a database of protein blocks. The blocks are ungapped multiple alignments of highly conserved protein regions, and the blocks in the BLOCKS database correspond to PROSITE patterns. The information from an alignment is represented in a position-specific scoring table, or "profile", in which each column of the alignment is converted to a column of a table giving the frequency of occurrence of each of the 20 amino acids. A profile characterises the conserved domain at every position, unlike a pattern, which represents only the invariant and highly conserved residues. The query sequence is aligned successively with all blocks of the database, and the matches with the best scores are chosen. Typically, a protein family has more than one conserved region and is represented in BLOCKS as a series of blocks separated by unaligned regions. If the query sequence shows high similarity to several blocks of one family, this is evidence that the sequence is related to that protein family. If the query is a nucleotide sequence, it is translated in all 6 frames for searching. In this case, alignments of multiple blocks with a single sequence may be detected in different frames because of frameshift errors in the sequence. More recently, the PROSITE server can also scan a protein sequence against profiles corresponding to PROSITE patterns.
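The idea of a position-specific scoring table can be sketched as follows. The sketch builds one column of scores per block position and slides the table along a query without gaps. The log-odds transformation against a uniform background and the pseudocount are assumptions made for illustration; the BLOCKS server's actual weighting scheme is not reproduced here, and all names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(block, pseudocount=1.0):
    """Turn an ungapped block (equal-length strings) into a list of score columns."""
    profile = []
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    for i in range(len(block[0])):
        counts = Counter(seq[i] for seq in block)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            # Log-odds against a uniform background of 1/20 (an assumption).
            column[aa] = math.log2(freq * len(AMINO_ACIDS))
        profile.append(column)
    return profile

def best_hit(profile, query):
    """Return (0-based start, score) of the best ungapped match, or None."""
    width = len(profile)
    best = None
    for start in range(len(query) - width + 1):
        score = sum(profile[i].get(query[start + i], 0.0) for i in range(width))
        if best is None or score > best[1]:
            best = (start, score)
    return best

profile = build_profile(["GKSTL", "GKSSL", "GKTTL"])
hit = best_hit(profile, "MAAGKSTLVV")
```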

The SBASE server (e-mail server: SBASE@ICGEB.TRIESTE.IT) searches for homologies throughout SBASE, a database of conserved protein domains, using the BLAST or the FASTA algorithm. If the query is a nucleotide sequence, it is processed in 6 frames by BLASTX. An SBASE entry is a cluster of protein fragments derived by the BLAST algorithm from the SWISS-PROT and PIR databases. Its functional meaning is assigned as defined in the published literature. SBASE covers a broader range of conserved protein domains than the PROSITE database. The search result consists of alignments with statistical parameters. A graphical representation of the homologies along the query sequence is also available.
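The translation step that precedes a six-frame search can be sketched as follows. The sketch covers only the translation with the standard genetic code and does not reproduce BLASTX itself. Frames are numbered 1 to 3 on the given strand and -1 to -3 on the reverse complement, which is a labelling convention assumed here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna: str) -> dict:
    """Return the six conceptual translations keyed by frame number."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames

frames = six_frames("ATGGCCAAATAA")
```

Each of the six protein strings would then be searched separately, which is why a frameshift error can split the matches of one gene across two frames.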

The BCM Search Launcher server (WWW server at HTTP://WWW.ABC.HU/BLAST.HTML) performs BLAST searches of protein or nucleic acid sequences against the integrated nonredundant protein database of ENTREZ and against sequence databases derived from it by clustering the protein sequences and matching them with annotated domains and sites. One may also perform BLAST and FASTA searches of protein sequences against the PIMA database of protein sequence patterns. Each pattern is derived from a multiple sequence alignment and expresses the conservation and variation inherent in the underlying set of aligned proteins. An increase in the sensitivity and selectivity of the database search is achieved by assigning higher weights to conserved positions.

The server DISCOVER@VILLAGE.NJIT.EDU detects conserved motifs in a set of related protein sequences and classifies a protein sequence into a family annotated in the PROSITE database, based on the occurrence of characteristic motifs. Several algorithms are used for classification.

SECONDARY STRUCTURE AND TRANSMEMBRANE SEGMENT PREDICTION

Amino acid sequence data accumulate more rapidly than structure data. The tertiary structure has been solved experimentally for about 3% of the determined protein sequences. Another 20% of proteins can be modelled by homology (InterNet facility: SWISS-MODEL, SWISSMOD@GGR.CO.UK). Structure prediction from the amino acid sequence is therefore needed. Whereas tertiary structure prediction is limited by the lack of homologous sequences with known tertiary structure, secondary structure prediction using aligned sequences without solved structures has passed the 70% level of average accuracy (Q), evaluated on the single-residue states helix, strand and loop. The accuracy of secondary structure content prediction has thus become comparable to that of circular dichroism spectroscopy. Comparing prediction methods is rather problematic because they were not tested on the same protein set. The choice of accuracy criterion also matters. The Q criterion (the percentage of residues whose secondary structure is predicted correctly) is the most widespread. A comparison of methods using statistical information, physicochemical parameters and multilayered neural networks showed the advantage of PHD (Profile neural network systems from HeiDelberg), particularly in β-structure prediction. In all algorithms the prediction accuracy varies among proteins and depends on the structural type. It is advisable to apply several algorithms and to consider the predictions on which they agree. However, it was observed that for 20% of the residues various algorithms produced the same but wrong predictions. This may suggest an upper bound on the accuracy of secondary structure predictions based on local information, since interactions between sites that are distant along the sequence may be important for secondary structure formation.
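The Q criterion is simple enough to state as code. The sketch below assumes that predicted and observed structures are given as equal-length strings over the three states H (helix), E (strand) and L (loop); the letter codes and function names are assumptions for illustration.

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose state is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

def per_state_accuracy(predicted: str, observed: str, states: str = "HEL") -> dict:
    """Percentage of observed residues of each state that are predicted correctly."""
    result = {}
    for state in states:
        pairs = [(p, o) for p, o in zip(predicted, observed) if o == state]
        if pairs:
            result[state] = 100.0 * sum(p == o for p, o in pairs) / len(pairs)
    return result

overall = q3("HHHHEEELLL", "HHHEEEELLL")
by_state = per_state_accuracy("HHHHEEELLL", "HHHEEEELLL")
```

Reporting the per-state values next to the overall Q shows why two methods with the same Q may still differ in how well they predict strands.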

The PHD server (PREDICTPROTEIN@EMBL-HEIDELBERG.DE) predicts protein secondary structure, residue solvent accessibility, and helical transmembrane regions. The predictions are produced by profile-based neural network systems. The use of evolutionary information in the form of a multiple sequence alignment improves prediction accuracy. If the query is a single sequence, the SWISS-PROT database is scanned for similar sequences and a multiple sequence alignment is produced. For water-soluble globular proteins with at least one known homologue, the PHD algorithm has an expected overall three-state accuracy of Q = 72.1 ± 9.3%, evaluated on 250 proteins in cross-validation experiments. The advantages of the method are improved prediction of beta-strands and of the lengths of structure segments. A comparison of pairs of structurally homologous proteins with divergent sequences reveals that considerable variations in the position and length of secondary structure segments are admissible within the same fold. It is therefore sufficient to predict the location of secondary structure segments. To evaluate such predictions, a measure of segment overlap that is somewhat insensitive to small variations in secondary structure assignments was suggested. For predictions by the PHD method the segment overlap measure is 72%, whereas it is 37% for random protein pairs and 90% for homologous protein pairs. The solvent accessibility prediction has an accuracy of 57% for 3 states of relative solvent accessibility and 24% for 10 states, with a correlation of 0.54 between experimentally observed and predicted relative solvent accessibility, evaluated in multiple cross-validation on a set of 238 globular proteins. The prediction of transmembrane segments has an accuracy of about 95% on 69 membrane proteins. The number of false positives for globular proteins is about 5%. False positives are observed more often for globular proteins with predominantly beta structure. In all predictions a reliability index is given for each residue along the sequence.
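A per-residue reliability index can be used to trade coverage for accuracy: only residues predicted with an index at or above a threshold are kept. The sketch below assumes an integer index where higher means more reliable; the scale, the threshold and the function name are assumptions for illustration.

```python
def accuracy_at_reliability(predicted, observed, reliability, threshold):
    """Return (coverage %, accuracy %) for residues with reliability >= threshold."""
    kept = [(p, o) for p, o, r in zip(predicted, observed, reliability)
            if r >= threshold]
    if not kept:
        return 0.0, 0.0
    coverage = 100.0 * len(kept) / len(predicted)
    accuracy = 100.0 * sum(p == o for p, o in kept) / len(kept)
    return coverage, accuracy

# Hypothetical prediction with one wrong residue that has a low reliability index.
coverage, accuracy = accuracy_at_reliability(
    "HHHHEEELLL", "HHHEEEELLL", [9, 8, 7, 2, 3, 8, 9, 6, 7, 9], threshold=5)
```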

The PSSP / NNSSP server (SERVICE@BCHS.UH.EDU) predicts secondary structure by the "nearest-neighbour" method (NNSSP). The secondary structure of the most homologous protein region from the structure database is assigned to the central residue of the input window. Homology is evaluated by a scoring system that combines a sequence similarity matrix with a local structural environment scoring scheme. Using multiple sequence alignments, the method achieves an overall three-state accuracy of 72.4 ± 8.8% (67.7% on single sequences), tested on the same protein set as the PHD method. The NNSSP method is better at predicting alpha-helices, and the PHD method at predicting beta-strands. In the NNSSP method the top 43% of predictions have 81% accuracy. In the PHD method the 62.5% most reliable predictions have 82.9% accuracy. These results show that most proteins can be predicted with considerable accuracy.
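The nearest-neighbour idea can be illustrated with a toy sketch. For each window of the query, the most similar window in a small database of sequences with known structure is found, and the state of its central residue is assigned to the central residue of the query window. The sketch swaps in a plain identity count for the combined similarity matrix and environment scoring of NNSSP, and uses a single neighbour; the window size and all names are hypothetical.

```python
def window_identity(a: str, b: str, pad: str = "-") -> int:
    """Number of identical residues at equal positions, ignoring padding."""
    return sum(x == y and x != pad for x, y in zip(a, b))

def predict_nearest_neighbour(query, database, half_width=3, pad="-"):
    """database: non-empty list of (sequence, structure) pairs of equal length."""
    size = 2 * half_width + 1
    # Collect every database window together with the state of its central residue.
    windows = []
    for seq, structure in database:
        padded = pad * half_width + seq + pad * half_width
        for i in range(len(seq)):
            windows.append((padded[i:i + size], structure[i]))
    padded_query = pad * half_width + query + pad * half_width
    prediction = []
    for i in range(len(query)):
        window = padded_query[i:i + size]
        _, state = max(windows, key=lambda item: window_identity(window, item[0]))
        prediction.append(state)
    return "".join(prediction)

toy_database = [("AEELLKKA", "HHHHHHHH"), ("VTVKVYV", "EEEEEEE")]
predicted = predict_nearest_neighbour("AEELLKKA", toy_database)
```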

The SOPM server (DELEAGE@IBCP.FR) predicts secondary structure by the SOPM method. In the first step, sub-databases are built from the protein secondary structure database based on (i) homology with the query sequence or (ii) structural class prediction from amino acid content, if statistically significant homology is not detected. In the second step, secondary structure is predicted for each protein of the sub-database using a predictive algorithm based on sequence similarity. The third step is to iteratively determine the predictive parameters that optimise the prediction quality on the whole sub-database. The last step is to apply the final parameters to the query sequence. The method predicts with an accuracy of 69%. In addition to the SOPM prediction, the server provides a consensus secondary structure prediction from SOPM and three other methods. A query sequence may also be analysed for PROSITE sites by the PATTERN program.
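A consensus of several predictions can be formed in many ways. The sketch below uses a per-residue majority vote and falls back to a default state when the vote is tied. Both the voting rule and the tie state are assumptions for illustration, not the rule used by the server.

```python
from collections import Counter

def consensus(predictions, tie_state="C"):
    """Majority vote per residue over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append(tie_state)   # no single winner in this column
        else:
            result.append(ranked[0][0])
    return "".join(result)

# Four hypothetical predictions for a five-residue fragment.
combined = consensus(["HHHEC", "HHEEC", "HCEEC", "CHECC"])
```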

The NNPREDICT server (NNPREDICT@CELESTE.UCSF.EDU) predicts secondary structure with a neural network. The network takes as input a sequence of one-letter amino acid codes and the hydrophobicity periodicity (vector moments). The prediction accuracy is 64%. If the tertiary class of the protein (none, alpha, beta, alpha-beta, or alpha/beta) is given, the accuracy rises to 79% for alpha proteins and 70% for beta proteins.
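Hydrophobicity periodicity is commonly expressed as a hydrophobic moment: the hydrophobicity values in a window are summed as vectors that rotate by a fixed angle per residue (about 100 degrees for an alpha-helix). The sketch below is an illustration under assumptions: it uses the Kyte-Doolittle scale and normalises by window length, and it does not claim to reproduce the exact input encoding of the NNPREDICT network.

```python
import math

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(window: str, angle_degrees: float = 100.0) -> float:
    """Mean hydrophobic moment of a window for a given rotation angle per residue."""
    if not window:
        return 0.0
    delta = math.radians(angle_degrees)
    sin_sum = sum(KYTE_DOOLITTLE.get(aa, 0.0) * math.sin(i * delta)
                  for i, aa in enumerate(window))
    cos_sum = sum(KYTE_DOOLITTLE.get(aa, 0.0) * math.cos(i * delta)
                  for i, aa in enumerate(window))
    return math.hypot(sin_sum, cos_sum) / len(window)

helix_like = hydrophobic_moment("LKKLLKKLLKKL")         # amphipathic design
strand_like = hydrophobic_moment("LKLKLKLK", 180.0)     # alternating pattern
```

A high moment at 100 degrees suggests an amphipathic helix, while a high moment near 180 degrees suggests an alternating, strand-like pattern.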

The PSA (psa-request@darwin.bu.edu) "system estimates the tertiary structural class and secondary structure of a protein from its amino acid sequence. The PSA system is intended for nonhomologous sequences that are unlike any other sequences in the sequence databanks. The PSA system is currently based on 15 families containing 209 probabilistic Discrete State-space Models (DSMs), which are intended to represent the major folding classes in J. Richardson's taxonomy of tertiary structural domains".

The SCOP system searches for sequence homology against the Protein Data Bank (PDB) or a representative SCOP database.

The GENEBEE server (SERVE@INDY.GENEBEE.MSU.SU) predicts secondary structure by two methods. The first is the Garnier-Robson algorithm as modified by Brodsky et al. The second predicts the secondary structure, polarity and buriedness of residues based on homology with proteins from the structure database. The homology is evaluated similarly to the NNSSP method. The probabilities of each secondary structure state (alpha-helix, beta-strand, coil) are assigned using parameters optimised on a training set.

SSCP: Secondary structural content prediction from amino acid composition

SSPRED: a program that predicts secondary structures of proteins

The TMpred server predicts "putative transmembrane domains in proteins and also speculates on the possible orientation of these segments. For its scoring, it uses a combination of multiple weight-matrices that have been extracted from a statistical analysis of TMbase, a collection of all annotated transmembrane proteins present in SwissProt."

The TMAP server (tmap@embl-heidelberg.de) identifies helical transmembrane segments and predicts the membrane topology (intra- or extracellular sidedness) based on a multiple sequence alignment. If the query is a single amino acid sequence, the server performs a BLITZ search of the SWISS-PROT database and a multiple alignment by the CLUSTALW algorithm.

The COILS program predicts (two-stranded) coiled-coil regions in proteins. The program compares a sequence to a database of known parallel two-stranded coiled coils and derives a similarity score. By comparing this score to the distribution of scores in globular and coiled-coil proteins, the program then calculates the probability that the sequence will adopt a coiled-coil conformation.
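The step from a similarity score to a coiled-coil probability can be sketched as a comparison of two score distributions. The sketch assumes that both distributions are modelled as Gaussians and that a prior ratio of globular to coiled-coil residues is supplied. Every numerical value in the example is hypothetical, and the function names are not taken from the COILS program.

```python
import math

def gaussian(x: float, mean: float, sd: float) -> float:
    """Probability density of a normal distribution."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def coiled_coil_probability(score, cc_mean, cc_sd, glob_mean, glob_sd,
                            prior_ratio=1.0):
    """Probability that a score comes from the coiled-coil distribution.

    prior_ratio is the assumed ratio of globular to coiled-coil residues.
    """
    g_cc = gaussian(score, cc_mean, cc_sd)
    g_glob = gaussian(score, glob_mean, glob_sd)
    denominator = prior_ratio * g_glob + g_cc
    if denominator == 0.0:
        return 0.0   # score is far outside both distributions
    return g_cc / denominator

# Hypothetical distribution parameters, chosen only for illustration.
p_mid = coiled_coil_probability(1.5, cc_mean=2.0, cc_sd=0.5,
                                glob_mean=1.0, glob_sd=0.5)
p_high = coiled_coil_probability(2.0, cc_mean=2.0, cc_sd=0.5,
                                 glob_mean=1.0, glob_sd=0.5)
```

Raising `prior_ratio` lowers the probability for the same score, which reflects the expectation that coiled coils are rarer than globular regions.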

EVOLUTION INFORMATION ANALYSIS

AMAS - Analysis of Protein Multiple Sequence Alignments

The CBRG server (CBRG@INF.ETHZ.CH) performs a multiple alignment of sequences and analyses it. PAM distances are calculated and visualised as phylogenetic trees and/or graphs. Variation indices are worked out for each alignment position. Surface and interior residues of the folded protein structure are predicted by the ETH method on the basis of the multiple alignment. The algorithm assumes that all members of the multiple alignment have the same tertiary structure. Surface positions are identified from hydrophilic variation. Interior positions are recognised from concurrent hydrophobic conservation and variation. The accuracy of the prediction varies from 80 to 100% and depends on the alignment parameters. The surface and interior assignments can be used for the prediction of coil regions, helices and strands. Breaks in the secondary structure of a protein family are predicted from motifs in the multiple alignment (e.g. a large number of prolines or deletions) that tend to lie between secondary structure elements.
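A per-column variation index and a crude surface/interior assignment can be sketched as follows. Shannon entropy is used as the variation index, and a column is labelled "surface" when it varies among hydrophilic residues only and "interior" when it contains hydrophobic residues only. Both the entropy measure and the two residue groupings are assumptions for illustration; this is a simplified stand-in, not the ETH method.

```python
import math
from collections import Counter

HYDROPHILIC = set("DEKRNQST")   # assumed grouping
HYDROPHOBIC = set("AVLIMFWC")   # assumed grouping

def column_entropy(column) -> float:
    """Shannon entropy (bits) of the residues in one alignment column, gaps ignored."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())

def classify_columns(alignment):
    """Label each column of an alignment (equal-length strings)."""
    labels = []
    for column in zip(*alignment):
        residues = {c for c in column if c != "-"}
        if len(residues) > 1 and residues <= HYDROPHILIC:
            labels.append("surface")      # hydrophilic variation
        elif residues and residues <= HYDROPHOBIC:
            labels.append("interior")     # hydrophobic conservation or variation
        else:
            labels.append("unassigned")
    return labels

alignment = ["KLG", "ELG", "DIG", "RVP"]
labels = classify_columns(alignment)
entropies = [column_entropy(col) for col in zip(*alignment)]
```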

A multiple alignment may be produced by the CLUSTALW algorithm. The SERVE@INDY.GENEBEE.MSU.SU server (also available as a WWW server) produces multiple alignments and constructs phylogenetic trees using algorithms from the GENEBEE package. The BLOCKS server also constructs protein blocks (e-mail server: BLOCKMAKER@SPARKY.FHCRC.ORG), which makes it possible to identify conserved regions in a protein family. The blocks are produced from a set of unaligned sequences by two different algorithms.

RECONSTRUCTION OF AMINO ACID SEQUENCES FROM DIGESTION PRODUCTS

The MassSearch tool (CBRG@INF.ETHZ.CH) identifies known proteins from a set of fragment weights obtained after proteolytic or chemical digestion. It searches the SWISS-PROT or EMBL databases for matching sequences. Chemical modifications of amino acids that change the fragment weights may be taken into account. The precision of the search increases when more than one digestion with different enzymes is available. The method can also identify two proteins digested together.
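The principle of identification from fragment weights can be sketched as an in-silico digestion followed by mass matching. The sketch assumes a tryptic digest (cleavage after K or R, but not before P), monoisotopic residue masses, unmodified peptides and a fixed mass tolerance. The scoring by a simple count of matched weights and all function names are hypothetical and much cruder than the MassSearch algorithm.

```python
import re

# Monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056

def tryptic_peptides(sequence: str):
    """Cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def peptide_mass(peptide: str) -> float:
    """Mass of an unmodified peptide; raises KeyError for unknown residues."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def count_matches(observed, sequence, tolerance=0.5):
    """Number of observed weights explained by some peptide of the sequence."""
    theoretical = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)

def rank_proteins(observed, database, tolerance=0.5):
    """database: dict of name -> sequence; best-matching proteins first."""
    scores = {name: count_matches(observed, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

toy_database = {"p1": "AGKPLRMCK", "p2": "WWWKYYYR"}
observed = [peptide_mass("MCK"), peptide_mass("AGKPLR")]
ranking = rank_proteins(observed, toy_database)
```

A second digest with a different enzyme would contribute an independent set of weights, which is why combining digests sharpens the ranking.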

AA Analysis - Protein Identification in SwissProt and PIR using Amino Acid Composition

"AACompIdent is a tool which allows the identification of a protein from its amino acid composition. It searches SWISS-PROT for proteins, whose amino acid compositions are closest to the amino acid composition given."
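Searching by amino acid composition amounts to ranking database proteins by the distance between composition vectors. The sketch below assumes compositions expressed in percent over the 20 standard amino acids and a sum of squared differences as the distance; the actual scoring used by AACompIdent is not reproduced, and the names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence: str) -> dict:
    """Percentage of each of the 20 standard amino acids in a sequence."""
    counts = Counter(sequence)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

def composition_distance(a: dict, b: dict) -> float:
    """Sum of squared differences between two compositions."""
    return sum((a[aa] - b[aa]) ** 2 for aa in AMINO_ACIDS)

def closest_proteins(query_composition, database, top=3):
    """database: dict of name -> sequence; returns the names of the closest entries."""
    ranked = sorted(
        database.items(),
        key=lambda item: composition_distance(query_composition,
                                              composition(item[1])))
    return [name for name, _ in ranked[:top]]

toy_database = {"a": "AAAAGGGG", "b": "KKKKEEEE", "c": "AAAGGGGK"}
best = closest_proteins(composition("GGGGAAAA"), toy_database, top=2)
```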