PREDICTING SECONDARY STRUCTURES OF PROTEINS: A Domain Theory

An important problem in molecular biology is that of predicting "protein
secondary structure."  Briefly, the task is:
 
  for each element in a protein sequence (drawn from a 20-letter alphabet), 
      assign it to one of three classes (alpha helix, beta sheet, or coil).  

Qian and Sejnowski [3] produced a set of examples 
protein-secondary-structure.[train,test] (this directory).  Originally via 
anonymous ftp from spice.cs.cmu.edu; cd to /afs/cs/project/connect/bench.

We present here an imperfect "state-based" domain theory for this problem,
which closely implements the algorithm of Chou and Fasman [1].  Further
details can be found in [2] (available via anonymous ftp from
steves.cs.wisc.edu (128.105.2.201) in pub/maclin.fskbann.ps.Z).

Should you use this dataset, please reference [2], and kindly inform us
of any interesting results you obtain.

					Sincerely,

					Rich Maclin (maclin@cs.wisc.edu)
					Jude Shavlik (shavlik@cs.wisc.edu)

					Computer Sciences Department
					University of Wisconsin
					1210 W. Dayton Street
					Madison, WI  53706


                         The Chou-Fasman Domain Theory
			 -----------------------------

The Chou-Fasman algorithm [1] involves three activities: 
	(1) recognizing "nucleation" sites, 
        (2) extending these sites, and 
        (3) resolving overlapping predictions.
We provide in this file more details of these three steps and describe our
representation of their algorithm as a collection of rules.

To recognize nucleation sites, Chou and Fasman assign two conformation
values to each of the 20 amino acids.  The conformation values represent how 
likely an amino acid is to be part of either a helix or sheet structure, with 
higher values being more likely.  They also group the amino acids into classes 
of similar conformation value.  The classes for helix are formers, 
high-indifferent, indifferent, and breakers.  Those for sheet are formers, 
indifferent, and breakers.  Table 1 defines the values for the various types of
breakers and formers.

Table 2 contains the rules we use to represent the Chou-Fasman algorithm; 
"x@N" is true if x is the amino acid N positions from the one whose secondary 
structure the algorithm is predicting.  It predicts an alpha-helix nucleation
site if for some consecutive set of six amino acids, at least four are helix 
formers and less than two are helix breakers.  (Two helix high-indifferent amino
acids count as a helix former.)  A rule to determine if a location is a 
nucleation site simply adds the helix-former and helix-breaker values for a 
window of amino acids six wide, and if the totals are greater than four and less
than two respectively, predicts a helix nucleation site (proposition init-helix
in our rules).  Nucleation of beta-sheets is similar to alpha-helix nucleation,
except that the window is only five amino acids wide and a sheet nucleation site
is predicted if there at least three sheet formers and less than two sheet
breakers.

The third step of the algorithm - resolving overlaps - is the reason we use
the numbers in Table 1 rather than making the formers and breakers Boolean 
properties.  Chou and Fasman suggest that the conformation values of regions be
compared to resolve overlaps.  This is done in our domain theory by weighting the 
links from various amino acids according to the numbers in Table 1.  For
example, a combination of four alanines (A's) will produce a higher activation 
of the init-helix unit than a combination of four phenylalanines (F's).

The Chou-Fasman algorithm continues to predict alpha-helix as long as the 
predicate cont-helix is true.  The rules define cont-helix mostly in terms of 
helix-breaking rules - a helix continues as long as a break region is not 
encountered.  An alpha-helix break region occurs when an helix-breaker amino 
acid is immediately followed by either another helix-breaker or a 
helix-indifferent amino acid.  A helix is also broken when encountering the 
amino acid proline (P).  The process of extending beta-sheet structures works 
similarly.  The algorithm predicts coil as the default.

Results (from [2]) using data from Qian and Sejnowski's study [3] are
presented in Table 3.  Results are shown for the Chou-Fasman domain theory,
standard neural networks (ANN), and the Chou-Fasman domain theory after
refinement by neural networks.


            Table 1.  Former and breaker values for the amino acids [*].
	    -------------------------------------------------------

helix-former(E)       =  1.37  helix-former(A)  =  1.29  helix-former(L)  =  1.20
helix-former(H)       =  1.11  helix-former(M)  =  1.07  helix-former(Q)  =  1.04
helix-former(W)       =  1.02  helix-former(V)  =  1.02  helix-former(F)  =  1.00
helix-former(K)       =  0.54  helix-former(I)  =  0.50
helix-former(others)  =  0.00

helix-breaker(N)       =  1.00  helix-breaker(Y)  =  1.20  helix-breaker(P)  =  1.24
helix-breaker(G)       =  1.38
helix-breaker(others)  =  0.00

sheet-former(M)       =  1.40  sheet-former(V)  =  1.39  sheet-former(I)  =  1.34
sheet-former(C)       =  1.09  sheet-former(Y)  =  1.08  sheet-former(F)  =  1.07
sheet-former(Q)       =  1.03  sheet-former(L)  =  1.02  sheet-former(T)  =  1.01
sheet-former(W)       =  1.00
sheet-former(others)  =  0.00

sheet-breaker(K)       =  1.00  sheet-breaker(S)  =  1.03  sheet-breaker(H)  =  1.04
sheet-breaker(N)       =  1.14  sheet-breaker(P)  =  1.19  sheet-breaker(E)  =  2.00
sheet-breaker(others)  =  0.00

-----
[*] We produced these values using the tables reported by Chou and Fasman 
[1, pg. 51].  We normalized the values for formers by dividing the conformation
value of the given former by the conformation value of the weakest former.  So
for example, the helix former value of alanine (A) is 1.29 since the helix 
conformation value of alanine is 1.45 and the conformation value of the weakest
helix former phenylalanine (F) is 1.12.  Breaker values work similarly except
that the value used to calculate the breaker value is the multiplicative
inverse of the conformation value.

We did not directly use the values of Chou and Fasman for two reasons.  
One, we wanted smaller values, to decrease the number of times three very strong
helix-formers would add up to more than 4 (and similarly for sheets).  Two, 
breaker conformation values tend to be numbers between 0 and 1 with the stronger
breakers being close to 0.  We wanted the breaker value to be larger the 
stronger the breaker, so we used the inverse of the breaker's conformation 
value (restricting the result to not exceed 2).


Table 2.  The Chou-Fasman algorithm expressed as inference rules.
-----------------------------------------------------------------

/* As formulated, the structure-prediction task is to look at a 13-element
   "window" in a protein sequence and categorize the center element.  This
   window is slid down the sequence until all elements (amino acids) are
   classified.  The ends are padded with an 21st element ("solvent")
   so that the window is always full.  */

/* This is a "state-based" domain theory, by which we mean that
   the outputs (states) from one window position form part of the
   input when classifying the next element.  More explicitly,
   the outputs of this domain theory are:
	helix(output), sheet(output), coil(output).
   The previous outputs are part of the input and are noted as:
	helix(input), sheet(input), coil(input). */

/* The domain-theory syntax is roughly Prolog (no variables are used).
   Some non-standard syntax is used for summing the values in Table 1. */

/* Rules for recognizing nucleation sites. */

       init-helix	:-  position=5
				__
				>  helix-former(amino-acid@position)    >  4 
				--
			    position=0

				AND

			    position=5
				__
				>  helix-breaker(amino-acid@position)   <  2 
				--
			    position=0


	init-sheet	:-  position=4
				__
				>  sheet-former(amino-acid@position)    >  3 
				--
			    position=0

				AND

			    position=4
				__
				>  sheet-breaker(amino-acid@position)   <  2 
				--
			    position=0


/* Rules for pairs of amino acids that terminate helix structures. */

	helix-break@0	:-	N@0.
	helix-break@0	:-	Y@0.
	helix-break@0	:-	P@0.
	helix-break@0	:-	G@0.
	helix-break@1	:-	N@1.
	helix-break@1	:-	Y@1.
	helix-break@1	:-	P@1.
	helix-break@1	:-	G@1.
	helix-indiff@1	:-	K@1.
	helix-indiff@1	:-	I@1.
	helix-indiff@1	:-	D@1.
	helix-indiff@1	:-	T@1.
	helix-indiff@1	:-	S@1.
	helix-indiff@1	:-	R@1.
	helix-indiff@1	:-	C@1.

	break-helix	:-	helix-break@0, helix-break@1.
	break-helix	:-	helix-break@0, helix-indiff@1.


/* Rules for pairs of amino acids that terminate sheet structures. */

	sheet-break@0	:-	K@0.
	sheet-break@0	:-	S@0.
	sheet-break@0	:-	H@0.
	sheet-break@0	:-	N@0.
	sheet-break@0	:-	P@0.
	sheet-break@0	:-	E@0.
	sheet-break@1	:-	K@1.
	sheet-break@1	:-	S@1.
	sheet-break@1	:-	H@1.
	sheet-break@1	:-	N@1.
	sheet-break@1	:-	P@1.
	sheet-break@1	:-	E@1.
	sheet-indiff@1	:-	A@1.
	sheet-indiff@1	:-	R@1.
	sheet-indiff@1	:-	G@1.
	sheet-indiff@1	:-	D@1.

	break-sheet	:-	sheet-break@0, sheet-break@1.
	break-sheet	:-	sheet-break@0, sheet-indiff@1.


/* Rules for continuing structures. */

	cont-helix	:-	not P@0, not break-helix.
	cont-sheet	:-	not P@0, not E@0, not break-sheet.


/* Rules for predicting helix: either by nucleation or propagating the last state. */

	helix(output)	:-	init-helix.
	helix(output)	:-	helix(input), cont-helix.


/* Rules for predicting sheet: either by nucleation or propagating the last state. */

	sheet(output)	:-	init-sheet.
	sheet(output)	:-	sheet(input), cont-sheet.


/* Rules for predicting coil (the default). */

	coil(output)	:-	helix(input), break-helix.
	coil(output)	:-	sheet(input), break-sheet.
	coil(output)	:-	coil(input), not init-helix, not init-sheet.


             Table 3.  Results from different predictions methods.
             ----------------------------------------------------

   -------------------------------------------------------------------------
   |            |       Testset Accuracy        | Correlation Coefficients |
   |            ------------------------------------------------------------
   |   Method   | Total | Helix | Sheet | Coil  |  Helix |  Sheet |  Coil  |
   -------------------------------------------------------------------------
   |Chou-Fasman | 57.3% | 31.7% | 36.9% | 76.1% |  0.24  |  0.23  |  0.26  |
   |    ANN     | 61.8  | 43.6  | 18.6  | 86.3  |  0.35  |  0.25  |  0.31  |
   |  FSkbann   | 63.4  | 45.9  | 35.1  | 81.9  |  0.37  |  0.33  |  0.35  |
   -------------------------------------------------------------------------


                                   References
				   ----------
[1] Chou, P. and Fasman, G., "Prediction of the secondary structure of proteins
      from their amino acid sequence", Advanced Enzymology 47 (1978), pp 45-148.

[2] Maclin, R. and Shavlik, J., "Refining algorithms with knowledge-based
      neural networks:  Improving the Chou-Fasman algorithm for protein folding",
      Working Paper 91-2, Computer Sciences Dept., Univ. of Wisconsin - Madison,
      1991.

[3] Qian, N. and Sejnowski, T., "Predicting the secondary structure of globular
      proteins using neural network models", J. of Molecular Biology 202 (1988),
      pp 865-884.