1.5 Example: Protein Solubility as a Language

As a preliminary step, let us consider a transformation that is easy to implement with macromolecules but difficult with programmable machines. Practically speaking, any ab initio calculation of the properties of even a small cluster of particles outpaces programmable computational capabilities. For the present purpose, however, we would like to consider an example of a problem that typically arises in computer science—namely, the problem of deciding whether or not a sequence of symbols belongs to a given set of sequences. Such sets are considered in formal language theory. The question is whether it is possible to construct a machine, subject to given constraints, that can recognize the language. For example, the constraint might be that the machine is a finite automaton (as are actual computers).

Consider a language L in which the elements are protein sequences that satisfy a certain property (Davidson and Sauer 1994; Prijambada et al. 1996; Yamauchi et al. 1998). The alphabet of such a language would be a set of amino acids—for instance, the twenty amino acids that are the predominant building blocks of natural proteins. We can choose solubility S in water as the property that has to be satisfied by a sequence p composed of the amino acids that constitute the alphabet (Σ). The conditions c of the process must be fixed (e.g., temperature, pressure, pH, and cosolutes; Laidler and Bunting 1973; Cacace, Landau, and Ramsden 1997). Formally, we can write

where L denotes the language, x is a fixed solubility threshold (mass_protein/mass_solvent), and we assume that length (|p|) of the sequence of amino acids does not exceed some constant w. The important point is that S_c is a physical and not a formal condition.

In principle, a computer of sufficient size and speed should be able to answer the question whether a given sequence p is a member of L. In practice, however, performing physics calculations to answer the membership question for the above language by implementing formal rules is not efficient. To decide the membership of a sequence in this language, the properties of the (possibly folded) amino acid sequence need to be known, thus the language encodes the protein-folding problem. Calling on calculational methods of physics to solve this problem is clearly daunting; however, it is also possible to decide the membership by actually synthesizing the protein with the sequence in question and measuring its solubility. The synthesis and measurement procedure could be automated. The resulting machine can easily decide for any particular sequence presented to it whether it belongs to L, in effect performing a computation that may well exceed the practical capabilities of presently available general-purpose machines.