1.5 Example: Protein Solubility as a Language

As a preliminary step, let us consider a transformation that is easy to implement with macromolecules but difficult with programmable machines. Practically speaking, an ab initio calculation of the properties of even a small cluster of particles exceeds the capabilities of programmable computers. For the present purpose, however, we would like to consider an example of a problem that typically arises in computer science, namely, the problem of deciding whether a sequence of symbols belongs to a given set of sequences. Such sets are considered in formal language theory. The question is whether it is possible to construct a machine, subject to given constraints, that can recognize the language. For example, the constraint might be that the machine is a finite automaton (as are actual computers).

Consider a language L in which the elements are protein sequences that satisfy a certain property (Davidson and Sauer 1994; Prijambada et al. 1996; Yamauchi et al. 1998). The alphabet of such a language would be a set of amino acids—for instance, the twenty amino acids that are the predominant building blocks of natural proteins. We can choose solubility S in water as the property that has to be satisfied by a sequence p composed of the amino acids that constitute the alphabet (Σ). The conditions c of the process must be fixed (e.g., temperature, pressure, pH, and cosolutes; Laidler and Bunting 1973; Cacace, Landau, and Ramsden 1997). Formally, we can write
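In set-builder notation, using only the symbols defined in the surrounding text, one form the definition can take is sketched here. The strict inequality and the use of Σ* for the set of all finite sequences over Σ are assumptions of this sketch:

```latex
L = \{\, p \in \Sigma^{*} \;:\; |p| \le w \ \text{and} \ S_c(p) > x \,\}
```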

where L denotes the language, x is a fixed solubility threshold (mass of protein per mass of solvent), and we assume that the length |p| of the sequence of amino acids does not exceed some constant w. The important point is that S_c, the solubility under the conditions c, is a physical and not a formal condition.

In principle, a computer of sufficient size and speed should be able to answer the question of whether a given sequence p is a member of L. In practice, however, answering the membership question for this language by implementing formal rules that carry out the physics calculations is not efficient. To decide the membership of a sequence in this language, the properties of the (possibly folded) amino acid sequence need to be known; the language thus encodes the protein-folding problem. Calling on the calculational methods of physics to solve this problem is clearly daunting. It is also possible, however, to decide membership by actually synthesizing the protein with the sequence in question and measuring its solubility. The synthesis and measurement procedure could be automated. The resulting machine can easily decide for any particular sequence presented to it whether it belongs to L, in effect performing a computation that may well exceed the practical capabilities of presently available general-purpose machines.
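The division of labor described above can be sketched as a recognizer that checks the formal conditions (alphabet and length bound) symbolically and delegates the physical condition to a measurement. This is an illustrative sketch, not a procedure given in the text: the names `in_language`, `measure_solubility`, `count_candidates`, and `stand_in_measurement`, the one-letter amino acid codes, the exclusion of the empty sequence, and the strict comparison against x are all assumptions, and the stand-in measurement returns invented numbers purely to exercise the code.

```python
from typing import Callable

# Assumption: the alphabet is the twenty standard amino acids, one-letter codes.
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def in_language(
    p: str,
    measure_solubility: Callable[[str], float],
    x: float,
    w: int,
) -> bool:
    """Decide whether sequence p belongs to L.

    measure_solubility is a hypothetical callback standing for the physical
    step: synthesize p and measure its solubility under fixed conditions c.
    """
    # Formal conditions: nonempty, within the length bound, over the alphabet.
    if not p or len(p) > w or not set(p) <= AMINO_ACIDS:
        return False
    # Physical condition: delegated to the measurement, not computed.
    return measure_solubility(p) > x


def count_candidates(w: int, alphabet_size: int = 20) -> int:
    """Number of nonempty sequences of length at most w."""
    return sum(alphabet_size ** k for k in range(1, w + 1))


def stand_in_measurement(seq: str) -> float:
    """Invented values; a placeholder for the real synthesis and assay."""
    return 0.05 if seq.startswith("M") else 0.001


accepted = in_language("MKT", stand_in_measurement, x=0.01, w=300)
rejected = in_language("MKB", stand_in_measurement, x=0.01, w=300)  # "B" is not in the alphabet
```

Because the length is bounded by w, L is a finite set, so a finite automaton that recognizes it exists in principle. The count returned by `count_candidates` suggests why tabulating the answers in advance is not a practical route: for w = 100 the longest sequences alone number 20^100, on the order of 10^130.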

1.6 Macro-Micro Interface

Language-recognition problems of the type considered above can be viewed as pattern-recognition problems. The patterns might be computer codes that have to be compiled. Or they might be objects in the world—say, chairs. If all (and only) chairs were marked with a standard printed "C", then it would be easy for a digital computer to say "yes" whenever it is presented with a chair and "no" whenever it is presented with some other object. Without such preprocessing, however, no existing computer program can do this job. The morphology of chairs is too ambiguous and variable. The required program, though it might exist, is too complex to express in a reasonably compressed way, even assuming that we knew how to write it at all. Yet humans perform this transformation with relative ease.

The protein solubility example was intended to show that molecules can be used to perform transformations that are refractory to programmable machines. But of course that example is far from using this power to address any problem of interest. To do so, the molecular level needs to be connected to the external world and the transformation needs to be adapted into a useful function.

We will return to the adaptation issue in section 1.9. Here, it is pertinent to consider the general requirements for input and output (Conrad 1984, 1990). In biological cells, the signals that represent the patterns to be recognized could come from either the internal milieu or the environment. The former case is pertinent to regulation and the latter to perception-action activities. Three levels of scale are involved: macro, meso, and micro. The signals from the environment are generally macroscopic on some dimension of scale (energy, mass, dissipation, time, space) or represent features of the world that are macroscopic. The nerve impulse, for example, is a macroscopic signal. Signals inside the cells (say, diffusion of substances) can be either macroscopic or mesoscopic. The signals constitute the milieu patterns, or context, to which proteins and other biological macromolecules respond. Because these molecules must be sufficiently large to have significant shape features (and shape dynamics), they can be classified as mesoscopic. But the nuclear coordinates couple with the electronic coordinates, so that we also have to think in unambiguously microscopic terms (Conrad 1994a). In short, we have downward flow of influence from the macro to the meso to the micro.

This downward flow is complemented by an upward flow, triggered by the response of the macromolecule or macromolecular aggregate—say, a catalytic response in the case of an enzyme or a mechanical response in the case of a contractile unit. For the present purpose, it is sufficient to think in terms of enzymes. The chemical changes produced in the milieu link the activity of different enzymes. The linking chemicals can be thought of as signals, either because they provide context or because they serve as common intermediates. The communication between the processing macromolecules is thus essentially at a mesoscopic level. Macromolecules can also communicate through direct conformational interactions, in which case the signal energies are in the micro domain. Biological cells are replete with receptors that convert signals representing macro features of the external environment to internal signals that can be brought into the web of meso- and micro-level processing.

The amount of computational work performed at the meso- and micro-levels should be as great as possible, because of the thermodynamic cost of producing macroscopic signals. Enzymes, as catalysts, are thermodynamically reversible; their pattern-recognition work is free, driven only by the heat bath. The dissipation in a typical biochemical reaction can range from 10 to 100 kT. A nerve impulse might cost 10^5 to 10^10 kT, depending on the size of the neuron. To the extent that processing is kept as close as possible to the microlevel, the amount of information processing obtainable is vastly enhanced.
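The size of the gap can be made concrete with a back-of-envelope calculation. The sketch below uses only the ranges quoted above, reading the nerve impulse figures as 10^5 to 10^10 kT; the temperature of 310 K, used solely to express kT in joules, is an added assumption, as are the variable names.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)
T = 310.0           # assumed temperature in kelvin, roughly body temperature
kT = K_B * T        # thermal energy unit in joules

# Ranges quoted in the text, in units of kT.
reaction_cost_kT = (10.0, 100.0)  # typical biochemical reaction
impulse_cost_kT = (1e5, 1e10)     # nerve impulse, depending on neuron size

# Number of biochemical reaction steps that fit in the energy budget of one impulse.
min_ratio = impulse_cost_kT[0] / reaction_cost_kT[1]  # cheapest impulse, costliest reaction
max_ratio = impulse_cost_kT[1] / reaction_cost_kT[0]  # costliest impulse, cheapest reaction

print(f"kT at {T:.0f} K is about {kT:.2e} J")
print(f"one impulse costs as much as {min_ratio:.0e} to {max_ratio:.0e} reactions")
```

On these figures a single nerve impulse dissipates as much as roughly a thousand to a billion individual biochemical reactions, which is the sense in which keeping the processing below the macroscopic level enlarges the amount of processing available for a given energy budget.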

Macro-micro communication links are essential for any computational system that utilizes the activity of individual molecules, as opposed to systems that employ only statistical aggregates of particles. The signal-processing activity of the medium can itself have significant nonlinear dynamics (see chapters 3 and 4 of this volume). The whole medium, not just the controlling macromolecules, can then contribute to the input-output transform. But the controlling macromolecular components are critical, because the recognition-action events would otherwise be slow and difficult to mold for different functionalities. The addition of new signal substances and macromolecular species to the medium need not, and in general does not, yield an additive response. This nonlinear interaction among components is where the potential for performing powerful context-sensitive transforms resides.