Optimization of Perceptron Weights

How is it possible to apply these techniques to teach a perceptron, and specifically to train the weights to produce the right results? This is done on a case-by-case basis. Given specific inputs $\mathbf{x}(k)$, the corresponding desired output $y(k) = t$ is needed; this is the target for the optimization.

In computer games, the desired output can be obtained in many ways. It can be designed by an expert (training), copied from human behavior (imitation), computed by another AI component (automated learning), or based on a previous result (bootstrapping). This process is discussed further in Chapter 35, "Designing Learning AI."

Optimization of perceptrons is also a best-guess process. In this case, the estimate is a set of weights. The output of the perceptron can be evaluated with these weights, and comparing the actual output to the desired output gives us an error. Then, given the current network, we compute a new estimate of the weights that produces a smaller error. Correcting the weights is done by the delta rule.

The Delta Rule Explained

The delta rule is the method most often used to adjust the weights. The key observations are as follows:

  • Each weight contributes a variable amount to the output.

  • The scale of the contribution depends on the input.

  • The error in the output can be "blamed" on the weights.

Reducing the error requires adjusting the output toward its ideal value. To do this, we instead make multiple small adjustments to the weights (because only they can change). The adjustments can be distributed proportionally to the contribution of the weights; the bigger the contribution to an error, the bigger the adjustment!

With the intuitive explanation covered, let's move into a more practical mode. First, the relative error E is computed from the difference between the actual output y and the desired output t. This is done with an error function, which usually takes the form $E = \frac{1}{2}(t - y)^2$, known as the least mean square (LMS) error. Using the square of the difference $(t - y)$ implies that the error has the same magnitude whether the output is too high or too low. Ideally, this error should be zero, and the purpose of the training algorithm is to get it as low as possible.
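For example, with a desired output $t = 1$, an actual output of $y = 0.6$ and an actual output of $y = 1.4$ both give the same error: $E = \frac{1}{2}(0.4)^2 = \frac{1}{2}(-0.4)^2 = 0.08$.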

Technical Fact

The LMS error has a few other useful properties, including the fact that it can be easily differentiated (that is, its rate of change can be found). This is extremely useful, because we can find the gradient of the error function. Indeed, this implies that optimization techniques based on gradient methods are applicable.
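Concretely, differentiating $E = \frac{1}{2}(t - y)^2$ with respect to the output gives:

$$\frac{dE}{dy} = -(t - y)$$

This is just the negated signed error, and it is the quantity the rest of this section builds on.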


It's now possible to compute the gradient of the error in each of the weights. This is denoted $\partial E / \partial w_i$, which can be read "gradient of the error E relative to weight $w_i$." The symbol $\partial$ is not Greek, but is used to denote a partial derivative (that is, a derivative of one variable in terms of another). In this case, the partial derivative simplifies to the signed error $(t - y)$ scaled by the input $x_i$, but negated (see Figure 17.11):

Figure 17.11. Correcting the weights of the perceptron based on the output error, proportionally to the input values.


$$\frac{\partial E}{\partial w_i} = -x_i (t - y)$$


Then, to adjust the weights, we can use any gradient method: steepest descent, for example. The step $\Delta w_i$, which is used to adjust each weight, is the difference between the desired output and the actual output $(t - y)$, multiplied by the input $x_i$:

$$\Delta w_i = \eta \, x_i (t - y)$$


Again, $\eta$ (eta) is a small constant known as the learning rate.
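The following is a minimal sketch of this update in C++, assuming the weights and inputs are stored in plain float arrays; the function name and layout are illustrative and not taken from the book's source code:

    #include <cstddef>

    // Delta-rule update for a single perceptron with an identity
    // activation. Adjusts the weights in place for one training case.
    void DeltaRuleUpdate(float* weights,      // weights w_i, modified in place
                         const float* inputs, // inputs x_i for this case
                         std::size_t n,       // number of weights/inputs
                         float target,        // desired output t
                         float eta)           // learning rate, e.g. 0.1
    {
        // Actual output y: the net sum passed through sigma(z) = z.
        float y = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            y += weights[i] * inputs[i];

        // Signed output error (t - y), shared by all the weights.
        const float error = target - y;

        // Each weight steps opposite to its gradient -x_i (t - y),
        // so the adjustment is eta * x_i * (t - y).
        for (std::size_t i = 0; i < n; ++i)
            weights[i] += eta * inputs[i] * error;
    }

Calling this once per training case, and looping over all the cases repeatedly, gives the case-by-case training process described at the start of this section.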

Formal Proof

This part of the chapter proves how to compute the gradient $\partial E / \partial w_i$ for each weight. Those with little interest in the proof should skip ahead to the last paragraph in this subsection. Understanding this outline really opens the door to more advanced optimization techniques (in academic papers), but it isn't necessary for implementing the code! The proof relies on differential calculus.

Noting that the error depends on the weights only through the net sum z, the chain rule can be applied:

$$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial z} \frac{\partial z}{\partial w_i}$$


This expresses the gradient of the error in each weight as the gradient of the error relative to the net sum, $\partial E / \partial z$, multiplied by the gradient of the net sum relative to the weight, $\partial z / \partial w_i$. Noting that $z = \sum_i w_i x_i$, the second right-side term can be simplified greatly. (Weights are independent of each other.)

$$\frac{\partial z}{\partial w_i} = x_i$$


Now for the first right-side term. The net sum z is the output y before the activation function $\sigma$ is applied, so the gradient can be re-expressed (using the derivative of the error function):

$$\frac{\partial E}{\partial z} = \frac{\partial E}{\partial y} \sigma'(z) = -(t - y) \, \sigma'(z)$$


Because most perceptrons since the Adaline use an identity activation function, $\sigma(z) = z$, it follows that $\sigma'(z) = 1$. This implies the following:

$$\frac{\partial E}{\partial w_i} = -x_i (t - y)$$


There we have it. The gradient of the error for each weight is just the negated input multiplied by the signed error: $-x_i (t - y)$. This can be used to adjust the weights with one of the many gradient methods previously presented (usually stepping opposite to the gradient).

This result is surprisingly intuitive. When we want to adjust the output for a given input, all we have to play with is the weights. Each weight could be adjusted to contribute toward the target using $(t - y)$. However, the contribution of each weight depends on the value of its input, so this step $(t - y)$ is scaled by $x_i$ to influence only the relevant weights. Scaling that by a learning rate in $[0,1]$ keeps each step small enough for stable learning.
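A quick worked example with made-up numbers shows the proportionality. Take weights $w = (0.2, -0.4)$, inputs $x = (1.0, 0.5)$, target $t = 1$, and $\eta = 0.1$. The output is $y = 0.2 \cdot 1.0 - 0.4 \cdot 0.5 = 0$, so the signed error is $t - y = 1$. The adjustments are $\Delta w_1 = 0.1 \cdot 1.0 \cdot 1 = 0.1$ and $\Delta w_2 = 0.1 \cdot 0.5 \cdot 1 = 0.05$; the weight fed by the larger input receives the larger share of the correction, exactly as the intuitive explanation suggests.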


