
very significant effect on the performance of the network. In addition, the design (‘architecture’) of the network is important in ensuring that the network is able to generalise from the sample (training) data. Finally, encoding of the data can make a substantial difference to the accuracy achieved by the network.

Kavzoglu (2001) reports on an extensive survey of the impact of these choices on a network’s ability to classify unknown patterns. He notes that a multilayer perceptron requires at least one hidden layer, in addition to the input and output layers, and that the number of neurones in the hidden layer(s) will significantly influence the network’s ability to generalise from the training data to unknown examples (i.e. pixels in the image to be classified). Small networks cannot identify fully the structures present in the training data (known as underfitting), while large networks may determine decision boundaries in feature space that are unduly influenced by the specific properties of the training data. The latter phenomenon is known as overfitting. A single hidden layer is thought to be adequate for most classification problems, but where there are large numbers of output classes then two hidden layers may produce a more accurate result. Kanellopoulos and Wilkinson (1997) suggest that where there are twenty or more output classes, then two hidden layers should be used, and that the number of neurones in the second hidden layer should be equal to two or three times the number of output classes.
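As an illustration of the Kanellopoulos and Wilkinson (1997) rule of thumb described above, the following Python sketch decides how many hidden layers to use and how large the second layer should be. The function name, the choice of a factor of two, and the treatment of the first hidden layer (whose size the rule does not prescribe) are our own assumptions, not part of the original recommendation.

```python
def hidden_layer_sizes(n_output_classes, first_layer_size):
    """Sketch of the Kanellopoulos and Wilkinson (1997) rule of thumb.

    first_layer_size is supplied by the user (the rule itself does not
    prescribe it); the function only decides whether a second hidden
    layer is needed and, if so, how many neurones it should contain.
    """
    if n_output_classes >= 20:
        # Twenty or more output classes: use two hidden layers, the
        # second with two to three times as many neurones as there are
        # output classes (a factor of 2 is chosen here for illustration).
        return [first_layer_size, 2 * n_output_classes]
    # Fewer than twenty classes: a single hidden layer is usually adequate.
    return [first_layer_size]
```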

Sarle (2000) lists a number of factors that influence the choice of an appropriate number of hidden-layer neurones. These factors include the number of input and output neurones, the size of the training data set, the complexity of the classification being performed, the amount of noise present in the data to be classified, the type of activation function used by the hidden-layer neurones, and the nature of the training algorithm. Kavzoglu (2001) reviews a number of ad hoc rules and recommendations, and concludes that Garson’s (1998) proposal produces good results in the cases tested. Garson (1998) proposes that the number of hidden-layer neurones should be set to the value Np/(r(NI + No)), where Np is the number of training samples, NI is the number of input features, and No is the number of output classes. The parameter r reflects the noisiness of the data and the cleanness or simplicity of the classification. Typical values of r are in the range 5 (clean data) to 10 (less clean data), but values as high as 100 or as low as 2 are possible. Pruning algorithms (Section 3.1.4) can be used once the network has been trained in order to remove links between neurones that are ineffective (Kavzoglu and Mather, 1999) and thus increase the generalisation capabilities of the network.
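A minimal sketch of Garson’s rule as stated above is given below; the function name, the default value of r and the worked example are illustrative assumptions rather than part of the original text.

```python
def garson_hidden_neurones(n_samples, n_inputs, n_outputs, r=5):
    """Sketch of Garson's (1998) rule for the number of hidden-layer
    neurones: Np / (r * (NI + No)).

    r is chosen by the user: around 5 for clean data, around 10 for
    noisier data, with roughly 2 to 100 as the practical extremes.
    """
    return max(1, round(n_samples / (r * (n_inputs + n_outputs))))

# Example: 3000 training pixels, 7 input bands, 10 output classes and
# reasonably clean data (r = 5) gives about 35 hidden-layer neurones.
print(garson_hidden_neurones(3000, 7, 10, r=5))
```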

The learning rate η in Equation (3.5) must be specified. If Equation (3.9) is used for updating the weights then the user must define both the learning rate η and the momentum coefficient ξ. The value of η should not be too small, or convergence towards the minimum of the error function will be unacceptably slow. Nor should the value of the parameter η in the steepest descent minimisation procedure be too large, or the result
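Equations (3.5) and (3.9) are not reproduced in this excerpt, so the sketch below uses the standard back-propagation weight update with a momentum term, Δw(t) = −η ∂E/∂w + ξ Δw(t−1), which is the form the parameters η and ξ refer to; the function name and default values are illustrative assumptions.

```python
import numpy as np

def update_weights(w, grad, prev_delta, eta=0.2, xi=0.9):
    """One steepest-descent weight update with a momentum term:
        delta_w(t) = -eta * dE/dw + xi * delta_w(t-1)

    eta (learning rate) and xi (momentum coefficient) must both be
    chosen by the user; the defaults here are arbitrary examples.
    """
    delta = -eta * grad + xi * prev_delta
    return w + delta, delta

# Example: a small weight vector, its error gradient, and no previous step.
w = np.array([0.1, -0.3, 0.05])
grad = np.array([0.02, -0.01, 0.04])
w, delta = update_weights(w, grad, prev_delta=np.zeros_like(w))
```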
