Adaptive Gathering Behaviors | AI Game Development: Synthetic Creatures with Learning and Reactive Behaviors

The gathering component of the tactical behaviors is responsible for deciding which type of objects the animat needs to collect. The decision is learned from feedback, which in turn depends on the animat's moods.

Actions

The RL algorithm learns the most suitable actions. In this case, the actions correspond to the possible types of objects that can be gathered: armor, health, weapons, ammo, or none. The empty action is excluded from the learning; instead, it is selected by default if none of the other actions has a positive benefit. The other component, which learns the types of movement, is responsible for determining whether gathering is necessary. (For instance, if the health and armor are almost full, gathering may not be worthwhile at all.)
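The fallback to the empty action can be sketched as follows. This is a minimal illustration rather than the book's implementation: the names `Gather`, `LEARNED_ACTIONS`, and `best_or_none` are hypothetical, and the greedy pick is used only to keep the example short.

```python
from enum import Enum


class Gather(Enum):
    """Possible gathering actions (names are illustrative assumptions)."""
    NONE = "none"
    ARMOR = "armor"
    HEALTH = "health"
    WEAPONS = "weapons"
    AMMO = "ammo"


# Only these actions take part in learning; NONE is the default fallback.
LEARNED_ACTIONS = [Gather.ARMOR, Gather.HEALTH, Gather.WEAPONS, Gather.AMMO]


def best_or_none(estimates):
    """Return the learned action with the highest positive estimate.

    `estimates` maps Gather members to estimated benefit. If no learned
    action has a positive benefit, the empty action is returned instead.
    """
    positive = {
        action: value
        for action, value in estimates.items()
        if action in LEARNED_ACTIONS and value > 0.0
    }
    if not positive:
        return Gather.NONE
    return max(positive, key=positive.get)
```

Keeping `NONE` out of `LEARNED_ACTIONS` means no estimate is ever stored for it, which matches the idea that the empty action is a default rather than something learned.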

States

All RL algorithms require state variables to learn the policy. For gathering behaviors, the state mostly represents the current inventory: ammo available, number of weapons, health, and armor levels. These factors determine whether it's possible to collect items, but the moods determine whether the animat "feels" it's important. As such, moods are also included in the state used by RL.
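One way to picture such a state is as a small, hashable record combining inventory and mood. The sketch below is an assumption-laden illustration: the class `GatherState`, the helper `bucket`, the three-level discretization, the maximum of 100 for health and armor, and the mood names are all hypothetical choices, not taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatherState:
    """State for the gathering learner (field layout is an assumption)."""
    ammo: int      # coarse bucket index for available ammo
    weapons: int   # number of weapons carried
    health: int    # coarse bucket index for health
    armor: int     # coarse bucket index for armor
    mood: str      # e.g. "angry"; mood names are hypothetical


def bucket(value, maximum, levels=3):
    """Map a value in [0, maximum] to a bucket index in [0, levels - 1]."""
    if maximum <= 0:
        return 0
    ratio = min(max(value / maximum, 0.0), 1.0)
    return min(int(ratio * levels), levels - 1)


def make_state(ammo, max_ammo, weapons, health, armor, mood):
    """Build a discretized state; health and armor are assumed to cap at 100."""
    return GatherState(
        ammo=bucket(ammo, max_ammo),
        weapons=weapons,
        health=bucket(health, 100),
        armor=bucket(armor, 100),
        mood=mood,
    )
```

Because the dataclass is frozen, instances can serve directly as dictionary keys for a table of estimated returns.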

Reward Signal

The reward signal is mainly based on the collection of items. If an item affects the player, it's likely there will be some form of reward. Depending on the mood, the reward may have different values. (For instance, health and armor seem insignificant to an angry player.)
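A mood-dependent reward could be sketched as a lookup table of weights. The mood names, the weight values, and the function `collection_reward` are illustrative assumptions; the text does not specify actual numbers.

```python
# Hypothetical per-mood weights for each item type. The values are made up
# for illustration; only the angry row reflects the example in the text
# (health and armor matter little to an angry player).
MOOD_WEIGHTS = {
    "angry": {"health": 0.1, "armor": 0.1, "weapons": 1.0, "ammo": 1.0},
    "afraid": {"health": 1.0, "armor": 1.0, "weapons": 0.3, "ammo": 0.3},
}

DEFAULT_WEIGHT = 0.5  # assumed weight for moods or items not listed above


def collection_reward(item, mood, affected_player):
    """Return the reward for collecting `item` while in `mood`.

    An item that had no effect on the player (for example, health picked
    up at full health) yields no reward in this sketch.
    """
    if not affected_player:
        return 0.0
    return MOOD_WEIGHTS.get(mood, {}).get(item, DEFAULT_WEIGHT)
```

A table keeps the mood influence easy to inspect and tune, although a continuous mood value could equally be used to scale the reward.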

Learning Policy

The gathering mode is chosen again periodically: either when a reward signal arrives (that is, an object is collected) or after a certain amount of time elapses. The learning is achieved with a simple statistical technique, namely averaging. To update the estimate of the return, the reward is added to the total reward accumulator, and the variable counting the number of samples is incremented. The estimated return is the total divided by the number of samples (that is, the average).
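The averaging update described here can be sketched in a few lines. The class name `ReturnEstimate` is hypothetical, and returning zero before any sample has been seen is an assumption made to avoid dividing by zero.

```python
class ReturnEstimate:
    """Sample-average estimate of the return for one action."""

    def __init__(self):
        self.total = 0.0   # accumulated reward
        self.samples = 0   # number of rewards observed

    def update(self, reward):
        """Add one observed reward to the accumulator."""
        self.total += reward
        self.samples += 1

    @property
    def value(self):
        """Average reward so far (0.0 before any sample, by assumption)."""
        if self.samples == 0:
            return 0.0
        return self.total / self.samples


# Usage: one estimate per gathering action, updated when a reward arrives.
health = ReturnEstimate()
for r in (1.0, 0.0, 0.5):
    health.update(r)
```

Storing the total and the count separately mirrors the description directly; an incremental form that stores only the running average would give the same result.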

During learning, a new action is chosen stochastically, with the probability of each action proportional to its estimated return. When the learning is satisfactory, the gathering behavior may include multiple modes simultaneously (for example, gathering health and armor at the same time).
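Selection in proportion to estimated return might look like the following sketch. The function name `choose_action` is hypothetical. Dropping actions with non-positive estimates is an assumption: proportional probabilities are not defined for negative values, and the empty action is described earlier as the default when nothing has a positive benefit.

```python
import random


def choose_action(estimates, rng=random):
    """Pick an action with probability proportional to its estimated return.

    `estimates` maps action names to estimated returns. Actions with a
    non-positive estimate are ignored (an assumption of this sketch);
    if none remain, the empty action "none" is returned.
    """
    positive = {a: v for a, v in estimates.items() if v > 0.0}
    if not positive:
        return "none"
    actions = list(positive)
    weights = [positive[a] for a in actions]
    # random.choices draws with replacement using the given relative weights.
    return rng.choices(actions, weights=weights, k=1)[0]


# Usage: with estimates of 3.0 and 1.0, "health" should be picked
# roughly three times as often as "armor".
rng = random.Random(0)
picks = [choose_action({"health": 3.0, "armor": 1.0}, rng) for _ in range(4000)]
```

Passing a seeded `random.Random` instance makes the behavior reproducible during testing, while the default uses the module-level generator.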