# «In Journal of Parallel and Distributed Computing, vol. 26, pp. 125-131, 1995. Analysis of the Convergence and Generalization of AA1 Christophe ...»

In Journal of Parallel and Distributed Computing, vol. 26, pp. 125-131, 1995.

Analysis of the Convergence and Generalization of AA1

Christophe Giraud-Carrier and Tony Martinez

Department of Computer Science, Brigham Young University

Abstract

AA1 is an incremental learning algorithm for Adaptive Self-Organizing Concurrent Systems

(ASOCS). ASOCS are self-organizing, dynamically growing networks of computing nodes. AA1

learns by discrimination and implements knowledge in a distributed fashion over all the nodes. This paper reviews AA1 from the perspective of convergence and generalization. A formal proof that AA1 converges on any arbitrary Boolean instance set is given. A discussion of generalization and other aspects of AA1, including the problem of handling inconsistency, follows. Results of simulations with real-world data are presented. They show that AA1 gives promising generalization.

1. Introduction The last decade has seen a renewed interest in connectionist computing. Some of the reasons for this interest can be traced to the discovery of the wide range of neural network applications, the increasing interest in self-organizing systems, and the design of improved learning algorithms. A good overview of the current state of the art is found in [11]. Most of the current models are based on static networks of computing nodes, where learning is effected by changing the weights of the connections between nodes and/or by modifying the nodes’ functions. Adaptive Self-Organizing Concurrent Systems (ASOCS) are dynamic networks that learn by adapting both the function performed by the nodes and the overall network topology. Hence, the network grows over time to fit the problem.

ASOCS operate in one of two modes: learning and execution. In execution mode, the network receives inputs and produces outputs as the data flow asynchronously and in parallel through the network. In learning mode, the system is incrementally presented a set of Boolean rules, which is called the instance set. As each instance is presented, the network adapts both its topology and its nodes’ functions to learn the new instance, while preserving consistency with previously acquired knowledge.

Several learning algorithms have been developed for ASOCS. They include Adaptive Algorithm (AA) 1 [3], AA2 [4] and AA3 [5]. AA1 is the object of this paper. AA1 learns by discrimination and implements knowledge in a distributed fashion. This paper analyzes the convergence and generalization properties of AA1.

Convergence refers to the system's ability to learn the instance set. Convergence has played a significant historical role in the development of neural networks. Rosenblatt's perceptron [9], the first neural network, was quickly proven to be limited to linearly separable functions [7]. However, it was also shown that a (at least) three-layer network could be used to represent any arbitrary function.

The challenge was thus to design a learning algorithm for multi-layer networks. Researchers struggled until the first multi-layer learning algorithm, backpropagation [11, 12], was proposed.

However, because it uses a gradient-descent technique and is thus subject to local minima, there is no guarantee, in practice, that backpropagation will find a plausible solution to a training set. The first part of this paper reviews AA1 and formally proves that AA1 converges on any arbitrary Boolean instance set. Moreover, unlike backpropagation which requires several passes over the training set, AA1 is a "one-shot" learner with fast, bounded learning time.

During learning, the system is presented with only a subset of the function it is to solve.

Generalization refers to the system's ability to correctly guess the rest of the function. To be of any practical use, a system should not be confined to simply remembering (i.e., rote learning), but should also be capable of inducing new knowledge from past experience. Sections 4 and 5 of this paper discuss AA1's generalization scheme and show that it produces promising results. AA1's originality lies in the fact that it generalizes not on the basis of how closely the new unknown input matches one of the learned rules, but rather on how well the network can discriminate the unknown input from the learned features of the opposite class.

Section 2 gives an overview of AA1. Section 3 contains the proof that AA1 converges on any arbitrary Boolean instance set. Section 4 discusses other features of AA1 including generalization, and the handling of inconsistency. Section 5 presents results of simulations of AA1 on real-world data. Section 6 summarizes the paper.

2. AA1 Learning Algorithm This section gives a short overview of AA1. Only those details essential to the proof of the following section are presented. For more details on AA1, see [2, 3].

An instance set is a set of instances, where each instance is a conjunction of input values together with the output value they imply. We use V and V' to mean V is positive (i.e., true) and V is

**negative (i.e., false), respectively. The following are examples of instances:**

AB → Z (i) B'C → Z (ii) AC' → Z' (iii) The conjunctive part of instances need not contain an input value for each input variable.

Such instances are used as short-hand. If I is an instance in which no value is specified for some input variable V, then I represents both instances obtained from I by adding V and V', respectively, to the conjunctive part of I. For example, if A, B and C are the only 3 input variables, then instance (i) represents both instances ABC → Z, and ABC' → Z. In other words, instance (i) means that when A and B are both positive, Z must become true, regardless of the values of any other input variables.

The value of the output implied by an instance is its polarity. Instances (i) and (ii) are positive and instance (iii) is negative. Two instances with the same polarity are said to be concordant with respect to each other, while two instances of opposite polarity are discordant. Instances (i) and (ii) are concordant; instances (ii) and (iii) are discordant.

The basic architectural unit in AA1 is a node. During execution, each node computes the AND or OR function of its (possibly inverted) inputs. During learning, since all inputs may not have a value in an instance, these functions are extended to accommodate unspecified (denoted ?) inputs (e.g., 0 AND ? implies 0, 0 OR ? implies ?). In addition, each node records, in the form of a node table (NT), the value it outputs for each instance in the training set. The NT has two columns, P and N, holding the output values for positive and negative instances, respectively, and a column D which is used in the node selection part of the algorithm presented below. Figure 1 shows an example of a node.

B G C D Figure 2 - Example of a Network If a cell in the NT contains the value 0 or 1, the node discriminates the instance represented by that cell from all discordant instances whose cells contain the opposite value. For example, let K be the node n3 of Figure 2. Consider the first cell of the P column whose value is 1, and let I be the corresponding instance. If K outputs 1, there is no way to tell if it is I that is matched or if it is the instance corresponding to the third cell of the N column. However, if K outputs 0, clearly I is not matched. But K outputs 0 for the instances corresponding to the first and second cells of the N column, so K is able to discriminate between I and these two instances. Note that cells containing ?'s

**cannot participate in discrimination. There are four classes of nodes, based on discrimination:**

• discriminant node: discriminates at least one positive instance from one negative instance,

• non-discriminant node: does not discriminate at least one positive instance from one negative one,

• complete discriminant node: discriminates every positive instance from every negative instance,

• one-sided discriminant node: asserts one value for either all positive or all negative instances, and the opposite value for at least one discordant instance (see node n2 of Figure 2).

Discrimination is the key factor in AA1. When it is presented an instance set, AA1 learns to discriminate between positive and negative instances. AA1 fulfills the instance set (i.e., converges) if the network it builds outputs 0 for every negative instance and 1 for every positive instance. That is, AA1 converges if and only if the final network's top node is complete discriminant and has 1's in its P column.

Learning proceeds as follows. Instances are introduced one at a time in an incremental fashion. For each new incoming instance, AA1 goes through a preprocessing phase that maintains the instance set consistent and seeks to minimize it. An instance set is inconsistent if it contains discordant instances whose conjunctive parts could be true simultaneously. For example, instances AB → Z and BC → Z' produce an inconsistency since for A, B and C positive, the first one implies that Z is positive while the second one implies that Z is negative. Inconsistency is solved by giving precedence to the newer instances (see [6] for details). Complete minimization is not reasonable, but AA1 attempts partial minimization through pairwise comparison of the training instance and instances in the current instance set. The correctness of this aspect of the algorithm has been proved elsewhere [10]. The correctness of the actual learning algorithm has not. Section 3 fills this gap.

The result of the above preprocessing phase is a delete-list and an add-list containing the instances to be removed and added, respectively, from the current instance set to keep it consistent and somewhat minimal. To process the delete-list, AA1 simply causes each node to empty the corresponding cells in its NT. Each instance in the add-list is then added to the current instance set and presented to the network. Each node places its output for the new instance in the corresponding cell of its NT. The network's output is then checked. If it is concordant with that of the new instance, no changes are made to the network and the next training instance can be processed. If, on the other hand, it is discordant or ?, then the current top node is no longer complete discriminant and modifications must be made to the network. Since the current top node correctly discriminates the instance set, with the exception of the new instance, AA1 constructs a new node that discriminates the new instance from all discordant instances. This one-sided discriminant node (OSDN) is then combined with the current top node to build a new complete discriminant node. The network finally undergoes a phase of self-deletion in which nodes that are no longer needed (such as nondiscriminant nodes) are removed from the network. Self-deletion increases parsimony. Details on the various kinds of self-deletions may be found in [2, 3].

The construction of OSDN is effected by a process of node selection and node combination.

Node selection consists of a greedy search through the network for a set of nodes that, when combined, give rise to OSDN. The nodes that are selected are those nodes that discriminate the new instance from the largest number of remaining non-discriminated discordant instances (the D column of the NT keeps track of which discordant instances are already discriminated by the selected nodes).

If node selection fails, i.e., if the selected nodes would not combine to create OSDN, then node creation takes place, and AA1 guides the construction of new nodes that can discriminate the new instance from all remaining non-discriminated discordant instances. From each remaining nondiscriminated instance an input is randomly selected that is sufficient to discriminate it from the new instance. Pairs of these chosen inputs are then used as inputs for the new nodes whose functions must be set so that they output 0 or 1 when both inputs are matched and the complement when either one is not matched. Once all the necessary growth nodes have been selected and/or created, they are combined so as to build OSDN. Each combination connects two nodes to a new node that discriminates the union of the discriminations done by the nodes from which it was created.

Due to space, we can only give a high-level example of how AA1 updates the network upon receipt of a new instance. A detailed example is found in [3]. Consider the network of Figure 2.

Assume the new instance is positive and causes the top node to output ?, node n3 to output 1 and node n2 to output ?. The result of processing the new instance is in Figure 3. Note that Node Selection took place and node n5 was selected as a growth node, since it discriminates the new instance from two of the three discordant instances. Node Addition was then necessary and resulted in the creation of input A (sufficient to discriminate the new instance from the remaining discordant non-discriminated instance). Node n5 was subsequently combined with input A (degenerated growth node) to produce a OSDN (node n3). The OSDN was finally combined with the old top node (node n2) to produce a new complete discriminant top node (node n1).

When two nodes N 1 and N 2 combine and give rise to a parent node PN with function f, PN’s node table is loaded by applying f pairwise to corresponding cells in N 1 ’s and N 2 's node tables. It follows that PN's node table contains the values that PN would output for each instance in the current instance set (i.e., there is no need to rebroadcast all the instances of the instance set to the modified network).

n1 OR

3. AA1 Is Correct In this section, we prove that AA1 always fulfills the current instance set. Note first that processing the delete-list is trivially correct since emptying cells does not change the complete discriminant nature of the current top node. We show that processing the add-list is also correct.

Let NI be the instance being processed. Let I be the set of current discordant instances that NI must be discriminated from, and OSDN a one-sided discriminant node that discriminates NI from I.

Let CTN be the current top node, and NTN the node resulting from combining CTN and OSDN.

Lemma 1 shows that Node Selection and Node Addition give rise to a set of growth nodes, each of which discriminates NI from some subset of I such that the union of these subsets is I.