# Analysis of the Convergence and Generalization of AA1

Christophe ...

*In Journal of Parallel and Distributed Computing, vol. 26, pp. 125-131, 1995.*

Lemma 2 shows that when any two growth nodes, discriminating NI from two subsets of I, are combined, the resulting parent node discriminates NI from the union of these subsets. Theorem 1, which shows that the result of Node Combination is OSDN, follows immediately by induction.

Theorem 2 then shows that AA1 fulfills the instance set after NI has been processed. The result follows immediately from the finiteness of the add-list.

**Lemma 1:**

Let G1,...,Gk be the growth nodes resulting from Node Selection and Node Addition. Let I1,...,Ik be the respective sets of instances that each Gi discriminates from NI. Then I=I1∪...∪Ik.

**Proof:**

If Node Selection is sufficient, then the result follows trivially from the algorithm, since Node Selection consists precisely of recruiting nodes that each discriminate NI from some subset of I until the union of these subsets is I. Suppose then that Node Addition is necessary.

Let G1,...,Gj (j ≤ k) be the growth nodes resulting from Node Selection alone, and let I′=I1∪...∪Ij.

We must show that Gj+1,...,Gk resulting from Node Addition discriminate NI from I″=I − I′, or equivalently that I″=Ij+1∪...∪Ik. For each instance in I″, there exists at least one variable whose value is the complement of the corresponding variable in NI (otherwise the current instance set would be inconsistent).

One such variable is chosen for each instance in I″. Pairs of these selected variables serve as inputs to Gj+1,...,Gk, which are set to the AND or OR function so that their output is one value (1 or 0) when both inputs are matched and the complement when either one is not matched. Hence, each added node outputs one value for NI (one of the inputs is not matched) and the complement for the instances of I″ from which its input variables were selected (both inputs are matched). Therefore, Gj+1,...,Gk discriminate NI from subsets Ij+1,...,Ik, respectively, of I″. Since a variable is selected from each of the instances in I″, it follows immediately that I″=Ij+1∪...∪Ik.
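As an illustrative sketch of the construction above, here is the OR case only; the encoding of instances as 0/1 tuples and all the data are hypothetical, invented for this example rather than taken from the paper.

```python
# Hedged sketch of Node Addition, OR case only; instances are 0/1 tuples.
def make_added_node(var_a, val_a, var_b, val_b):
    """OR node over two literals (variable index, required value);
    each required value is the complement of NI at that position."""
    def node(x):
        return int(x[var_a] == val_a or x[var_b] == val_b)
    return node

NI = (1, 1, 0, 1)   # hypothetical new instance
x1 = (0, 1, 0, 1)   # differs from NI in variable 0
x2 = (1, 0, 0, 1)   # differs from NI in variable 1

# one complementary literal selected per instance, paired into one node
node = make_added_node(0, 0, 1, 0)

print(node(NI))   # 0: neither literal matches NI
print(node(x1))   # 1: x1 matches its selected literal
print(node(x2))   # 1: x2 matches its selected literal
```

The added node outputs 0 for NI and 1 for both instances, so it discriminates NI from {x1, x2}.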

**Lemma 2:**

Let G1 and G2 be two growth nodes other than CTN and OSDN. Let I1 and I2 be the sets of current instances that are discriminated from NI by G1 and G2, respectively. Let PN be the parent node created when G1 and G2 are combined. Then PN discriminates NI from I1∪I2.

**Proof:**

G1 and G2 output either 0 or 1 for NI (? is not possible, otherwise they would not discriminate NI and thus could not be growth nodes). Since PN≠NTN, it does not matter whether PN outputs 0 or 1 for NI. Hence, there are eight possible functions for PN, summarized in [2, Table 6.2]. We give the complete proof for the case where both G1 and G2 output 1 for NI. Then the function of PN is the AND of the outputs of G1 and G2. Now, because G1 discriminates NI from the instances in I1, G1 outputs 0 for every instance in I1. Similarly, G2 outputs 0 for every instance in I2. Applying the AND function pairwise to G1 and G2 thus causes PN to (1) output 1 for NI and (2) output 0 for every instance in I1 and every instance in I2. It follows that PN discriminates NI from I1∪I2.
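The case proved above can be checked with a small sketch. The growth-node functions and instances below are hypothetical, chosen only to satisfy the case's hypotheses; this is not the paper's implementation.

```python
# Sketch of Node Combination for the case where both growth nodes
# output 1 for NI and PN's function is AND.

def combine_and(g1, g2):
    return lambda x: g1(x) & g2(x)

NI = (1, 0, 1)
I1 = [(0, 0, 1)]                 # instances G1 discriminates from NI
I2 = [(1, 1, 1)]                 # instances G2 discriminates from NI

G1 = lambda x: int(x[0] == 1)    # outputs 1 for NI, 0 for I1
G2 = lambda x: int(x[1] == 0)    # outputs 1 for NI, 0 for I2

PN = combine_and(G1, G2)
print(PN(NI))                    # 1: PN still matches NI
print([PN(x) for x in I1 + I2])  # [0, 0]: PN rejects all of I1 ∪ I2
```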

**Theorem 1:**

Let G1,...,Gk be the growth nodes resulting from Node Selection and Node Addition. Let I1,...,Ik be the respective sets of instances that each Gi discriminates from NI. Let PN be the node obtained after the Gi's have been combined. Then PN is OSDN.

**Proof:**

Each node that is the result of combining two growth nodes is labeled as a growth node and can therefore participate in combination with other nodes. Any pair of growth nodes may combine.

Hence, the combination process happens finitely many times (exactly k−1 combinations) and halts when there is a single growth node left, namely PN. By induction on the result of Lemma 2, PN discriminates NI from every instance in I1∪...∪Ik. But, from Lemma 1, we have I=I1∪...∪Ik. It follows that PN discriminates NI from every instance in I, and hence PN is OSDN.
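A minimal sketch of this induction folds k hypothetical growth nodes into a single node through k−1 pairwise AND combinations; the node functions and data are invented for illustration and assume the all-outputs-1-for-NI case of Lemma 2.

```python
from functools import reduce

NI = (1, 1, 0)
# each growth node outputs 1 for NI and 0 for any instance differing
# from NI at its own variable (hypothetical discrimination scheme)
growth = [lambda x, i=i: int(x[i] == NI[i]) for i in range(3)]

# k-1 = 2 pairwise combinations leave a single node, PN
PN = reduce(lambda g1, g2: (lambda x: g1(x) & g2(x)), growth)

print(PN(NI))          # 1: PN matches NI
print(PN((0, 1, 0)))   # 0: differs from NI at variable 0
print(PN((1, 0, 0)))   # 0: differs from NI at variable 1
```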

**Theorem 2:**

NTN is a complete discriminant node such that AA1 fulfills the new instance set.

**Proof:**

There are only four ways, depending on the polarity of NI and the output of OSDN for NI, that OSDN and CTN may combine to produce NTN, as summarized in [2, Table 6.3]. We give the complete proof for the case where OSDN outputs a 1 for NI and NI is positive. Then NTN's function is the OR of the outputs of CTN and OSDN. Since OSDN outputs 1 for NI and NI is positive, the N column of OSDN contains only 0's. Since CTN fulfills the old instance set, the N column of CTN contains only 0's. Hence, applying the OR function pairwise to the N columns of CTN and OSDN produces an N column of NTN that contains only 0's. Similarly, the P column of CTN contains 1's in every cell except the one corresponding to NI. But by assumption the cell corresponding to NI contains a 1 in the P column of OSDN. Hence, applying the OR function pairwise to the P columns of CTN and OSDN produces a P column of NTN that contains only 1's. Consequently, NTN is a complete discriminant node, and since NTN outputs 1 for positive instances and 0 for negative instances, AA1 fulfills the new instance set.
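The column-wise OR argument can be sketched directly on toy node-table columns. The column data below are invented and use a flat encoding (both columns the same length, NI's cell placed last); only the constraints stated in the proof are assumed.

```python
# Pairwise OR of P and N node-table columns; last cell corresponds to NI.
def or_cols(a, b):
    return [x | y for x, y in zip(a, b)]

CTN_P = [1, 1, 1, 0]   # CTN fulfills the old set: 1's except NI's cell
CTN_N = [0, 0, 0, 0]   # N column of CTN contains only 0's
OSDN_P = [0, 0, 0, 1]  # by assumption, 1 in the cell corresponding to NI
OSDN_N = [0, 0, 0, 0]  # N column of OSDN contains only 0's

NTN_P = or_cols(CTN_P, OSDN_P)
NTN_N = or_cols(CTN_N, OSDN_N)
print(NTN_P)   # [1, 1, 1, 1] -- all 1's in the P column
print(NTN_N)   # [0, 0, 0, 0] -- all 0's: complete discriminant node
```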

Theorems 3, 4 and 5 show that AA1 always fulfills the instance set after any one of the three self-deletion procedures is applied. It follows then from these and the above that AA1 is correct.

**Remark:**

A node's output cannot be inverted in AA1; however, since only the AND and OR functions are used, Boolean logic guarantees that an equivalent result can be obtained by inverting the node’s inputs and changing the node’s function from OR to AND, or vice versa. The entries in the NT can then simply be inverted (note that ? is its own inverse).
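The remark rests on De Morgan's laws, which a quick exhaustive check over all Boolean inputs confirms:

```python
from itertools import product

# NOT(a OR b) == (NOT a) AND (NOT b), and dually for AND/OR
for a, b in product((0, 1), repeat=2):
    assert 1 - (a | b) == (1 - a) & (1 - b)
    assert 1 - (a & b) == (1 - a) | (1 - b)
print("De Morgan equivalences hold")
```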

**Theorem 3:**

Complete discriminant deletion is correct.

**Proof:**

Let CN be a complete discriminant node, other than the top node. The only nodes needed to build up to CN are those that are in the directed graph rooted at CN. All the other nodes can be removed and the result follows from the above remark, since CN can always be made to fulfill the instance set.

**Theorem 4:**

Non-discriminant deletion is correct.

**Proof:**

This result follows from the fact that if NN is a non-discriminant node, DN a discriminant node, and PN the parent node resulting from combining NN and DN, then PN always discriminates a subset of DN and can thus be replaced by it. We prove this fact for non-inverted inputs. There are three cases.

Case 1: NN contains only one value (0 or 1). It is clear then that PN is either NN or DN. So PN discriminates a subset of DN (i.e., either nothing at all or the same as DN).

Case 2: NN contains two values ((0 and ?) or (1 and ?)). There are three subcases: a) PN’s function is AND and NN contains 1s and ?s or PN’s function is OR and NN contains 0s and ?s, b) PN’s function is AND and NN contains 0s and ?s, c) PN’s function is OR and NN contains 1s and ?s.

In case a), PN is DN with some 0s or 1s possibly replaced by ?, and thus PN clearly discriminates a subset of DN. In case b), PN is NN with some ?s possibly replaced by 0, hence PN contains only 0s and ?s and is thus a non-discriminant node. So PN trivially discriminates a subset (namely the empty one) of DN. Case c) is similar to case b), but PN now contains only 1s and ?s. In all three possible subcases, PN discriminates a subset of DN.

Case 3: NN contains all three values (0, 1 and ?). Assume that PN's function is the AND function (the case OR is complementary). The 0s and 1s must be in the same column of NN and the other column must contain only ?, otherwise we would be in Case 2 (or NN would be a discriminant node). Without loss of generality, assume that the N column contains only ?. Then the N column of PN will have 0s wherever DN had 0s and ? everywhere else. Let p be a positive cell containing 1 in DN, and Ip the set of discriminated discordant cells (i.e., cells containing 0 in the N column). The p cell of PN contains 0, 1 or ?, and each cell of Ip contains 0. Thus PN discriminates a subset of DN for p (i.e., the same or nothing). If p contains 0 in DN, it also contains 0 in PN. However, each cell of Ip contains ? in PN, so PN no longer discriminates p. Thus, PN discriminates a subset (namely the empty set) of DN for p. The above applies to any cell p of DN, hence PN discriminates a subset of DN.
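The three-valued arithmetic on node-table cells used throughout this proof can be sketched as follows. This toy check illustrates Case 2b (NN holds only 0s and ?s and PN's function is AND); the column data are invented, and `?` is encoded as the string `'?'`.

```python
# Three-valued AND on node-table cells; '?' means "don't know".
def and3(a, b):
    if a == 0 or b == 0:
        return 0            # a 0 on either input forces 0
    if a == '?' or b == '?':
        return '?'          # unknown unless forced to 0
    return 1

NN = [0, '?', 0, '?']       # non-discriminant column: only 0s and ?s
DN = [1, 0, '?', 1]         # discriminant column (hypothetical)

PN = [and3(a, b) for a, b in zip(NN, DN)]
print(PN)   # [0, 0, 0, '?'] -- again only 0s and ?s: non-discriminant
```

As the proof states, PN is NN with some ?s replaced by 0, so it discriminates only the empty subset of DN.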

**Theorem 5:**

Locally redundant deletion is correct.

**Proof:**

Both cases of locally redundant deletion applicable to AA1 are essentially special cases of non-discriminant deletion. Therefore, this theorem follows almost immediately from Theorem 4.

## 4. Discussion

**Inconsistency**

One important aspect of real-world training sets is their inherent inconsistency. There are at least two sources of inconsistency in a training set. The first is that in real-world situations the training set is obtained from experimentation and may thus contain errors. The second is that certain applications may not be characterized as functions, but rather as distributions (e.g., the task of classifying objects into overlapping classes).

A very important ability, absent in most training-set learners but found in AA1, is the ability to use general rules (short-hand instances) in learning. This allows one to construct an instance set that reflects human learning by first presenting general rules to the system, followed by exceptions. It has long been a criticism of neural networks that they cannot incorporate general rules and are thus very limited models of human learning. AA1 handles both very specific examples and very general rules. A side effect of this ability, however, is the introduction of inconsistencies (in the form of exceptions) into the instance set.

In AA1, inconsistency is resolved by giving precedence to newer instances. This simple scheme, though reasonable, has several disadvantages. It requires a smart teacher who knows something about the target function and can thus present the instances in the correct order. Also, it is not necessarily best to try to keep the instance set consistent (nature is replete with inconsistencies). Current research seeks to expand the learning scheme by removing AA1's rigid precedence for newer rules and by allowing inconsistencies to subsist (see for example [1]).

**Complexity**

The hierarchical architecture of AA1 guarantees that execution is O(log n), where n is the number of nodes in the network. During learning, note that only the training instance requires full execution of the learning algorithm. Modified instances in the add-list are still fulfilled and will therefore not cause Node Selection, Addition, or Combination. Learning is bounded by a polynomial function of the number of instances in the instance set.
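As a toy illustration of the logarithmic-depth claim, the figures below assume a full, balanced combination tree; this balance is our simplifying assumption for the sketch, not a property the text guarantees for every network.

```python
import math

# For a full binary tree: total node count vs. the number of layers
# an input signal traverses (which bounds execution time).
for leaves in (8, 64, 1024):
    nodes = 2 * leaves - 1              # internal nodes plus leaves
    depth = int(math.log2(leaves)) + 1  # layers from inputs to the root
    print(nodes, depth)                 # depth grows as log of node count
```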

**Generalization**

In Section 3, we proved that AA1 converges on any arbitrary Boolean instance set. A more interesting aspect of AA1 is its generalization, that is, its ability to induce new knowledge from past experience. The instance set consists of only a subset of the complete function to be learned, and AA1 must guess the remainder of the function. Even though there is no guarantee that the final network is minimal, AA1's bias towards simplicity results not only in increased parsimony, but also in the ability to generalize. This bias is manifest in two ways. First, for each new instance, the output of the current network is checked, and when it is found concordant no changes are made to the network.

Second, after a new instance has been processed, unnecessary nodes are deleted. Empirical studies confirm these findings.

The originality of AA1 lies in the fact that it does not generalize on the basis of proximity of the new input to one of the learned instances, but rather on the basis of how well the network is able to discriminate the new input from learned features of the opposite class.

**Extensions**

Several extensions can be made to AA1. For example, the greedy search used in Node Selection, as well as the choice of inputs to be used to create new nodes in Node Addition, could be enhanced by heuristics. Another important problem to be addressed is AA1's memory requirement at each node. It is clear that each node need not store a D column in its node table. A global D column would be sufficient to meet the goal of Node Selection. However, it is still unclear how to do away with (or effectively replace) the P and N columns.

## 5. Simulation Results

AA1 was tested on several real-world applications drawn from the Irvine machine learning database [8]. No minimization was attempted, and self-deletion was limited to complete discriminant deletion. The results are summarized in the table below. For Congressional Voting Data and Hepatitis, the results are averages (on the test set) over 10 runs of AA1. In each run, the instance set and the test set were regenerated, with 9/10 of the data used for training and 1/10 used for testing.

## 6. Summary

This paper presents a formal review of Adaptive Algorithm 1 (AA1) of Adaptive Self-Organizing Concurrent Systems (ASOCS). After a brief overview of AA1 and a description of the algorithm, a detailed proof of correctness for AA1 is given. A series of lemmas and theorems that capture the actions of the algorithm are stated and formally proved, thus showing that AA1 is guaranteed to converge on any arbitrary Boolean instance set. The question of inconsistency and AA1's solution to it are discussed. Intuitive reasons for AA1's ability to generalize are briefly outlined. Extensions to AA1 are suggested. Results of simulations of AA1 on real-world data sets are also reported and show that AA1 has promising generalization performance.

## References

[1] Giraud-Carrier, C. A Precept-Driven Learning Algorithm. Master's Thesis, Brigham Young University, Department of Computer Science, 1993.

[2] Martinez, T.R. Adaptive Self-Organizing Networks. Ph.D. Thesis, UCLA Tech. Rep. - CSD 860093, 1986.

[3] Martinez, T.R., and Vidal, J.J. Adaptive Parallel Logic Networks. Journal of Parallel and Distributed Computing, 5, 1 (Feb. 1988), 26-58.