# Decision Trees for Predictive Modeling

Padraic G. Neville, SAS Institute Inc., 4 August 1999

James Morgan opposes splitting nodes into more than two branches because the data thins out too quickly.

He also opposes subjecting splits to tests of statistical significance because the large number of required tests renders them useless. Morgan and Sonquist advocate using test data to validate a tree.

AID finds binary splits on ordinal and nominal inputs that most reduce the sum of squares of an interval target from its mean. The best split is always found, look-ahead split searches are available, and cases with missing values are excluded. AID attempts splits on nodes with larger sum-of-squares errors first, so that the program may stop after splitting some number of nodes specified by the user. The program stops when the reduction in sum of squares is less than some constant times the overall sum of squares; the constant is 0.006 by default.
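The AID criterion and stopping rule can be sketched as follows. This is an illustrative reconstruction in Python, not the original program; the function names are invented, and the 0.006 fraction is the default cited above.

```python
# Sketch of AID's split criterion: find the binary cut on an ordinal input
# that most reduces the sum of squares of an interval target from its mean.
# Names are invented; the 0.006 stopping fraction is AID's cited default.

def sse(ys):
    """Sum of squared deviations from the mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_binary_split(pairs, stop_fraction=0.006):
    """pairs: list of (x, y) with ordinal input x and interval target y.
    Returns (cut point, reduction in sum of squares), or None when the
    reduction is less than stop_fraction * the overall sum of squares."""
    pairs = sorted(pairs)
    total = sse([y for _, y in pairs])
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # can only cut between distinct ordinal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        reduction = total - sse(left) - sse(right)
        if best is None or reduction > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, reduction)
    if best is None or best[1] < stop_fraction * total:
        return None  # AID would leave this node unsplit
    return best
```

With the default fraction, a node whose best split recovers less than 0.6% of the overall sum of squares is left unsplit.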

**CHAID**

AID achieved some popularity in the early 1970s. Some nonstatisticians failed to understand over-fitting and got into trouble, and the reputation of AID soured. Gordon Kass (1980), a student of Doug Hawkins in South Africa, wanted to render AID safe and so invented the CHAID algorithm. CHAID is an acronym for "Chi-Squared Automatic Interaction Detection." CHAID recursively partitions the data with a nominal target using multiway splits on nominal and ordinal inputs. A split must achieve a threshold level of significance in a chi-square test of independence between the nominal target values and the branches, or else the node is not split. CHAID uses a Bonferroni adjustment for the number of categorical values of the input variable, thereby mitigating the bias towards inputs with many values.

CHAID also uses significance testing to determine the number of branches. The search for a split on a given input proceeds as follows. First assign a different branch to each value of the input. For each pair of branches, form a two-way table of counts of cases in each branch by target value. Find the pair of branches whose chi-square test of independence is least significant. If that pair fails to meet the threshold level of significance, merge the two branches and repeat. Otherwise, consider re-splitting branches containing three or more input values: if a binary split is found that exceeds a threshold significance level, split the branch in two and go back to merging branches.

The search ends when no more merges or re-splits are significant. The last split on the input is chosen as the candidate split for the input. Notice that it is generally not the most significant split examined. In the early 1990s, Barry de Ville introduced "exhaustive CHAID," which adopts the most significant split. Exhaustive CHAID produces splits with more branches than the original CHAID.
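The merge step of this search can be sketched as follows. The sketch is simplified to a binary target, so the chi-square p-value (one degree of freedom) has a closed form; the re-split step is omitted, and the names and alpha level are illustrative.

```python
import math
from itertools import combinations

# Toy sketch of CHAID's merge heuristic for a binary target. With one
# degree of freedom, the chi-square p-value is erfc(sqrt(x / 2)).

def chi2_stat(a, b):
    """Chi-square statistic for the 2x2 table of two branches' (count0, count1)."""
    rows = [a, b]
    col = [a[0] + b[0], a[1] + b[1]]
    n = sum(col)
    stat = 0.0
    for r in rows:
        rt = sum(r)
        for j in (0, 1):
            exp = rt * col[j] / n
            stat += (r[j] - exp) ** 2 / exp
    return stat

def merge_branches(branches, alpha=0.05):
    """branches: dict mapping branch label -> (count of target 0, count of target 1).
    Repeatedly merge the least-distinguishable pair of branches until every
    remaining pair differs significantly or only two branches remain."""
    branches = dict(branches)
    while len(branches) > 2:
        (k1, k2), stat = min(
            (((k1, k2), chi2_stat(branches[k1], branches[k2]))
             for k1, k2 in combinations(branches, 2)),
            key=lambda t: t[1])
        p = math.erfc(math.sqrt(stat / 2))  # chi-square p-value, df = 1
        if p <= alpha:
            break  # even the closest pair differs significantly
        branches[k1] = tuple(x + y for x, y in zip(branches.pop(k2), branches[k1]))
    return branches
```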

CHAID treats missing values as special, additional values. Kass cites Morgan and Messenger (1973) for this suggestion.

**An excerpt from a 1980 paper shows Kass' intentions:**

> Some specific extensions have been made to AID. They include the [treatment of missing data], and the notion of using the "most significant" split rather than the "most explanatory" which does not take into account the type of input nor its number of categories. [The new] criterion for assessing multi-way subdivisions of the data may yield a more effective analysis than the traditional binary splits that are often misleading and inefficient.

**Classification and Regression Trees**

Leo Breiman and Jerome Friedman, later joined by Richard Olshen and Charles Stone, began work with decision trees when consulting in southern California in the early 1970s. They published a book and commercial software in 1984.

Their program creates binary splits on nominal or interval inputs for a nominal, ordinal, or interval target.

An exhaustive search is made for the split that maximizes the splitting measure. The available measures for an interval target are reduction in squared error or in mean absolute deviation from the median. The measure for an ordinal target is "ordered twoing." The measures for a nominal target are reduction in the Gini index and "twoing." Cases with missing values on an input are excluded during the search for a split on that input.

Surrogate rules are applied to such cases to assign them to a branch.
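Of these measures, the reduction in the Gini index is the simplest to state; a minimal sketch, with invented function names:

```python
# Sketch of the Gini-reduction splitting measure for a nominal target.

def gini(counts):
    """Gini index of a node from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_reduction(parent, children):
    """Reduction in Gini index from splitting `parent` (class counts)
    into `children` (list of class-count tuples), weighted by branch size."""
    n = sum(parent)
    return gini(parent) - sum(sum(ch) / n * gini(ch) for ch in children)
```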

Historically, the most significant contribution is the treatment of over-fitting. The authors quickly concluded that an appropriate threshold for a stopping rule is unknowable before analyzing the data; therefore, they invented retrospective cost-complexity pruning: a large tree is created, then a subtree is found, then another subtree within the first, and so on, forming a sequence of nested subtrees, the smallest consisting only of the root node. Each subtree in the sequence has the smallest overall cost among all subtrees with the same number of leaves. For categorical targets, the cost is the proportion misclassified, optionally weighted with unequal misclassification costs. The final pruned tree is selected from the subtrees in the sequence using costs estimated from an independent validation data set or cross validation.
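A toy sketch of the "weakest link" computation behind cost-complexity pruning: for each internal node, the critical alpha is the increase in cost from collapsing its subtree, divided by the number of leaves removed. The tree encoding and names here are invented for illustration.

```python
# Toy sketch of weakest-link cost-complexity pruning. A leaf is
# ("leaf", n_misclassified); an internal node is
# ("node", n_misclassified_if_collapsed, left, right).

def leaves_and_cost(tree):
    """Return (number of leaves, total misclassification cost of the leaves)."""
    if tree[0] == "leaf":
        return 1, tree[1]
    l_n, l_c = leaves_and_cost(tree[2])
    r_n, r_c = leaves_and_cost(tree[3])
    return l_n + r_n, l_c + r_c

def weakest_link_alpha(tree):
    """Smallest alpha at which some internal node's subtree should collapse:
    alpha = (cost if collapsed - cost of subtree) / (leaves removed)."""
    if tree[0] == "leaf":
        return float("inf")
    n_leaves, cost = leaves_and_cost(tree)
    alpha_here = (tree[1] - cost) / (n_leaves - 1)
    return min(alpha_here, weakest_link_alpha(tree[2]), weakest_link_alpha(tree[3]))
```

Pruning at the weakest link for increasing alpha yields exactly the nested sequence of subtrees described above.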

The program contains some other important features: prior probabilities that are incorporated into the split search; surrogate splits for missing values; a heuristic search for a split on a linear combination of variables, the state of the art for about 15 years until the OC1 and SVM tree algorithms; evaluation of categorical trees in terms of purity of class probabilities instead of classification; and bagging and arcing. The original commercial program is still available through Salford Systems.

**ID3, C4.5, C5**

Ross Quinlan (1979, 1993) wrote his first decision tree program in 1978 while visiting Stanford University.

His ID3 (Iterative Dichotomiser 3) program and its replacement, C4.5, have served as the primary decision tree programs in the Artificial Intelligence and Machine Learning communities. He attributes the original ideas to Earl Hunt's (1966) "Concept Learning Systems," but these are now buried beneath decades of active tinkering.

C5 creates binary splits on interval inputs and multiway splits on nominal inputs for a nominal target. The split is chosen that maximizes the gain ratio: Gain ratio = reduction in entropy / entropy of split. (Let P(b) denote the proportion of training cases a split assigns to branch b, b = 1 to B. The entropy of a split is defined as the entropy function applied to {P(b): b=1 to B}.) For nominal inputs, each category is first assigned to a unique branch, and then, in steps, two branches are merged, until only two branches exist. The split with the maximum gain ratio among splits examined becomes the candidate split for the input. This search method, of course, is heuristic and might not find the nominal split with the largest gain ratio. The search on an interval input will find the best split.
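The gain ratio defined above can be computed directly; a minimal sketch, with invented names:

```python
import math

# Sketch of the gain ratio: reduction in target entropy divided by the
# entropy of the branch proportions {P(b)}.

def entropy(counts):
    """Entropy (base 2) of the distribution given by counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def gain_ratio(parent, branches):
    """parent: class counts at the node; branches: class counts per branch."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)
    split_info = entropy([sum(b) for b in branches])  # entropy of the split
    return gain / split_info
```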

Cases with missing values are excluded from the split search on that input and also from the numerator of the gain ratio. The entropy of the split is computed as if missing values constitute an additional branch.

When a missing value prevents applying a splitting rule to a case, the case is replaced by B new ones, each assigned to a different branch and given a weight equal to the proportion of nonmissing training cases assigned to that branch. The posterior probabilities of the original case equal the weighted sum of the probabilities of the generated observations.
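A sketch of this fractional-weighting scheme (the function name and argument layout are invented):

```python
# Sketch of C4.5-style handling of a case whose split value is missing:
# send a fractional copy down every branch, weighted by the fraction of
# nonmissing training cases that went to that branch, then combine the
# branches' posterior class probabilities.

def posterior_with_missing(branch_fractions, branch_posteriors):
    """branch_fractions: weights summing to 1, one per branch.
    branch_posteriors: class-probability lists, one per branch.
    Returns the combined posterior class probabilities for the case."""
    n_classes = len(branch_posteriors[0])
    return [sum(w * p[j] for w, p in zip(branch_fractions, branch_posteriors))
            for j in range(n_classes)]
```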

Ross Quinlan advocates retrospective pruning instead of stopping rules. If enough data were available, the pruning process would use data withheld from training the tree to compare error rates of candidate sub-trees.

The software does not assume data may be withheld from training, so it implements "pessimistic" pruning.

In each node, an upper confidence limit of the number of misclassified cases is estimated assuming a binomial distribution around the observed number misclassified. The confidence limit serves as an estimate of the error rate on future data. The pruned tree minimizes the sum over leaves of the upper confidence limits.
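This pessimistic estimate can be sketched by solving for the binomial upper confidence limit numerically; the bisection search below is illustrative, not Quinlan's code, and CF = 0.25 is the confidence level commonly cited as C4.5's default.

```python
import math

# Sketch of the "pessimistic" error estimate at a leaf: the upper
# confidence limit of the error rate, given e observed errors out of n
# cases, under a binomial model.

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1))

def upper_error_limit(e, n, cf=0.25):
    """Smallest error rate p with P(X <= e | n, p) <= cf, found by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):  # the CDF is monotone decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return hi
```

Note that even a leaf with zero observed errors gets a nonzero estimated error rate, which shrinks as the leaf's case count grows.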

Some other features of C5 are unequal misclassification costs, fuzzy splits on interval inputs, and boosting.

**QUEST**

QUEST was introduced by Wei-Yin Loh and Yu-Shan Shih (1997) and is an acronym for "Quick, Unbiased, Efficient, Statistical Tree." To become quick and unbiased, this algorithm selects an input variable to split on before searching for a split, thereby eliminating the time required for searching on most inputs and eliminating the bias towards nominal inputs inherent when relying on candidate splitting rules to select the input variable.

A simplified version of the QUEST input selection is as follows. For an interval input, perform a J-way ANOVA, where J is the number of target values. For a nominal input with M values, compute a chi-square statistic from the J by M contingency table. Select the input with the smallest Bonferroni adjusted p-value.
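The final selection step reduces to a Bonferroni adjustment over the inputs tested; a minimal sketch, assuming the raw p-values have already been computed from the ANOVA and chi-square tests described above:

```python
# Sketch of QUEST-style input selection: Bonferroni-adjust each input's
# p-value for the number of inputs tested, then pick the smallest.

def select_input(raw_p_values):
    """raw_p_values: dict mapping input name -> unadjusted p-value from its
    ANOVA (interval input) or chi-square (nominal input) test.
    Returns (selected input name, dict of adjusted p-values)."""
    k = len(raw_p_values)
    adjusted = {name: min(1.0, p * k) for name, p in raw_p_values.items()}
    return min(adjusted, key=adjusted.get), adjusted
```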

If a nominal input is selected, it is transformed to an interval one. A linear discriminant analysis is then performed on the single selected input, and one of the discriminant boundary points becomes the split point.

Splits on linear combinations of inputs are also possible. QUEST searches for binary splits on nominal or interval inputs for a nominal target. Cases with missing values on an input are excluded from calculations with that input. Surrogate rules assign such cases to branches. Recursive partitioning creates a large tree that is retrospectively pruned using cross validation.

The paper by Loh and Shih is available at http://www.stat.wisc.edu/~loh. The paper states that QUEST is much better than the exhaustive search methods of Breiman et al. (1984) in terms of variable selection bias and speed, but neither approach dominates in terms of classification accuracy, stability of split points, or size of tree. The idea of first selecting the variable using an F statistic and then using LDA to choose a split point had already appeared in the FACT tree algorithm by Loh and Vanichsetakul (1988).

**OC1**

OC1 was introduced by Murthy et al. (1994) and is an acronym for "Oblique Classifier 1." The paper is available at http://www.cs.jhu.edu/~murthy.

OC1 searches for binary splits on linear combinations of interval inputs for a nominal target. A candidate split can be evaluated as a maximum or sum over branches of the count of the rarest target value or as the reduction in the sum of squared errors, treating the categorical target values as numeric. Missing values are not accepted. Recursive partitioning creates a tree that over-fits the data. The tree is then pruned retrospectively using a measure of node impurity.

A synopsis of the split search algorithm is this: Choose an initial linear combination at random. Apply a heuristic hill-climbing algorithm to find a local optimum. Make a random change or a random restart and climb again.
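A toy rendering of that synopsis, with a simplified impurity measure (the count of minority-class cases across the two branches) standing in for OC1's actual goodness measures; all names are invented.

```python
import random

# Toy sketch of OC1-style search for an oblique split w . x > 0 (a bias
# term can be folded in by appending a constant 1 to each x).

def impurity(w, data):
    """Count of minority-class cases summed over the two branches."""
    sides = {True: [], False: []}
    for x, y in data:
        sides[sum(wi * xi for wi, xi in zip(w, x)) > 0].append(y)
    total = 0
    for ys in sides.values():
        if ys:
            total += min(ys.count(0), ys.count(1))
    return total

def oblique_search(data, dim, restarts=5, steps=50, seed=0):
    """Random restarts around a coefficient-perturbing hill climb."""
    rng = random.Random(seed)
    best_w, best_imp = None, float("inf")
    for _ in range(restarts):                      # random restart
        w = [rng.uniform(-1, 1) for _ in range(dim)]
        imp = impurity(w, data)
        for _ in range(steps):                     # hill climbing
            i = rng.randrange(dim)                 # perturb one coefficient
            cand = list(w)
            cand[i] += rng.uniform(-1, 1)
            c_imp = impurity(cand, data)
            if c_imp <= imp:                       # accept non-worsening moves
                w, imp = cand, c_imp
        if imp < best_imp:
            best_w, best_imp = w, imp
    return best_w, best_imp
```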

**SAS algorithms**

SAS algorithms incorporate and extend most of the good ideas discussed for recursive partitioning with univariate splits. The target and inputs can be nominal, ordinal, or interval. The user specifies the maximum number of branches from a split, thus allowing binary trees, bushy trees, or anything in between.

Splits can be evaluated as a reduction in impurity (least squares, Gini index, or entropy) or as a test of significance (chi-square test or F test). Significance tests allow Bonferroni adjustments, as done in CHAID, to counter the bias of many categorical values and allow more adjustments for the number of inputs and the depth in the tree of the split.

The program always finds the best split on an input unless too many candidate splits would have to be examined. The user specifies this limit. The limited search on an input begins with a split into many branches and then proceeds in steps of merging pairs of branches. This heuristic search also considers switching input values between branches after every merge, thereby examining many more candidates than merging alone (as done in C5, for example). "Switching" replaces the "re-splitting" finesse of CHAID. The best split examined is adopted, as done in "exhaustive CHAID" and not in the original CHAID.
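The switching refinement can be sketched as follows, using a Gini-based worth as a stand-in for whichever splitting criterion is in force; all names are invented.

```python
# Toy sketch of "switching": after merging down to two branches, try moving
# each input value to the other branch and keep the move if it improves the
# split criterion (here, weighted reduction in Gini index).

def gini(c):
    n = sum(c)
    return (1.0 - sum((x / n) ** 2 for x in c)) if n else 0.0

def worth(assign, counts):
    """assign: {input value: branch 0 or 1}; counts: {value: (n_class0, n_class1)}.
    Returns the weighted decrease in Gini index for this two-branch split."""
    branch = [[0, 0], [0, 0]]
    for v, (c0, c1) in counts.items():
        branch[assign[v]][0] += c0
        branch[assign[v]][1] += c1
    total = [branch[0][0] + branch[1][0], branch[0][1] + branch[1][1]]
    n = sum(total)
    return gini(total) - sum(sum(b) / n * gini(b) for b in branch)

def switch_values(assign, counts):
    """Repeatedly move a single input value across branches while it helps."""
    assign = dict(assign)
    improved = True
    while improved:
        improved = False
        for v in list(assign):
            trial = dict(assign)
            trial[v] = 1 - trial[v]
            if len(set(trial.values())) < 2:
                continue  # the move would empty a branch
            if worth(trial, counts) > worth(assign, counts):
                assign = trial
                improved = True
    return assign
```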

Missing values may optionally be treated as a special, acceptable value, as in CHAID. Surrogate rules, if suitable, assign cases with missing values to a branch, as in the algorithms of Breiman et al. (1984).

Many options control the stopping and retrospective pruning of the tree. As in CHAID, a limit on the level of significance may stop tree growth. After the tree stops growing, smaller trees are found that achieve the best assessment value for their size. The user has great latitude in specifying an assessment measure.

Unequal misclassification costs of Breiman et al. (1984) are implemented and extended to allow different costs for different cases and to let the user specify new predictable outcomes that do not appear as target values. The assessment may be done on the whole tree or on the half of the tree that predicts the most desired outcomes. Assessing only the better half of the tree gives an indication of the "lift" the tree produces.

The tree algorithms are included in the SAS Enterprise Miner data mining solution. The SAS Enterprise Miner provides a visual programming environment for predictive modeling. Prior probabilities, unequal misclassification costs, case-dependent costs, bagging and arcing apply to all of the modeling tools. The tree can incorporate prior probabilities into the splitting criterion or just use them to adjust the posterior probabilities. The tree can create an indicator variable for each leaf. These variables automatically enter into other models, such as regressions, placed after the tree.

## References

Berry, M. and Linoff, G. (1997) provide a non-mathematical introduction to the methods of Breiman et al., Kass (CHAID), and Quinlan (C4.5). Ripley (1996) provides a mathematical introduction to trees. Devroye et al. (1996) provide an advanced mathematical introduction to trees.

http://www.recursive-partitioning.com provides an excellent bibliography, links to major tree and rule induction software web sites, and a few comparison studies. The following references are included in the bibliography on the web.

Alexander, W.P. and Grimshaw, S.D. (1996), "Treed Regression," Journal of Computational and Graphical Statistics 5 (2), 156-170. Their algorithm searches for splits that minimize squared errors from regression in branches. All regressions on one independent variable are considered for each split candidate.