# «Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

The surest way to detect over-fitting is to apply the tree to new data. To detect what size of a tree provides a good predictive fit, prune the large tree to subtrees of several sizes, and apply all the subtrees to the new data. Compare the fit to the new data as the tree size increases. Prune to the sub-tree where the fit flattens out. In practice, the choice of subtree is not always clear-cut. The improvement in fit will decrease gradually, sometimes bouncing up and down around a trend. The analyst must decide on the details for choosing the tree size.

**Spurious splits due to too many variables**

An appealing feature of trees with univariate splits is the resilience against the curse of dimensionality: faced with many inputs, the tree creation algorithm finds a few with predictive ability. Even with more inputs than observations, an algorithm can create a useful tree. In successful examples like these, inputs with predictive ability beat out the inputs that might have only a chance association with the target. The more inputs that are irrelevant, the greater the chance that random association of some irrelevant input will be with the target, and the larger the true association of a predictive input must be to be selected for a splitting rule. When searching for a few good inputs among a thousand or more irrelevant ones, the creation of spurious splits is an important concern.

For a good split, the distribution of new data in the branches will be similar to the distribution of the training data. A spurious split will have almost no worth when evaluated on new data, and therefore the distributions of the new data in the branches will be similar to those in the node being split, and not like the distributions of the training data in the branches. Consequently, spurious splits are easy to detect when enough new data are available.

When spurious splits are detected retrospectively, the only remedy is to prune them. However, the same spurious splits could be avoided by using the new data during the split search. The algorithm would use the training data to select a split for each input and then use the new data to compare the selected splits between inputs.

When significance tests are used to evaluate a split, the test can be modified to account for the number of inputs. A simple way to do this is to multiply the p-value of a test statistic by the number of inputs before comparing the statistic to a threshold level of significance. A better method is to base the p-value on the distribution of the best V test statistics, where V equals the number of inputs. Both of these methods work by making it harder to accept any individual split. The more irrelevant inputs, the better a truly predictive split must be to be accepted. Modifying a test for more and more inputs eventually prevents accepting any input, no matter how good.

The best strategy for avoiding spurious results from too many inputs may be to reduce the number of inputs to begin with. Although trees are resilient against many inputs, they are not complete cures.

Spurious splits due to too many nominal values Spurious splits are also common using nominal inputs with many values. Consider an extreme example in which an input contains a different value for each case, as would occur with a customer id number. The algorithm is free to assign a case to any branch. The algorithm strives to assign cases with different target values to different branches. With a binary target and customer id as an input, the algorithm will do so. Of course, the splitting rule is completely unreliable.

Reliability stems ultimately from repetition: if snow has fallen every January in a certain city every year for the past ten years, it is likely to snow there next year also. If many cases occur with a particular nominal input value, and if the target value is the same for most of them, then the input value predicts the target value somewhat reliably. Even if cases with the input value have diverse target values so that the input value poorly predicts the target, no algorithm can use the input to untangle the target, and so no algorithm can fake a good prediction. The key for finding a reliable nominal split is to use only input values that are represented with many cases. Treat input values with too few cases as if they were missing values. The number of cases depends on the distribution of the target values for the given input value.

An alternative strategy to avoid spurious splits with nominal inputs is to use a significance test that takes account of the number of input values. The worth of a split is penalized more for using a nominal input with more values. The consequences are similar to those resulting from adjusting significance tests for many inputs: inputs with many nominal values are effectively eliminated. Other strategies are to combine values so as to reduce their number or just to exclude the input at the outset.

The Frontier An important new technique is emerging from research in predictive modeling: strategically re-sampling the data and fitting a model to each sample produces an ensemble that is more accurate and stable than any single model. Decision trees have been the preferred model for comprising the ensemble. The instability of trees, so disastrous for interpretation, is an asset for creating ensembles. The rest of this section explains the phenomenon more fully.

** Combining models**

An average of several measurements is often more accurate than a single measurement. This happens when the errors of individual measurements more often cancel each other than reinforce each other. An average is also more stable than an individual measurement: if different sets of measurements are made on the same object, their averages would be more similar than individual measurements in a single set are.

A similar phenomenon exists for predictive models: a weighted average of predictions is often more accurate and more stable than an individual model prediction. Though similar to what happens with measurements, it is less common and more surprising. A model relates inputs to a target. It seems surprising that a better relationship exists than is obtainable with a single model. Combining the models must produce a relationship not obtainable in any individual model.

An algorithm for training a model assumes some form of the relationship between the inputs and the target.

Logistic regression assumes a linear relation. Decision trees assume a constant relation within ranges of the inputs. Neural networks assume a specific nonlinear relationship that depends on the architecture and activation functions chosen for the network.

Combining predictions from two different algorithms may produce a relationship of a different form than either algorithm assumes. If two models specify different relationships and fit the data well, their average is apt to fit the data better. If not, an individual model is apt to be adequate. In practice, the best way to know is to combine some models and compare the results.

**Ensembles**

An "ensemble" or "committee" is a collection of models regarded as one combined model. The ensemble predicts an observation as an average or a vote of the predictions of the individual model predictions. The different individual models may give different weights to the average or vote.

For an interval target, an ensemble averages the predictions. For a categorical target, an ensemble can average the posterior probabilities of the target values. Alternatively, the ensemble can classify an observation into the class that most of the individual models classify it. The latter method is called voting and is not equivalent to the method of averaging posteriors. Voting produces a predicted target value but does not produce posterior probabilities consistent with combining the individual posteriors.

**Unstable algorithms**

Sometimes applying the same algorithm to slightly different data produces a very different model. Stepwise regression and tree-based models behave this way when two important inputs have comparable predictive ability. When a decision tree creates a splitting rule, only one input is chosen. Changing the data slightly may tip the balance in favor of choosing another input. A split on one input might segregate the data very differently than a split on the other input. In this situation, all descendent splits are apt to be different.

The unstable nature of tree-based models renders the interpretation of trees tricky. A business may continually collect new data, and a tree created in June might look very different than one created the previous January. An analyst who depended on the January tree for understanding the data is apt to become distrustful of the tree in June, unless he investigated the January tree for instability. The analyst should check the competing splitting rules in a node. If two splits are comparably predictive and the input variables suggest different explanations, then neither explanation tells the whole story.

** Bagging**

**The instability of an algorithm makes possible a different strategy for creating an ensemble of models:**

instead of varying the algorithm, apply the same algorithm to different samples of the original data. If the algorithm is unstable, the different samples yield different models. The average of the predictions of these models might be better than the predictions from any single model. If the algorithm is stable, then the different samples yield similar models, and their average is no better than a single model.

The usual procedure for creating one of the sampled training sets is as follows. Pick a data case at random from the original training data set and copy it to the sampled data set. Repeat this a second time. The same case might be copied twice, but more likely two different cases will be chosen. Keep repeating this process until the sampled data set contains N cases, where N is arbitrary but is usually equal to the number of cases in the original training data set. This process is called sampling with replacement.

"Bagging" refers to the creation of an ensemble of models using the same algorithm on data sets sampled with replacement from a single training data set. Bagging stands for "bootstrap aggregation," and it is the invention of Leo Breiman (1996). He uses the voting method for classification.

**Arcing**

Bagging uses independent samples of the original data. Arcing takes a sample that favors cases that the current ensemble predicts relatively poorly. The first model trains on all of the original data. Cases that the first model incorrectly classify are preferentially sampled for the second training data set. A second model is trained, and the two models are combined. Cases for which the combined model performs poorly are preferentially sampled for the third training data set, and so on.

Leo Breiman (1998) introduced the term "arcing" to stand for "adaptive resampling and combining," and this seemed a better description of the process of "boosting" introduced by Freund and Schapire (1995, 1996).

Unlike bagging, arcing may give different weights to the individual models. Specifying the weights and the probabilities for sampling a case completely specifies an arcing algorithm. No particular specification seems generally superior to others in practice. Arcing applies only to categorical targets.

**Boosting**

The term "boosting" is much more commonly used than "arcing" and generally means the same thing.

However, Friedman, Hastie, and Tibshirani (1998, 1999) have reformulated the concept without reformulating a new name. Boosting, as reformulated, creates an ensemble consisting of a weighted average of models fit to data in which the target values are changed in a special way. In all other respects, the data are the same for all models.

As an example, consider minimizing the sum of squared errors between an interval target and a prediction from an ensemble. The first individual model is fit to the original data. For the training data set for the second model, subtract the predicted value from the original target value, and use the difference as the target value to train the second model. For the third training set, subtract a weighted average of the predictions from the original target value, and use the difference as the target value to train the third model, and so on.

The weights used in averaging the individual models and the prescription for creating a target value depend on the measure of loss associated with the problem. The prescription just described is used when minimizing the sum of squared errors. A different prescription is used when minimizing the sum of absolute errors. Thus, the measure of loss determines the particulars of a boosting algorithm.

Well-Known Algorithms In conversations about trees, someone is apt to refer to such-and-such algorithm. The following list contains the best known algorithms and describes how they address the main issues in this paper. Each algorithm was invented by a person or group inspired to create something better than what currently existed. Some of their opinions on strategies for creating a good tree are included. The last entry is SAS algorithms. The SAS software lets the user mix some of the better ideas of the well-known programs. It is still being improved.

**AID, SEARCH**

James Morgan and John Sonquist proposed decision tree methodology in an article in the Journal of the American Statistical Association in 1963. They cite similar ideas of Belson (1959) and others. Their motivation has already been discussed under "Predictive modeling" in the section, "What to Do with a Tree.” Their original program was called AID, which is an acronym for "automatic interaction detection."

AID represents the original tree program in the statistical community. The third version of the program was named SEARCH and was in service from about 1971 to 1996.

The user's manual for SEARCH (1971, 1975) discusses experiments with tree methodology. The most successful idea was to specify an independent variable for a linear regression, and to search for splits that either optimize the regression fit in the branches or most differentiate slopes of the regression lines in the branches. Often an input variable "so dominates the dependent variable that the data are split on little else."

The regression technique removes the dominant effect and reveals what else matters.