# «Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

Many available algorithms do not attempt to find the best split on a variable. Such methods of searching are called heuristic because they do not guarantee optimality. For a nominal input variable, a common theme is first to form a split by assigning a different branch to each input category, and then to derive a sequence of candidate splits by combining two branches of one split to produce the next split in the sequence. Different algorithms differ in the details and variations on this theme. Most heuristic algorithms adopt the best split examined. The original CHAID algorithm may be the only exception. CHAID adopts the last split examined.

**Recursive partitioning**

The easy algorithm first splits the entire data set, the root node, into two nodes. Each node is then split into more nodes, and so on. Each node in its turn is considered in isolation. The search for a split is done without any concern as to how another node is to be split. This process is called recursive partitioning and the great majority of tree creation algorithms do it.

Recursive partitioning is myopic: it focuses on optimizing individual splits and pays no attention to the quality of the entire tree. Alternatives to recursive partitioning have not become popular though. They are hard to implement and take much longer to run. Moreover, recursive partitioning generally gives adequate results.

The first attempt to cure the myopia is to "look ahead one generation" before committing to a split. A candidate split creates two branches. A binary split is then found in each branch. The evaluation of the candidate split is based on the four nodes in the second generation. James Morgan and John Sonquist implemented look-ahead split searches in the AID program in the 1960s. The user's manual by Sonquist et al. (1971) indicates that looking ahead does not help much. A recent, more extensive study by Murthy came to the same conclusion and reported that look-ahead splits sometimes created worse trees.

Chipman, George, and McCullogh (1998) describe an algorithm that searches for an optimal tree. Instead of evaluating candidate splits, their algorithm evaluates candidate trees. One tree is derived from another using a simple operation. The operation is selected at random from a set of possible operations. An example of an operation is selecting a particular variable and forming a random split on that variable. Another operation is to prune a split, transforming an internal node into a leaf. Thus, some operations create larger trees, some create smaller trees. The algorithm is sometimes successful for small trees. Whether the algorithm can examine enough large trees to be useful remains to be seen.

Kristin Bennett (1994) implemented an algorithm for finding globally optimal splits in a tree given the tree structure. The structure consists of the arrangement of nodes and the predicted values assigned to each leaf.

The splits are initially undefined, and therefore the nodes are not initially associated with data. The algorithm defines splits that segregates the data into the nodes in a way that is most consistent with the preassigned predictions in the leaves. A recursive partitioning algorithm can create an initial structure.

**Stopping**

The size of a tree may be the most important single determinant of quality, more important, perhaps, than creating good individual splits. Trees that are too small do not describe the data well. Trees that are too large have leaves with too little data to make any reliable predictions about the contents of the leaf when the tree is applied to a new sample. Splits deep in a large tree may be based on too little data to be reliable.

The easy algorithm tries to avoid making trees too small or too large by splitting any node that contains at least twenty percent of the original data. This strategy is easily foiled. Twenty percent of the data will be too few or too many when the original data contain few or many cases. With some data sets, the algorithm will split a node into branches in which one branch has only a couple of cases. Many splits are required to whittle the thick branches down to the limit of twenty percent, so most of the data will end up in tiny leaves.

**Certain limitations on partitioning seem necessary:**

◊ the minimum number of cases in a leaf ◊ the minimum number of cases required in a node being split ◊ the depth of any leaf from the root node Appropriate values for these limits depend on the data. For example, a search for a reliable split needs less data the more predictable the data are. A stopping rule that accounts for the predictive reliability of the data seems like a good idea. For this purpose, the CHAID algorithm applies a test of statistical significance to candidate splits. Partitioning stops when no split meets a threshold level of significance. Unfortunately, this stopping rule turns out to be problematic in three related aspects: choice of statistical test, accommodations for multiple comparisons, and the choice of a threshold.

A test of significance is problematic first because a test must assume some joint distribution of the target and an input. The assumptions become important in the smaller nodes, which are the more numerous down the tree. The appropriate distribution is generally unknown. Worse, the distributions will differ in different nodes when the tree segregates different subpopulations down different branches.

For algorithms that search for a split on every input variable, the test of significance is performed on the maximum value of the splits on the various inputs. Using the distribution of the extreme values of V test statistics, where V equals the number of inputs, is more appropriate than using the distribution for a single test statistic. Such tests seem impractical for computational reasons. Even if an appropriate test were used, the test would be applied thousands of times in the same tree: once for each input in each node. These criticisms of using significance tests persuade some people but not others. Advocates of the tests adjust the threshold significance levels to accommodate multiple tests.

Pruning Many practitioners contend that stopping rules cannot work. The inherent fallacy is the assumption that an appropriate threshold may be set without much understanding of the data.

The alternative is to grow a tree that is too large, and then prune the tree to the right size. All candidate subtrees pruned from the large tree are available at the same time for comparison, and this gives retrospective pruning an advantage over a stopping rule that can consider only a single node at one time. The pruning process may evaluate entire subtrees instead of individual splits. An evaluation measure may be chosen that is more relevant to the end use of the tree than the splitting criterion. The proportion of cases misclassified is a common choice.

Stopping or pruning is necessary to avoid over-fitting the data. A model "over-fits" when the good fit to the training data (the data used to create the model) is not replicated when the model is applied to a different sample. Ideally, the pruning process uses a separate "pruning data set" to evaluate sub-trees. Subtrees that over-fit the training data will evaluate poorly on the pruning data. Ross Quinlan (1987) may have published the first algorithm that finds the subtree with the smallest error on pruning data. He calls it reduced-error pruning.

The original pruning algorithm is due to Breiman et al. (1984). Evaluating all possible subtrees may have been too time consuming in the 1970s, and so the authors found a fast method of creating a sequence of subtrees of different sizes. Each subtree is a subtree of all the larger ones in the sequence. More importantly, each subtree obtains the optimum assessment among all possible subtrees of its size, based on the training data. Each subtree in the sequence is then applied to a fresh sample, and the larger tree is pruned to the winning subtree. This method is called cost-complexity pruning, and results in different trees than reducederror pruning.

Sometimes tree creation must use all the available data, so no pruning data exist. Ross Quinlan's C4.5 and C5 artificially inflate the error rate of the training data in each node and evaluate subtrees with the inflated rate. This is called pessimistic pruning. Breiman et al. use a cross validation method to choose the subtree from their nested sequence of subtrees.

** Multivariate splits**

The easy algorithm uses a single input to define a splitting rule. Such univariate splits are easy to understand, are tolerant of missing data (because missing values in any other input are of no concern), and implicitly select a small number of inputs to create an entire tree, thereby reducing subsequent demands on data collection and analysis.

Defining splitting rules with many variables also has merit. The OC1 algorithms of Murthy et al. (1994) and the SVM tree algorithms of Bennett (1997) both base splits on linear combinations of variables and produce trees that are often more accurate and smaller than univariate trees. Smaller trees are easier to understand.

Smaller trees segregate the data into fewer leaves. Describing the data in a leaf in terms of the variables defining the leaf is cumbersome with many defining variables, whether the leaf is in a large tree with univariate splits or a small tree with multivariate splits. The leaves may be easier to understand by plotting the distribution of variables in the leaves, as done in clustering. Having fewer leaves to profile is helpful.

**Missing values**

Decision trees accommodate missing values very well compared to other modeling methods. The simplest strategy is to regard a missing value as a special nonmissing value. For a nominal input, missing values simply constitute a new categorical value. For an input whose values are ordered, the missing values constitute a special value that is assigned a place in the ordering that yields the best split. The place is generally different in different nodes of the tree.

The strategy pays off when missing values are indicative of certain target values. For example, people with large incomes might be more reluctant to disclose their income than people with ordinary incomes. If income were predictive of a target, then missing income would be predictive of the target, and the missing values would be regarded as a special large income value. The strategy seems harmless when the distribution of missing values is uncorrelated with the target because no choice of branch for the missing values would help predict the target.

Regarding missing values as a special value is sometimes inappropriate. If a large proportion of values is missing, then they may unduly influence the creation of the split. Evaluating a split using only known information improves credibility. Excluding observations with missing values is feasible with univariate trees because only observations missing on a single input are excluded at any one time. This compares favorably with other modeling techniques that exclude observations missing any input value, which may leave very few usable observations.

When missing values are excluded from the split search, a new policy is needed for assigning missing observations to branches. One policy is to use surrogate splitting rules. A surrogate rule substitutes for the main rule when the main rule cannot handle an observation. A good surrogate rule is one that mimics the main rule very well, even if it does not define a good split in its own right. For example, the main rule might split on county, and the surrogate might split on region. The surrogate applies to observations with an unknown county and a known region. The surrogate might be less effective as a main splitting rule because region represents coarser information than county.

The surrogate policy relies on redundant inputs. Another policy is needed when no good surrogate exists.

Assigning all the observations with missing values to a single branch is apt to reduce the purity of the branch, thereby degrading the split. If this is unavoidable, assigning the largest branch results in the least dilution of node purity. Another idea is to split an individual observation and to assign the pieces to all branches in proportion to the size of each branch. For a binary split, the observation is replaced by two observations with fractional sizes. One fractional observation is assigned to each branch. The prediction for an observation is the weighted average of predictions of the derived fractional observations.

What Can Go Wrong Trees may deceive. They may fit the data well but then predict new data worse than having no model at all.

This is called over-fitting. They may fit the data well, predict well, and convey a good story, but then, if some of the original data are replaced with a fresh sample and a new tree is created, a completely different tree may emerge using completely different inputs in the splitting rules and consequently conveying a completely different story. This is called instability.

A little education renders all this harmless. Over-fitting is the more serious issue. The remainder of this section discusses how to detect and avoid it. Instability is a simpler issue. It occurs when two inputs have similar predictive ability. The tree can choose only one. The story conveyed is incomplete. The solution is simply to understand that the story is too simplistic when conflicting variables have comparable predictive ability. Instability is actually beneficial for the new prediction techniques based on re-sampling. The phenomenon is discussed fully in the section “The Frontier.”

**Too good a fit**

Over-fitting occurs when the tree fits random variations of the target values in the training data that are not replicated in other samples. The tree fits the training data deceptively well. Over-fitting could be largely due to a single spurious split or to the accumulation of small errors from many splits. Splits generally perform better on the training data than on new data. Even a split representing a `true' prediction performs slightly worse on new data because some randomness exists in the data and the split fits some of that.

Smaller nodes are more prone to mistakes.