# «Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

Decision Trees for Predictive Modeling

Padraic G. Neville SAS Institute Inc. 4 August 1999

What a Decision Tree Is...................2

What to Do with a Tree................... 3

Variable selection

Variable importance

Interaction detection

Stratified modeling

Missing value imputation

Model interpretation

Predictive modeling

How to Create a Tree..................... 8

An easy algorithm

Selection of a splitting variable

Number of branches Elusive best splits Recursive partitioning Stopping Pruning Multivariate splits Missing values What Can Go Wrong.....................14 Too good a fit Spurious splits due to too many variables Spurious splits due to too many values The Frontier............................15 Combining models Ensembles Unstable algorithms Bagging Arcing Boosting Well-Known Algorithms..................18 AID, SEARCH CHAID Classification and Regression Trees ID3, C4.5, C5 QUEST OC1 SAS algorithms References.............................22 What a Decision Tree Is A decision tree as discussed here depicts rules for dividing data into groups. The first rule splits the entire data set into some number of pieces, and then another rule may be applied to a piece, different rules to different pieces, forming a second generation of pieces. In general, a piece may be either split or left alone to form a final group.

The tree depicts the first split into pieces as branches emanating from a root and subsequent splits as branches emanating from nodes on older branches. The leaves of the tree are the final groups, the unsplit nodes. For some perverse reason, trees are always drawn upside down, like an organizational chart. For a tree to be useful, the data in a leaf must be similar with respect to some target measure, so that the tree represents the segregation of a mixture of data into purified groups.

Consider an example of data collected on people in a city park in the vicinity of a hotdog and ice cream stand. The owner of the concession stand wants to know what predisposes people to buy ice cream. Among all the people observed, forty percent buy ice cream. This is represented in the root node of the tree at the top of the diagram. The first rule splits the data according to the weather. Unless it is sunny and hot, only five percent buy ice cream. This is represented in the leaf on the left branch. On sunny and hot days, sixty percent buy ice cream. The tree represents this population as an internal node that is further split into two branches, one of which is split again.

40% Buy ice cream

The tree displayed has four leaves. The data within a leaf are much more similar with respect to buying ice cream than the overall data with the 40/60 mix. According to the tree, people almost never buy ice cream unless the weather cooperates and either (1) they have some extra money to spend or (2) they crave the ice cream and presumably spend money irresponsibly or figure out another way of buying it. The tree is easy to understand even if the implication of craving ice cream is left to the imagination. Simple decision trees are appealing because of their clear depiction of how a few inputs determine target groups.

Trees are also appealing because they accept several types of variables: nominal, ordinal, and interval.

Nominal variables have categorical values with no inherent ordering. Ordinal variables are categorical with ordered values, for example: `cold', `cool', `warm', and `hot'. Interval variables have values that can be averaged. Temperature is an interval variable when its values are expressed in degrees. A variable can have any type regardless of whether it serves as the target or as an input. (The input variables are those available for use in splitting rules.) Other virtues of trees include their robustness to missing values and distribution assumptions about the inputs.

Trees also have their short comings. When the data contain no simple relationship between the inputs and the target, a simple tree is too simplistic. Even when a simple description is accurate, the description may not be the only accurate one. A tree gives an impression that certain inputs uniquely explain the variations in the target. A completely different set of inputs might give a different explanation that is just as good. The issues of descriptive accuracy, uniqueness, and reliability of prediction are extensively discussed in this paper.

The adjective 'decision' in "decision trees" is a curious one and somewhat misleading. In the 1960s, originators of the tree approach described the splitting rules as decision rules. The terminology remains popular. This is unfortunate because it inhibits the use of ideas and terminology from "decision theory". The term "decision tree" is used in decision theory and project management to depict a series of decisions for choosing alternative activities. The analyst creates the tree and specifies probabilities and benefits of outcomes of the activities. The software finds the most beneficial path. The project follows a single path and never performs the unchosen activities.

Decision theory is not about data analysis. The choice of a decision is made without reference to data. The trees in this article are only about data analysis. A tree is fit to a data set to enable interpretation and prediction of data. An apt name would be: data splitting trees.

What to Do with a Tree Prediction is often the main goal of data analysis. Creating a predictive model is not as automatic as one might hope. This section describes some of the practical hurdles involved and how trees are sometimes used to overcome them.

Variable selection Data often arrive at the analyst's door with lots of variables. The baggage sometimes includes a dictionary that makes uninteresting reading. Yet the analyst must find something interesting in the data. Most of the variables are redundant or irrelevant and just get in the way. A preliminary task is to determine which variables are likely to be predictive.

A common practice is to exclude input (independent) variables with little correlation to the target (dependent) variable. A good start perhaps, but such methods take little notice of redundancy among the variables or of any relationship with the target involving more than one input. Nominal variables are handled awkwardly.

An alternative practice is to use inputs that appear in splitting rules of trees. Trees notice relationships from the interaction of inputs. For example, buying ice cream may not correlate with having extra money unless the weather is sunny and hot. The tree noticed both inputs. Moreover, trees discard redundant inputs.

Sunny-and-hot and ozone-warnings might both correlate with buying ice cream, but the tree only needs one of the inputs. Finally, trees treat nominal and ordinal (categorical) inputs on a par with interval (continuous) ones. Instead of the categorical sunny-and-hot input, the split could be at some value of temperature, an interval input.

The analyst would typically use the selected variables as the inputs in a model such as logistic regression. In practice trees often provide far fewer variables than seem appropriate for a regression. The sensible solution is to include some variables from another technique, such as correlation. No single selection technique is capable of fully prophesizing which variables will be effective in another modeling tool.

Hard-core tree users find ways to get more variables out of a tree. They know how to tune tree creation to make large trees. One method uses the variables in 'surrogate' splitting rules. A surrogate splitting rule mimics a regular splitting rule. Surrogates are designed to handle missing values and are discussed more in that context. Using them to select variables is an afterthought. A surrogate variable is typically correlated with the main splitting variable, so the selected variables now have some redundancy.

**Variable importance**

The analyst may want the variable selection technique to provide a measure of importance for each variable, instead of just listing them. This would provide more information for modifying the selected list to accommodate other considerations of the data analysis. Intuitively, the variables used in a tree have different levels of importance. For example, whether or not a person craves ice cream is not as important as the weather in determining whether people will buy ice cream. While craving is a strong indication for some people, sunny and hot weather is a good indicator for many more people. What makes a variable important is the strength of the influence and the number of cases influenced.

A formulation of variable importance that captures this intuition is as follows: (1) Measure the importance of the model for predicting an individual. Specifically, let the importance for an individual equal the absolute value of the difference between the predicted value (or profit) of the individual with and without the model.

(2) Divide the individual importance among the variables used to predict the individual, and then (3) average the variable importance over all individuals.

To divide the importance for an individual among the variables (2), note that only the variables used in splitting rules that direct the individual from the root down the path to the leaf should share in the importance. Apportion the individual importance equally to the rules in the path. If the number of such rules is D, and no two rules use the same variable, then each variable is accredited with 1/D times the individual importance. This formulation ensures that the variable used to split the root node will end up being the most important, and that a variable gets less credit for individuals assigned to leaves deeper down the tree. Both consequences seem appropriate.

Currently no software implements this formulation of importance. Some software packages implement an older formulation that defines the importance of a splitting rule, credits that importance to the variable used in the rule, and then sums over all splitting rules in the tree. For an interval target, the importance of a split is the reduction in the sum of squared errors between the node and the immediate branches. For a categorical target, the importance is the reduction in the Gini index. Refer to Breiman et al. (1984) and Friedman (1999) for discussions about this approach.

Interaction detection Having selected inputs to a regression, the analyst will typically consider possible interaction effects.

Consider modeling the price of single family homes. Suppose that the prices of most houses in the data set are proportional to a linear combination of the square footage and age of the house, but houses bordering a golf course sell at a premium above what would be expected from their size and age. The best possible model would need a golf course indicator as input. Data rarely come with the most useful variables.

However, it seems plausible that the houses bordering the golf course are about the same size and were built about the same time. If none of the other houses of that size were built during that time, then that size and time combination provides an indication of whether the house borders the golf course. The regression should contain three variables: square footage, age, and the golf course indicator. The indicator is constructed from square footage and age and, therefore, represents an interaction between those two inputs.

The analyst of course is not told anything about any golf course, much less given any clues as to how to construct the indicator. The seasoned analyst will suspect that key information has to be inferred indirectly, but what and how is guess work. He will try multiplying size and age. No luck. He might then try creating a tree and creating an indicator for each leaf. For a particular observation, the indicator equals one if the observation is in the leaf and equals zero otherwise. The regression will contain square footage, age, and several indicators, one for each leaf in the tree. If the tree creates a leaf with only the houses bordering the golf course, then the analyst will have included the right interaction effects. The indicators for the other leaves would not spoil the fit. Indicators for nonleaf nodes are unnecessary because they equal the sum of indicators of their descendents.

This technique works when the influence of the interaction is strong enough, but it does not work otherwise.

Consider a data set containing one target Y and one input X that are related through the equation

where a and b are constants, and W is like the golf course indicator: it takes values 0 and 1, it is omitted from the data set, but it is 1 if and only if X is in some interval. The analyst must infer the influence of W on Y using only X.

The diagram illustrates the situation in which W equals one when X is near the middle of its range. The tree creation algorithm searches for a split on X that best separates data with small values of Y from large ones.

If b is large enough, values of Y with W = 1 will be larger than values of Y with W = 0, and the algorithm will make the first two splits on X so as to isolate data with W = 1. A tree with three leaves appears in which one leaf represents the desired indicator variable.

** Y**

X When b is smaller so that data with extreme values of Y have W = 0, the data with W = 1 have less influence on the algorithm. A split that best separates extreme values of Y is generally different from a split that best detects interactions in linear relationships. A tree creation algorithm specially designed for discovering interactions is needed. Otherwise, even if the W = 1 data is eventually separated, the tree will have many superfluous leaves.

Stratified modeling The analyst preparing for a regression model faces another hidden pitfall when the data represent two populations. A different relationship between an input and the target may exist in the different populations.

In symbols, if Y = a + b X + e and Y = a + c X + e express the relationship of Y to X in the first and second population respectively, and b is very different from c, then one regression alone would fit the combined data poorly.