# «Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

The problem for the analyst is to recognize the need to perform two regressions instead of one. For this purpose, some analysts first create a small decision tree from the data, and then run a separate regression in each leaf. This is called stratified regression. Unfortunately, the tree usually will not split data the way the analyst hopes.

Consider a data set with target Y and inputs X and W, where W=0 for one population, and W=1 for the other. Suppose Y = (a + b X) W + (c + d X) ( 1 - W) for some constants a, b, c, and d. The analyst will apply the tree algorithm to split the data and then fit a regression in each leaf. He hopes that the algorithm will split on W. The predicament is similar to that described for detecting interactions. The tree algorithm will try to separate extreme values of Y. If the values of Y in one population are all larger than those in the other population, then the algorithm will split on W. In the other extreme, if the average of Y is the same in the two populations, the algorithm will split on X. A standard tree creation algorithm separates the populations in the left diagram but not those in the right one. An algorithm designed specifically for stratified regression is needed. Alexander and Grimshaw (1996) and Neville (1999) discuss such algorithms.

The analyst may have to contend with missing values among the inputs. Decision trees that split on one input at a time are more tolerant to missing data than models such as regression that combine several inputs.

When combining several inputs, an observation missing any input must be discarded. For the simplest of tree algorithms, only observations that need to be excluded are those missing the input currently being considered to split on. They can be included when considering splitting on a different input.

If twenty percent of the data are missing at random, and there are ten inputs, then a regression can use only about ten percent of the observations, while a simple split search algorithm will use about eighty percent.

Algorithms that treat missing observations as a special value will use all the observations.

Common tree algorithms that exclude missing data during the split search handle them intelligently when predicting a new observation. The program of Breiman et al. (1984) applies surrogate rules as a backup when missing data prevent the application of the regular splitting rule. Ross Quinlan's (1993) C4.5 and C5 programs will take a weighted average of the predictions across all branches where missing data prevented a definitive assignment.

An analyst preparing for a regression with missing data might first replace the missing values with guesses.

This is called imputing. A natural approach is to fit a model to the nonmissing values to predict the missing ones.

Trees may be the best modeling tool for this purpose because of their tolerance to missing data, their acceptance of different data types, and their robustness to assumptions about the input distributions. For each regression input, construct at tree that uses the other inputs to predict the given one. If X, Y and Z represent the inputs, create a tree to predict X from Y and Z, another tree to predict Y from X and Z, and another to predict Z from X and Y.

While trees are probably the best choice in an imputation approach based on prediction, other approaches for handling missing values may be better for a given situation. For example, the variance of estimates based on imputed data may be smaller than it would be if none of the data were missing, depending on how the imputation is done. Imputing with random values drawn from a distribution with the correct variance may be best when the variances matter.

**Model interpretation**

Trees are sometimes used to help understand the results of other models. An example occurs in market research. A company may offer many products. Different customers are interested in different products.

One task of market research is to segregate the potential customers into homogeneous segments and then assign marketing campaigns to the segments. Typically, no response information is available on the customers and so no target variable exists.

Segmentation is based on similarities between input variables. People differ somewhat in their purchasing behavior depending on demographics: their age, family status, and where they live. Demographic information is relatively easy to obtain, and missing data can often be imputed using census information.

The segments are intentionally developed using variables for which complete data are available so that everybody can be assigned to a segment.

After the segments are built, the average age, income, and other statistics are available for each segment.

However, these demographic statistics are not very suggestive of what products to market. The next step is to select a sample from each segment and ask the people about their life-style and product preferences.

Finally, combine the samples from all the segments into one data set and create a tree using the survey questions as inputs and the segment number as the target. Using just a few segments with an equal number of people in each gives the best chance of obtaining a useful tree. With a little luck, the tree will characterize some segments by the type of clothing, cars, or hobbies that suggest what products people in the segment would like to purchase.

**Predictive modeling**

This section has described how to use trees to overcome some hurdles in predictive modeling. In each example, the tree helps prepare the data or interpret the results of another predictive model. Actually, none of the inventors of tree algorithms were motivated by such supportive roles.

Many individuals have come up with innovative tree creation algorithms. Important ones come from Morgan and Sonquist (1963), Kass (1980), Breiman et al. (1984), and Quinlan (1979). The disparate approaches rival each other. Yet the originators share the common inspiration that trees by themselves are effective predictive models. Each author can describe studies in which trees simply out perform other predictive techniques.

Consider the circumstances that motivated Morgan and Sonquist. They analyzed data in search of determinants of social conditions. In one example, they tried to untangle the influence of age, education, ethnicity, and profession on a person's income. Their best regression contained 30 terms (including interactions) and accounted for 36 percent of the variance.

As an alternative to regression, they organized the observations into 21 groups. The income of an observation in a group was estimated by the group mean. The groups were defined by values on two or three inputs. Nonwhite high school graduates had a mean income of $5,005. White farmers who did not finish high school had a mean income of $3,950, and so on. This method of prediction accounted for 67 percent of the variance.

The study is interesting because it reveals the inadequacy of regression to discern some relationships in data.

Of course every statistician knows this, but Morgan and Sonquist showed how common the problem is in social research and how easily trees get around it. They developed this point in a 1963 article in the Journal of the American Statistical Association. They then created the first statistical tree software and called it AID. The program and its successors stayed in service at the Survey Research Center for more than 30 years.

Trees do not supercede other modeling techniques. Different techniques do better with different data and in the hands of different analysts. However, the winning technique is generally not known until all the contenders get a chance. Trees should contend along with the others. Trees are easy.

How to Create a Tree The simplicity of a decision tree belies the subtleties of creating a good one. A comparable task of fitting a parametric statistical model to data often consists of formulating a log likelihood and solving it numerically.

Numerical optimization for trees is infeasible.

On the other hand, tree creation algorithms are easy to invent. Indeed, in the absence of mathematical theorems for judging the superiority of one method over another, many competing algorithms are available.

Despite this anarchy, the qualities that characterize a good tree model are the same as those of a parametric one. A parametric model assumes that the data share a common relationship between the inputs and the target, and assumes that the relationship is obscured by idiosyncrasies of individual cases (ineptly called "errors"). A good model reveals a relationship that, depending on the needs of the analyst, either accurately describes the data, accurately predicts the target in similar data, or provides insight. Mathematics forms the basis for trusting the model: if the analyst believes the assumptions of the model, he believes the results.

Trees are the same in all respects except one: trust in the model is gained from applying it to a fresh sample.

The following discussion surveys the anarchy of tree creation algorithms and points out what to consider for creating good trees. Murthy (1998) gives a much more extensive survey of algorithmic ideas.

An easy algorithm Inventing a new algorithm for creating a tree is easy. Here is one.

Find the input variable X that most highly correlates with the target. Tentatively split the values of X into two groups and measure the separation of the target values between the groups. For an interval target, use a Student’s t statistic as the measure. For a categorical target, use a chi-square measure of association between the target values and the groups. Find a split on X with the largest separation of target values. Use this split to divide that data. Repeat the process in each group that contains at least twenty percent of the original data. Do not stop until every group of adequate size that can be divided is divided.

For many data sets, this algorithm will produce a useful tree. However, its useful scope is limited. Several choices inherent in the algorithm may have more effective alternatives. The most dangerous choice is the stopping rule that may result in predictions that are worse than having no tree at all. The remainder of this

**section discusses the following topics:**

◊ Selection of a splitting variable ◊ Number of branches from a split ◊ Elusive best splits ◊ Stepwise, recursive partitioning ◊ Stopping or pruning ◊ Multivariate splits ◊ Missing values Selection of a splitting variable The easy algorithm presented in the preceding section first selects a splitting variable and then searches for a good split of its values. Almost all the popular algorithms search for a good split on all the inputs, one at a time, and then select the input with the best split. This seems more reasonable because the goal is to find the best split.

However, Loh and Shih (1997) report that splitting each variable significantly biases the selection towards nominal variables with many categories. Moreover, the algorithms that search for a good split on each variable take longer. Neither approach dominates in terms of classification accuracy, stability of split, or size of tree. Thus, the issue may be of concern if the goal of the tree creation is interpretation or variable selection, and not prediction.

Loh and Shih did not consider the possibility of searching for a split on each input and then penalizing those splits that are prone to bias when comparing the different inputs. The CHAID algorithm due to Gordon Kass (1980) does this by adjusting p-values.

Number of branches The easy algorithm always splits a node into two branches so as to avoid having to decide what an appropriate number of branches would be. But this easy approach poorly communicates structure in the data if the data more naturally split into more branches. For example, if salaries are vastly different in California, Hawaii, and Nebraska, then the algorithm ought to separate the three states all at once when predicting salaries. Gordon Kass seemed to think so when he commented that binary splits are often misleading and inefficient.

On the other hand, several practitioners advocate using binary splits only. A multiway split may always be accomplished with a sequence of binary splits on the same input. An algorithm that proceeds in binary steps has the opportunity to split with more than one input and thus will consider more multistep partitions than an algorithm can consider in a single-step multiway split. Too often the data do not clearly determine the number of branches appropriate for a multiway split. The extra branches reduce the data available deeper in the tree, degrading the statistics and splits in deeper nodes.

Elusive best splits The easy algorithm blithely claims it finds the split on the selected variable that maximizes some measure of separation of target values. Sometimes this is impossible. Even when it is possible, many algorithms do not attempt it.

To understand the difficulty, consider searching for the best binary split using a nominal input with three categories. The number of ways of assigning three categories to two branches is two times two times two, or eight. The order of the branches does not matter, so only half of the eight candidate splits are distinct. More generally, the number of distinct binary splits of C nominal categories is two times itself C-1 times, written 2^(C-1). Many more splits are possible on inputs with more nominal categories, and this creates a bias towards selecting such variables.

When the target is binary or interval, the best split can be found without examining all splits: order the categories by the average of the target among all observations with the same input category, then find the best split maintaining the order. This method works only for binary splits with binary or interval targets.

For other situations, all the candidate splits must be examined.