
Decision Trees for Predictive Modeling

Padraic G. Neville, SAS Institute Inc., 4 August 1999


The problem for the analyst is to recognize the need to perform two regressions instead of one. For this purpose, some analysts first create a small decision tree from the data, and then run a separate regression in each leaf. This is called stratified regression. Unfortunately, the tree usually will not split data the way the analyst hopes.

Consider a data set with target Y and inputs X and W, where W = 0 for one population and W = 1 for the other. Suppose Y = (a + bX)W + (c + dX)(1 - W) for some constants a, b, c, and d. The analyst will apply the tree algorithm to split the data and then fit a regression in each leaf. He hopes that the algorithm will split on W. The predicament is similar to that described for detecting interactions. The tree algorithm will try to separate extreme values of Y. If the values of Y in one population are all larger than those in the other population, then the algorithm will split on W. In the other extreme, if the average of Y is the same in the two populations, the algorithm will split on X. A standard tree creation algorithm separates the populations in the left diagram but not those in the right one. An algorithm designed specifically for stratified regression is needed. Alexander and Grimshaw (1996) and Neville (1999) discuss such algorithms.
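The two extremes can be checked numerically. The following is a minimal sketch, not any published algorithm: it assumes a split is scored by the reduction in the sum of squared errors of Y, and the constants, the grid of X values, and the function names are all hypothetical choices made for illustration.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def reduction(rows, goes_left):
    """Drop in SSE of Y obtained by dividing rows with the rule goes_left."""
    left = [y for x, w, y in rows if goes_left((x, w, y))]
    right = [y for x, w, y in rows if not goes_left((x, w, y))]
    return sse([y for _, _, y in rows]) - sse(left) - sse(right)


def make_data(a, b, c, d):
    """Rows (X, W, Y) with Y = (a + bX)W + (c + dX)(1 - W) on a grid of X."""
    xs = [i / 10 for i in range(-10, 11)]
    return [(x, w, (a + b * x) * w + (c + d * x) * (1 - w))
            for x in xs for w in (0, 1)]


def chosen_input(rows):
    """Name of the input whose best binary split reduces SSE the most."""
    gain_w = reduction(rows, lambda r: r[1] == 0)
    cuts = sorted({r[0] for r in rows})[1:]
    gain_x = max(reduction(rows, lambda r, t=t: r[0] < t) for t in cuts)
    return "W" if gain_w > gain_x else "X"


# Populations far apart in Y: the split falls on W.
print(chosen_input(make_data(a=10, b=1, c=0, d=1)))  # W
# Same average Y in both populations, different slopes: the split falls on X.
print(chosen_input(make_data(a=0, b=1, c=0, d=3)))   # X
```

In the second case the two populations still need separate regressions, yet a split on W reduces the error by essentially nothing, so the search never finds it.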


The analyst may have to contend with missing values among the inputs. Decision trees that split on one input at a time are more tolerant of missing data than models such as regression that combine several inputs.

When combining several inputs, an observation missing any input must be discarded. For the simplest of tree algorithms, the only observations that need to be excluded are those missing the input currently being considered for the split. They can be included when considering a split on a different input.

If twenty percent of the values of each input are missing at random, independently across inputs, and there are ten inputs, then a regression can use only about ten percent of the observations (0.8 to the tenth power is about 0.107), while a simple split search algorithm will use about eighty percent.
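The arithmetic behind these figures can be sketched as follows, assuming each input is missing independently with the same probability. The function names are hypothetical.

```python
def usable_fraction_regression(missing_rate: float, n_inputs: int) -> float:
    """Fraction of observations with no input missing (complete cases)."""
    return (1.0 - missing_rate) ** n_inputs


def usable_fraction_single_split(missing_rate: float) -> float:
    """Fraction of observations usable when searching for a split on one input."""
    return 1.0 - missing_rate


print(round(usable_fraction_regression(0.2, 10), 3))  # 0.107
print(usable_fraction_single_split(0.2))              # 0.8
```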

Algorithms that treat missing observations as a special value will use all the observations.

Common tree algorithms that exclude missing data during the split search handle them intelligently when predicting a new observation. The program of Breiman et al. (1984) applies surrogate rules as a backup when missing data prevent the application of the regular splitting rule. Ross Quinlan's (1993) C4.5 and C5 programs will take a weighted average of the predictions across all branches where missing data prevented a definitive assignment.

An analyst preparing for a regression with missing data might first replace the missing values with guesses. This is called imputing. A natural approach is to fit a model to the nonmissing values to predict the missing ones.

Trees may be the best modeling tool for this purpose because of their tolerance to missing data, their acceptance of different data types, and their robustness to assumptions about the input distributions. For each regression input, construct a tree that uses the other inputs to predict the given one. If X, Y and Z represent the inputs, create a tree to predict X from Y and Z, another tree to predict Y from X and Z, and another to predict Z from X and Y.

While trees are probably the best choice in an imputation approach based on prediction, other approaches for handling missing values may be better for a given situation. For example, the variance of estimates based on imputed data may be smaller than it would be if none of the data were missing, depending on how the imputation is done. Imputing with random values drawn from a distribution with the correct variance may be best when the variances matter.
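One way the variance can shrink is easy to see with the crudest imputation, filling every gap with the mean of the observed values. The sketch below uses made-up numbers and Python's standard statistics module. It illustrates only this one mechanism, not imputation by trees.

```python
from statistics import mean, pvariance

# Hypothetical input column with two of six values missing (None).
observed = [2.0, None, 6.0, 8.0, None, 12.0]

known = [v for v in observed if v is not None]
fill = mean(known)  # every gap receives the same guess
imputed = [fill if v is None else v for v in observed]

# The filled-in values sit exactly at the mean, so they add observations
# without adding spread, and the variance of the completed column drops.
print(pvariance(known))    # 13.0
print(pvariance(imputed))  # about 8.67
```

Drawing each guess at random from the observed values instead of using a single central value is one way to keep the spread closer to that of the nonmissing data.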

Model interpretation

Trees are sometimes used to help understand the results of other models. An example occurs in market research. A company may offer many products. Different customers are interested in different products.

One task of market research is to segregate the potential customers into homogeneous segments and then assign marketing campaigns to the segments. Typically, no response information is available on the customers and so no target variable exists.

Segmentation is based on similarities between input variables. People differ somewhat in their purchasing behavior depending on demographics: their age, family status, and where they live. Demographic information is relatively easy to obtain, and missing data can often be imputed using census information.

The segments are intentionally developed using variables for which complete data are available so that everybody can be assigned to a segment.

After the segments are built, the average age, income, and other statistics are available for each segment.

However, these demographic statistics are not very suggestive of what products to market. The next step is to select a sample from each segment and ask the people about their life-style and product preferences.

Finally, combine the samples from all the segments into one data set and create a tree using the survey questions as inputs and the segment number as the target. Using just a few segments with an equal number of people in each gives the best chance of obtaining a useful tree. With a little luck, the tree will characterize some segments by the type of clothing, cars, or hobbies that suggest what products people in the segment would like to purchase.

Predictive modeling

This section has described how to use trees to overcome some hurdles in predictive modeling. In each example, the tree helps prepare the data or interpret the results of another predictive model. Actually, none of the inventors of tree algorithms were motivated by such supportive roles.

Many individuals have come up with innovative tree creation algorithms. Important ones come from Morgan and Sonquist (1963), Kass (1980), Breiman et al. (1984), and Quinlan (1979). The disparate approaches rival each other. Yet the originators share the common inspiration that trees by themselves are effective predictive models. Each author can describe studies in which trees simply outperform other predictive techniques.

Consider the circumstances that motivated Morgan and Sonquist. They analyzed data in search of determinants of social conditions. In one example, they tried to untangle the influence of age, education, ethnicity, and profession on a person's income. Their best regression contained 30 terms (including interactions) and accounted for 36 percent of the variance.

As an alternative to regression, they organized the observations into 21 groups. The income of an observation in a group was estimated by the group mean. The groups were defined by values on two or three inputs. Nonwhite high school graduates had a mean income of $5,005. White farmers who did not finish high school had a mean income of $3,950, and so on. This method of prediction accounted for 67 percent of the variance.

The study is interesting because it reveals the inadequacy of regression to discern some relationships in data.

Of course every statistician knows this, but Morgan and Sonquist showed how common the problem is in social research and how easily trees get around it. They developed this point in a 1963 article in the Journal of the American Statistical Association. They then created the first statistical tree software and called it AID. The program and its successors stayed in service at the Survey Research Center for more than 30 years.

Trees do not supersede other modeling techniques. Different techniques do better with different data and in the hands of different analysts. However, the winning technique is generally not known until all the contenders get a chance. Trees should contend along with the others. Trees are easy.

How to Create a Tree

The simplicity of a decision tree belies the subtleties of creating a good one. The comparable task of fitting a parametric statistical model to data often consists of formulating a log likelihood and maximizing it numerically.

Numerical optimization for trees is infeasible.

On the other hand, tree creation algorithms are easy to invent. Indeed, in the absence of mathematical theorems for judging the superiority of one method over another, many competing algorithms are available.

Despite this anarchy, the qualities that characterize a good tree model are the same as those of a parametric one. A parametric model assumes that the data share a common relationship between the inputs and the target, and assumes that the relationship is obscured by idiosyncrasies of individual cases (ineptly called "errors"). A good model reveals a relationship that, depending on the needs of the analyst, either accurately describes the data, accurately predicts the target in similar data, or provides insight. Mathematics forms the basis for trusting the model: if the analyst believes the assumptions of the model, he believes the results.

Trees are the same in all respects except one: trust in the model is gained from applying it to a fresh sample.

The following discussion surveys the anarchy of tree creation algorithms and points out what to consider for creating good trees. Murthy (1998) gives a much more extensive survey of algorithmic ideas.

An easy algorithm

Inventing a new algorithm for creating a tree is easy. Here is one.

Find the input variable X that most highly correlates with the target. Tentatively split the values of X into two groups and measure the separation of the target values between the groups. For an interval target, use a Student’s t statistic as the measure. For a categorical target, use a chi-square measure of association between the target values and the groups. Find a split on X with the largest separation of target values. Use this split to divide the data. Repeat the process in each group that contains at least twenty percent of the original data. Do not stop until every group of adequate size that can be divided is divided.
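The steps can be sketched in a few dozen lines. This is one possible reading, under several assumptions not in the text: the target is interval (the chi-square case is omitted), inputs are numeric, splits have the form X < cut, the t statistic uses a pooled variance, and a group whose selected input is constant is treated as indivisible. All names and the toy data are hypothetical.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def correlation(x, y):
    """Pearson correlation, taken as 0 when either variable is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def t_statistic(left, right):
    """Absolute two-sample t statistic with pooled variance."""
    n1, n2 = len(left), len(right)
    if n1 + n2 <= 2:
        return 0.0
    m1, m2 = mean(left), mean(right)
    ss = sum((v - m1) ** 2 for v in left) + sum((v - m2) ** 2 for v in right)
    pooled = ss / (n1 + n2 - 2)
    if pooled == 0:
        return float("inf") if m1 != m2 else 0.0
    return abs(m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))


def grow(rows, target, inputs, n_total, min_frac=0.2):
    """Recursively divide rows; groups under min_frac of the data are leaves."""
    y = [r[target] for r in rows]
    leaf = {"n": len(rows), "prediction": mean(y)}
    if len(rows) < min_frac * n_total:
        return leaf
    # Step 1: pick the input most highly correlated with the target.
    name = max(inputs, key=lambda k: abs(correlation([r[k] for r in rows], y)))
    # Step 2: try every cut point on that input and keep the largest t.
    best = None
    for cut in sorted({r[name] for r in rows})[1:]:
        left = [r for r in rows if r[name] < cut]
        right = [r for r in rows if r[name] >= cut]
        t = t_statistic([r[target] for r in left], [r[target] for r in right])
        if best is None or t > best[0]:
            best = (t, cut, left, right)
    if best is None:  # the selected input is constant here
        return leaf
    _, cut, left, right = best
    # Step 3: divide the data and repeat in each group.
    return {"n": len(rows), "split": (name, cut),
            "left": grow(left, target, inputs, n_total, min_frac),
            "right": grow(right, target, inputs, n_total, min_frac)}


# Toy data: y jumps from 0 to 10 when x reaches 10; z is unrelated noise.
rows = [{"x": i, "z": i % 3, "y": 0.0 if i < 10 else 10.0} for i in range(20)]
tree = grow(rows, "y", ["x", "z"], n_total=len(rows))
print(tree["split"])  # ('x', 10)
```

On the toy data the first split is the useful one. The sketch then keeps dividing groups in which the target no longer varies, because the only stopping rule is group size.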

For many data sets, this algorithm will produce a useful tree. However, its useful scope is limited. Several choices inherent in the algorithm may have more effective alternatives. The most dangerous choice is the stopping rule, which may result in predictions that are worse than having no tree at all. The remainder of this section discusses the following topics:

◊ Selection of a splitting variable
◊ Number of branches from a split
◊ Elusive best splits
◊ Stepwise, recursive partitioning
◊ Stopping or pruning
◊ Multivariate splits
◊ Missing values

Selection of a splitting variable

The easy algorithm presented in the preceding section first selects a splitting variable and then searches for a good split of its values. Almost all the popular algorithms search for a good split on all the inputs, one at a time, and then select the input with the best split. This seems more reasonable because the goal is to find the best split.

However, Loh and Shih (1997) report that searching for a split on every variable significantly biases the selection towards nominal variables with many categories. Moreover, the algorithms that search for a good split on each variable take longer. Neither approach dominates in terms of classification accuracy, stability of split, or size of tree. Thus, the issue may be of concern if the goal of the tree creation is interpretation or variable selection rather than prediction.

Loh and Shih did not consider the possibility of searching for a split on each input and then penalizing those splits that are prone to bias when comparing the different inputs. The CHAID algorithm due to Gordon Kass (1980) does this by adjusting p-values.

Number of branches

The easy algorithm always splits a node into two branches so as to avoid having to decide what an appropriate number of branches would be. But this easy approach poorly communicates structure in the data if the data more naturally split into more branches. For example, if salaries are vastly different in California, Hawaii, and Nebraska, then the algorithm ought to separate the three states all at once when predicting salaries. Gordon Kass seemed to think so when he commented that binary splits are often misleading and inefficient.

On the other hand, several practitioners advocate using binary splits only. A multiway split may always be accomplished with a sequence of binary splits on the same input. An algorithm that proceeds in binary steps has the opportunity to split with more than one input and thus will consider more multistep partitions than an algorithm can consider in a single-step multiway split. Too often the data do not clearly determine the number of branches appropriate for a multiway split. The extra branches reduce the data available deeper in the tree, degrading the statistics and splits in deeper nodes.

Elusive best splits

The easy algorithm blithely claims it finds the split on the selected variable that maximizes some measure of separation of target values. Sometimes this is impossible. Even when it is possible, many algorithms do not attempt it.

To understand the difficulty, consider searching for the best binary split using a nominal input with three categories. The number of ways of assigning three categories to two branches is two times two times two, or eight. The order of the branches does not matter, so only half of the eight assignments, or four, are distinct. More generally, the number of distinct assignments of C nominal categories to two branches is the product of C-1 twos, written 2^(C-1). One of these puts every category in the same branch and so does not divide the data, which leaves 2^(C-1) - 1 genuine splits. The count grows exponentially with the number of categories, and this creates a bias towards selecting inputs with many nominal categories.
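The count can be confirmed by enumeration. In this minimal sketch the function name is hypothetical. It lists only assignments that leave both branches nonempty, so it returns one fewer than 2^(C-1), the omitted case being the assignment that sends every category to the same branch.

```python
from itertools import combinations


def distinct_binary_splits(categories):
    """List each unordered two-branch partition with both branches nonempty."""
    cats = list(categories)
    first, rest = cats[0], cats[1:]
    splits = []
    # Pin the first category to the left branch so that a split and its
    # mirror image are not both listed.
    for k in range(len(rest)):  # k < len(rest) keeps the right branch nonempty
        for extra in combinations(rest, k):
            left = {first, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits


print(len(distinct_binary_splits("ABC")))      # 3, which is 2**2 - 1
print(len(distinct_binary_splits(range(10))))  # 511, which is 2**9 - 1
```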


When the target is binary or interval, the best split can be found without examining all splits: order the categories by the average of the target among all observations with the same input category, then find the best split maintaining the order. This method works only for binary splits with binary or interval targets.
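The shortcut can be compared with an exhaustive search on a small example. The sketch below assumes an interval target and scores a split by the total within-branch sum of squared errors. The data and function names are hypothetical.

```python
from itertools import combinations


def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def split_sse(data, left_cats):
    """Within-branch SSE when categories in left_cats go to the left branch."""
    left = [y for c, y in data if c in left_cats]
    right = [y for c, y in data if c not in left_cats]
    return sse(left) + sse(right)


def best_split_ordered(data):
    """Order categories by mean target, then try only the C-1 ordered cuts."""
    cats = sorted({c for c, _ in data})
    means = {c: sum(y for k, y in data if k == c) /
                sum(1 for k, _ in data if k == c) for c in cats}
    ordered = sorted(cats, key=means.get)
    candidates = [frozenset(ordered[:i]) for i in range(1, len(ordered))]
    return min(candidates, key=lambda s: split_sse(data, s))


def best_split_exhaustive(data):
    """Try every nonempty proper subset of categories as the left branch."""
    cats = sorted({c for c, _ in data})
    candidates = [frozenset(s) for k in range(1, len(cats))
                  for s in combinations(cats, k)]
    return min(candidates, key=lambda s: split_sse(data, s))


# Hypothetical (category, target) pairs.
data = [("a", 1.0), ("a", 2.0), ("b", 9.0), ("b", 8.0),
        ("c", 1.5), ("d", 5.0), ("d", 6.0)]

fast = best_split_ordered(data)
slow = best_split_exhaustive(data)
print(sorted(fast), split_sse(data, fast))  # ['a', 'c'] 10.5
print(split_sse(data, slow))                # 10.5
```

With four categories the ordered search scores three candidate splits, while the exhaustive search scores fourteen subsets (seven distinct splits, each seen twice as a set and its complement). The gap widens quickly as categories are added.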

For other situations, all the candidate splits must be examined.
