FREE ELECTRONIC LIBRARY - Online materials, documents

Pages:   || 2 | 3 | 4 | 5 |   ...   | 6 |

«Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

-- [ Page 1 ] --

Decision Trees for Predictive Modeling

Padraic G. Neville SAS Institute Inc. 4 August 1999

What a Decision Tree Is...................2

What to Do with a Tree................... 3

Variable selection

Variable importance

Interaction detection

Stratified modeling

Missing value imputation

Model interpretation

Predictive modeling

How to Create a Tree..................... 8

An easy algorithm

Selection of a splitting variable

Number of branches Elusive best splits Recursive partitioning Stopping Pruning Multivariate splits Missing values What Can Go Wrong.....................14 Too good a fit Spurious splits due to too many variables Spurious splits due to too many values The Frontier............................15 Combining models Ensembles Unstable algorithms Bagging Arcing Boosting Well-Known Algorithms..................18 AID, SEARCH CHAID Classification and Regression Trees ID3, C4.5, C5 QUEST OC1 SAS algorithms References.............................22 What a Decision Tree Is A decision tree as discussed here depicts rules for dividing data into groups. The first rule splits the entire data set into some number of pieces, and then another rule may be applied to a piece, different rules to different pieces, forming a second generation of pieces. In general, a piece may be either split or left alone to form a final group.

The tree depicts the first split into pieces as branches emanating from a root and subsequent splits as branches emanating from nodes on older branches. The leaves of the tree are the final groups, the unsplit nodes. For some perverse reason, trees are always drawn upside down, like an organizational chart. For a tree to be useful, the data in a leaf must be similar with respect to some target measure, so that the tree represents the segregation of a mixture of data into purified groups.

Consider an example of data collected on people in a city park in the vicinity of a hotdog and ice cream stand. The owner of the concession stand wants to know what predisposes people to buy ice cream. Among all the people observed, forty percent buy ice cream. This is represented in the root node of the tree at the top of the diagram. The first rule splits the data according to the weather. Unless it is sunny and hot, only five percent buy ice cream. This is represented in the leaf on the left branch. On sunny and hot days, sixty percent buy ice cream. The tree represents this population as an internal node that is further split into two branches, one of which is split again.

40% Buy ice cream

–  –  –

The tree displayed has four leaves. The data within a leaf are much more similar with respect to buying ice cream than the overall data with the 40/60 mix. According to the tree, people almost never buy ice cream unless the weather cooperates and either (1) they have some extra money to spend or (2) they crave the ice cream and presumably spend money irresponsibly or figure out another way of buying it. The tree is easy to understand even if the implication of craving ice cream is left to the imagination. Simple decision trees are appealing because of their clear depiction of how a few inputs determine target groups.

Trees are also appealing because they accept several types of variables: nominal, ordinal, and interval.

Nominal variables have categorical values with no inherent ordering. Ordinal variables are categorical with ordered values, for example: `cold', `cool', `warm', and `hot'. Interval variables have values that can be averaged. Temperature is an interval variable when its values are expressed in degrees. A variable can have any type regardless of whether it serves as the target or as an input. (The input variables are those available for use in splitting rules.) Other virtues of trees include their robustness to missing values and distribution assumptions about the inputs.

Trees also have their short comings. When the data contain no simple relationship between the inputs and the target, a simple tree is too simplistic. Even when a simple description is accurate, the description may not be the only accurate one. A tree gives an impression that certain inputs uniquely explain the variations in the target. A completely different set of inputs might give a different explanation that is just as good. The issues of descriptive accuracy, uniqueness, and reliability of prediction are extensively discussed in this paper.

The adjective 'decision' in "decision trees" is a curious one and somewhat misleading. In the 1960s, originators of the tree approach described the splitting rules as decision rules. The terminology remains popular. This is unfortunate because it inhibits the use of ideas and terminology from "decision theory". The term "decision tree" is used in decision theory and project management to depict a series of decisions for choosing alternative activities. The analyst creates the tree and specifies probabilities and benefits of outcomes of the activities. The software finds the most beneficial path. The project follows a single path and never performs the unchosen activities.

Decision theory is not about data analysis. The choice of a decision is made without reference to data. The trees in this article are only about data analysis. A tree is fit to a data set to enable interpretation and prediction of data. An apt name would be: data splitting trees.

What to Do with a Tree Prediction is often the main goal of data analysis. Creating a predictive model is not as automatic as one might hope. This section describes some of the practical hurdles involved and how trees are sometimes used to overcome them.

Variable selection Data often arrive at the analyst's door with lots of variables. The baggage sometimes includes a dictionary that makes uninteresting reading. Yet the analyst must find something interesting in the data. Most of the variables are redundant or irrelevant and just get in the way. A preliminary task is to determine which variables are likely to be predictive.

A common practice is to exclude input (independent) variables with little correlation to the target (dependent) variable. A good start perhaps, but such methods take little notice of redundancy among the variables or of any relationship with the target involving more than one input. Nominal variables are handled awkwardly.

An alternative practice is to use inputs that appear in splitting rules of trees. Trees notice relationships from the interaction of inputs. For example, buying ice cream may not correlate with having extra money unless the weather is sunny and hot. The tree noticed both inputs. Moreover, trees discard redundant inputs.

Sunny-and-hot and ozone-warnings might both correlate with buying ice cream, but the tree only needs one of the inputs. Finally, trees treat nominal and ordinal (categorical) inputs on a par with interval (continuous) ones. Instead of the categorical sunny-and-hot input, the split could be at some value of temperature, an interval input.

The analyst would typically use the selected variables as the inputs in a model such as logistic regression. In practice trees often provide far fewer variables than seem appropriate for a regression. The sensible solution is to include some variables from another technique, such as correlation. No single selection technique is capable of fully prophesizing which variables will be effective in another modeling tool.

Hard-core tree users find ways to get more variables out of a tree. They know how to tune tree creation to make large trees. One method uses the variables in 'surrogate' splitting rules. A surrogate splitting rule mimics a regular splitting rule. Surrogates are designed to handle missing values and are discussed more in that context. Using them to select variables is an afterthought. A surrogate variable is typically correlated with the main splitting variable, so the selected variables now have some redundancy.

Variable importance

The analyst may want the variable selection technique to provide a measure of importance for each variable, instead of just listing them. This would provide more information for modifying the selected list to accommodate other considerations of the data analysis. Intuitively, the variables used in a tree have different levels of importance. For example, whether or not a person craves ice cream is not as important as the weather in determining whether people will buy ice cream. While craving is a strong indication for some people, sunny and hot weather is a good indicator for many more people. What makes a variable important is the strength of the influence and the number of cases influenced.

A formulation of variable importance that captures this intuition is as follows: (1) Measure the importance of the model for predicting an individual. Specifically, let the importance for an individual equal the absolute value of the difference between the predicted value (or profit) of the individual with and without the model.

(2) Divide the individual importance among the variables used to predict the individual, and then (3) average the variable importance over all individuals.

To divide the importance for an individual among the variables (2), note that only the variables used in splitting rules that direct the individual from the root down the path to the leaf should share in the importance. Apportion the individual importance equally to the rules in the path. If the number of such rules is D, and no two rules use the same variable, then each variable is accredited with 1/D times the individual importance. This formulation ensures that the variable used to split the root node will end up being the most important, and that a variable gets less credit for individuals assigned to leaves deeper down the tree. Both consequences seem appropriate.

Currently no software implements this formulation of importance. Some software packages implement an older formulation that defines the importance of a splitting rule, credits that importance to the variable used in the rule, and then sums over all splitting rules in the tree. For an interval target, the importance of a split is the reduction in the sum of squared errors between the node and the immediate branches. For a categorical target, the importance is the reduction in the Gini index. Refer to Breiman et al. (1984) and Friedman (1999) for discussions about this approach.

Interaction detection Having selected inputs to a regression, the analyst will typically consider possible interaction effects.

Consider modeling the price of single family homes. Suppose that the prices of most houses in the data set are proportional to a linear combination of the square footage and age of the house, but houses bordering a golf course sell at a premium above what would be expected from their size and age. The best possible model would need a golf course indicator as input. Data rarely come with the most useful variables.

However, it seems plausible that the houses bordering the golf course are about the same size and were built about the same time. If none of the other houses of that size were built during that time, then that size and time combination provides an indication of whether the house borders the golf course. The regression should contain three variables: square footage, age, and the golf course indicator. The indicator is constructed from square footage and age and, therefore, represents an interaction between those two inputs.

The analyst of course is not told anything about any golf course, much less given any clues as to how to construct the indicator. The seasoned analyst will suspect that key information has to be inferred indirectly, but what and how is guess work. He will try multiplying size and age. No luck. He might then try creating a tree and creating an indicator for each leaf. For a particular observation, the indicator equals one if the observation is in the leaf and equals zero otherwise. The regression will contain square footage, age, and several indicators, one for each leaf in the tree. If the tree creates a leaf with only the houses bordering the golf course, then the analyst will have included the right interaction effects. The indicators for the other leaves would not spoil the fit. Indicators for nonleaf nodes are unnecessary because they equal the sum of indicators of their descendents.

This technique works when the influence of the interaction is strong enough, but it does not work otherwise.

Consider a data set containing one target Y and one input X that are related through the equation

–  –  –

where a and b are constants, and W is like the golf course indicator: it takes values 0 and 1, it is omitted from the data set, but it is 1 if and only if X is in some interval. The analyst must infer the influence of W on Y using only X.

The diagram illustrates the situation in which W equals one when X is near the middle of its range. The tree creation algorithm searches for a split on X that best separates data with small values of Y from large ones.

If b is large enough, values of Y with W = 1 will be larger than values of Y with W = 0, and the algorithm will make the first two splits on X so as to isolate data with W = 1. A tree with three leaves appears in which one leaf represents the desired indicator variable.


X When b is smaller so that data with extreme values of Y have W = 0, the data with W = 1 have less influence on the algorithm. A split that best separates extreme values of Y is generally different from a split that best detects interactions in linear relationships. A tree creation algorithm specially designed for discovering interactions is needed. Otherwise, even if the W = 1 data is eventually separated, the tree will have many superfluous leaves.

Stratified modeling The analyst preparing for a regression model faces another hidden pitfall when the data represent two populations. A different relationship between an input and the target may exist in the different populations.

In symbols, if Y = a + b X + e and Y = a + c X + e express the relationship of Y to X in the first and second population respectively, and b is very different from c, then one regression alone would fit the combined data poorly.

Pages:   || 2 | 3 | 4 | 5 |   ...   | 6 |

Similar works:

«J. Linguistics 25 (1989), 127-157. Printed in Great Britain Layers and operators in Functional Grammar1 KEES HENGEVELD Institute for General Linguistics, University of Amsterdam (Received 26 May 1988; revised 27 September 1988) 1. I N T R O D U C T I O N I have argued elsewhere (Hengeveld, 1987 b) that for a proper treatment of modality the clause model used in Functional Grammar (Dik, 1978, 1980) should be adapted in such a way that a number of different layers can be distinguished. My main...»

«HOW TO PREACH PEACE (Without Being Tuned Out) Written by the Rev. Richard G. Watts Originally published in The Pastor’s Letter, Volume 2, Number 10 (October 1981) Revised by the Rev. W. Mark Koenig, July 2003 Preaching peace does not seem all that controversial. After all, who is against peace? But what if a preacher tries to get specific about Jesus’ call for us to be not only peace lovers but also peacemakers? What if one’s commitment to peace in Christ leads to asking if the “war on...»

«SECTION 508 TIMBER PILING 508.1 Description This section describes furnishing, treating if required, driving, including preboring if required, and cutting (1) off untreated timber test piling, treated timber test piling, untreated timber foundation piling, treated timber foundation piling, untreated timber trestle piling or treated timber trestle piling.508.2 Materials 508.2.1 General (1) Furnish materials conforming to the following: Structural steel Miscellaneous metals Creosote-coal tar...»

«From the Transactions of the Bristol and Gloucestershire Archaeological Society The Reform of Gloucester’s Municipal Corporation in 1835’ (pp. 311–329). by Alan Sparkes 2007, Vol. 125, 311-329 © The Society and the Author(s) Trans. Bristol & Gloucestershire Archaeological Society 125 (2007), 311–329 The Reform of Gloucester’s Municipal Corporation in 1835 By ALAN SPARKES Gloucester’s municipal corporation evolved through a succession of medieval royal charters culminating in...»

«2015 HEART OF KANSAS ANNUAL REPORT Table of Contents 01 Many Members of One Body Letter from the DOM “Many Members of One Body” 2015 Annual Meeting Program Past Minutes AMDC Report Financial Information 2016 Recommended Shiloh Funding Budget 2016 Recommended Budget Recommendations: Officers, Committee on Committees, Time/Place/Preacher Report 02 Kingdom-Vision; Kansas-based Plan Letters from CPCs 2016 Calendar 03 Acknowledgements People Executive Board Members Participating Churches Good...»

«_ Part # 19445 Scion FR-S Subaru BRZ Sport Coil Spring Kit Thank you for your purchase from our new line of Scion parts. Please call us at 877 4NO ROLL if you have any questions regarding the service or installation of your Hotchkis Performance products.PLEASE READ THE ENTIRE MANUAL BEFORE STARTING. SPECIAL TOOLS MAY BE REQUIRED TO PROCEED. Front Coil Spring Install 1F. Raise front end of the vehicle using a floor jack or other lifting device. Chock the rear wheels to prohibit the vehicle from...»

«Wolf Laurel Road Maintenance & Security Homeowners Association Minutes of the Meeting of the Board of Directors October 21, 2016 The meeting was called to order by President Beneke at 10:00 A.M. Directors in attendance in person: Romero, Foster, McKnight, McCaghren and Moeller, Directors in attendance by telephone: Ames and Smith. Directors absent: Potts Staff in attendance: Brown, Gwozdo, and Wyatt Minutes: Upon motion made by McKnight and seconded by Foster, the minutes of September 16, 2016...»

«RAM ENERGY RESOURCES, INC. 5100 East Skelly Drive, Suite 650 Tulsa, Oklahoma 74135 To the Stockholders of RAM Energy Resources, Inc.: You are cordially invited to attend the Annual Meeting of Stockholders of RAM Energy Resources, Inc. to be held on May 5, 2011, at the Westin New York at Times Square, 270 West 43rd Street, New York, New York 10036, commencing at 10:00 a.m., local time. We look forward to personally greeting as many of our stockholders as possible at the meeting. The Notice of...»

«WORLD TRADE TN/RL/W/246 27 November 2009 ORGANIZATION (09-6018) Negotiating Group on Rules RULES NEGOTIATING GROUP INFORMAL OPEN-ENDED MEETING WITH SENIOR OFFICIALS: 25 NOVEMBER 2009 Statement by the Chairman Good morning and welcome to this informal open-ended meeting of the Negotiating Group on Rules. I would in particular like to extend a warm welcome to Senior Officials and Heads of Delegation who are here with us today. I certainly appreciate your interest in the work of this Group. The...»

«THE SHRINES OF WHITEFRIAR STREET CHURCH, DUBLIN Contents Our Lady of Mount Carmel 3 Our Lady of Dublin 5 Our Lady of Lourdes 6 Our Lady of Fatima 6 Calvary 7 The Sacred Heart of Jesus 8 St Joseph 8 St Valentine 9 St Jude 10 St Anne 12 St Thérèse of Lisieux 14 St Albert of Sicily 16 Blessed Titus Brandsma 17 St Anthony 18 The Infant of Prague 18 Pope St Pius X 19 THE SHRINE OF OUR LADY OF MOUNT CARMEL The Sacred Scriptures speak of the beauty of Mount Carmel where the Prophet Elijah defended...»

«J o i n t U K B T S / N I B SC Pr o f e s s i o n a l A d v i s o r y Co m m i t t e e (*) UKBTS General Information 03 Deviations from 4 oC temperature storage for red cells: Effect on viability and bacterial growth December 2009 Prepared by: Standing Advisory Committee on Blood Components This document will be reviewed whenever further information becomes available. Please continue to refer to the website for in-date versions.1. Introduction a) Regulatory requirements for storage temperature...»

«WHITE PAPER Enhancing Adolescent Financial Capabilities through Financial Education in Bangladesh Sajeda Amin, Laila Rahman, Sigma Ainul, Ubaidur Rob, Bushra Zaman and Rinat Akter Population Council September 2010 Page | ii CONTENTS Acknowledgments v Acronyms vii Executive summary ix Introduction 1 Objectives 1 Rationale 2 Scoping study on enhancing financial capabilities in Bangladesh 4 Evidence from the secondary data 5 Listening to girls’ voices 7 Consultative discussions with stakeholders...»

<<  HOME   |    CONTACTS
2017 www.thesis.dislib.info - Online materials, documents

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.