FREE ELECTRONIC LIBRARY - Online materials, documents

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |

«Decision Trees for Predictive Modeling Padraic G. Neville SAS Institute Inc. 4 August 1999 What a Decision Tree Is................. ...»

-- [ Page 4 ] --

The surest way to detect over-fitting is to apply the tree to new data. To detect what size of a tree provides a good predictive fit, prune the large tree to subtrees of several sizes, and apply all the subtrees to the new data. Compare the fit to the new data as the tree size increases. Prune to the sub-tree where the fit flattens out. In practice, the choice of subtree is not always clear-cut. The improvement in fit will decrease gradually, sometimes bouncing up and down around a trend. The analyst must decide on the details for choosing the tree size.

Spurious splits due to too many variables

An appealing feature of trees with univariate splits is the resilience against the curse of dimensionality: faced with many inputs, the tree creation algorithm finds a few with predictive ability. Even with more inputs than observations, an algorithm can create a useful tree. In successful examples like these, inputs with predictive ability beat out the inputs that might have only a chance association with the target. The more inputs that are irrelevant, the greater the chance that random association of some irrelevant input will be with the target, and the larger the true association of a predictive input must be to be selected for a splitting rule. When searching for a few good inputs among a thousand or more irrelevant ones, the creation of spurious splits is an important concern.

For a good split, the distribution of new data in the branches will be similar to the distribution of the training data. A spurious split will have almost no worth when evaluated on new data, and therefore the distributions of the new data in the branches will be similar to those in the node being split, and not like the distributions of the training data in the branches. Consequently, spurious splits are easy to detect when enough new data are available.

When spurious splits are detected retrospectively, the only remedy is to prune them. However, the same spurious splits could be avoided by using the new data during the split search. The algorithm would use the training data to select a split for each input and then use the new data to compare the selected splits between inputs.

When significance tests are used to evaluate a split, the test can be modified to account for the number of inputs. A simple way to do this is to multiply the p-value of a test statistic by the number of inputs before comparing the statistic to a threshold level of significance. A better method is to base the p-value on the distribution of the best V test statistics, where V equals the number of inputs. Both of these methods work by making it harder to accept any individual split. The more irrelevant inputs, the better a truly predictive split must be to be accepted. Modifying a test for more and more inputs eventually prevents accepting any input, no matter how good.

The best strategy for avoiding spurious results from too many inputs may be to reduce the number of inputs to begin with. Although trees are resilient against many inputs, they are not complete cures.

Spurious splits due to too many nominal values Spurious splits are also common using nominal inputs with many values. Consider an extreme example in which an input contains a different value for each case, as would occur with a customer id number. The algorithm is free to assign a case to any branch. The algorithm strives to assign cases with different target values to different branches. With a binary target and customer id as an input, the algorithm will do so. Of course, the splitting rule is completely unreliable.

Reliability stems ultimately from repetition: if snow has fallen every January in a certain city every year for the past ten years, it is likely to snow there next year also. If many cases occur with a particular nominal input value, and if the target value is the same for most of them, then the input value predicts the target value somewhat reliably. Even if cases with the input value have diverse target values so that the input value poorly predicts the target, no algorithm can use the input to untangle the target, and so no algorithm can fake a good prediction. The key for finding a reliable nominal split is to use only input values that are represented with many cases. Treat input values with too few cases as if they were missing values. The number of cases depends on the distribution of the target values for the given input value.

An alternative strategy to avoid spurious splits with nominal inputs is to use a significance test that takes account of the number of input values. The worth of a split is penalized more for using a nominal input with more values. The consequences are similar to those resulting from adjusting significance tests for many inputs: inputs with many nominal values are effectively eliminated. Other strategies are to combine values so as to reduce their number or just to exclude the input at the outset.

The Frontier An important new technique is emerging from research in predictive modeling: strategically re-sampling the data and fitting a model to each sample produces an ensemble that is more accurate and stable than any single model. Decision trees have been the preferred model for comprising the ensemble. The instability of trees, so disastrous for interpretation, is an asset for creating ensembles. The rest of this section explains the phenomenon more fully.

Combining models

An average of several measurements is often more accurate than a single measurement. This happens when the errors of individual measurements more often cancel each other than reinforce each other. An average is also more stable than an individual measurement: if different sets of measurements are made on the same object, their averages would be more similar than individual measurements in a single set are.

A similar phenomenon exists for predictive models: a weighted average of predictions is often more accurate and more stable than an individual model prediction. Though similar to what happens with measurements, it is less common and more surprising. A model relates inputs to a target. It seems surprising that a better relationship exists than is obtainable with a single model. Combining the models must produce a relationship not obtainable in any individual model.

An algorithm for training a model assumes some form of the relationship between the inputs and the target.

Logistic regression assumes a linear relation. Decision trees assume a constant relation within ranges of the inputs. Neural networks assume a specific nonlinear relationship that depends on the architecture and activation functions chosen for the network.

Combining predictions from two different algorithms may produce a relationship of a different form than either algorithm assumes. If two models specify different relationships and fit the data well, their average is apt to fit the data better. If not, an individual model is apt to be adequate. In practice, the best way to know is to combine some models and compare the results.


An "ensemble" or "committee" is a collection of models regarded as one combined model. The ensemble predicts an observation as an average or a vote of the predictions of the individual model predictions. The different individual models may give different weights to the average or vote.

For an interval target, an ensemble averages the predictions. For a categorical target, an ensemble can average the posterior probabilities of the target values. Alternatively, the ensemble can classify an observation into the class that most of the individual models classify it. The latter method is called voting and is not equivalent to the method of averaging posteriors. Voting produces a predicted target value but does not produce posterior probabilities consistent with combining the individual posteriors.

Unstable algorithms

Sometimes applying the same algorithm to slightly different data produces a very different model. Stepwise regression and tree-based models behave this way when two important inputs have comparable predictive ability. When a decision tree creates a splitting rule, only one input is chosen. Changing the data slightly may tip the balance in favor of choosing another input. A split on one input might segregate the data very differently than a split on the other input. In this situation, all descendent splits are apt to be different.

The unstable nature of tree-based models renders the interpretation of trees tricky. A business may continually collect new data, and a tree created in June might look very different than one created the previous January. An analyst who depended on the January tree for understanding the data is apt to become distrustful of the tree in June, unless he investigated the January tree for instability. The analyst should check the competing splitting rules in a node. If two splits are comparably predictive and the input variables suggest different explanations, then neither explanation tells the whole story.


The instability of an algorithm makes possible a different strategy for creating an ensemble of models:

instead of varying the algorithm, apply the same algorithm to different samples of the original data. If the algorithm is unstable, the different samples yield different models. The average of the predictions of these models might be better than the predictions from any single model. If the algorithm is stable, then the different samples yield similar models, and their average is no better than a single model.

The usual procedure for creating one of the sampled training sets is as follows. Pick a data case at random from the original training data set and copy it to the sampled data set. Repeat this a second time. The same case might be copied twice, but more likely two different cases will be chosen. Keep repeating this process until the sampled data set contains N cases, where N is arbitrary but is usually equal to the number of cases in the original training data set. This process is called sampling with replacement.

"Bagging" refers to the creation of an ensemble of models using the same algorithm on data sets sampled with replacement from a single training data set. Bagging stands for "bootstrap aggregation," and it is the invention of Leo Breiman (1996). He uses the voting method for classification.


Bagging uses independent samples of the original data. Arcing takes a sample that favors cases that the current ensemble predicts relatively poorly. The first model trains on all of the original data. Cases that the first model incorrectly classify are preferentially sampled for the second training data set. A second model is trained, and the two models are combined. Cases for which the combined model performs poorly are preferentially sampled for the third training data set, and so on.

Leo Breiman (1998) introduced the term "arcing" to stand for "adaptive resampling and combining," and this seemed a better description of the process of "boosting" introduced by Freund and Schapire (1995, 1996).

Unlike bagging, arcing may give different weights to the individual models. Specifying the weights and the probabilities for sampling a case completely specifies an arcing algorithm. No particular specification seems generally superior to others in practice. Arcing applies only to categorical targets.


The term "boosting" is much more commonly used than "arcing" and generally means the same thing.

However, Friedman, Hastie, and Tibshirani (1998, 1999) have reformulated the concept without reformulating a new name. Boosting, as reformulated, creates an ensemble consisting of a weighted average of models fit to data in which the target values are changed in a special way. In all other respects, the data are the same for all models.

As an example, consider minimizing the sum of squared errors between an interval target and a prediction from an ensemble. The first individual model is fit to the original data. For the training data set for the second model, subtract the predicted value from the original target value, and use the difference as the target value to train the second model. For the third training set, subtract a weighted average of the predictions from the original target value, and use the difference as the target value to train the third model, and so on.

The weights used in averaging the individual models and the prescription for creating a target value depend on the measure of loss associated with the problem. The prescription just described is used when minimizing the sum of squared errors. A different prescription is used when minimizing the sum of absolute errors. Thus, the measure of loss determines the particulars of a boosting algorithm.

Well-Known Algorithms In conversations about trees, someone is apt to refer to such-and-such algorithm. The following list contains the best known algorithms and describes how they address the main issues in this paper. Each algorithm was invented by a person or group inspired to create something better than what currently existed. Some of their opinions on strategies for creating a good tree are included. The last entry is SAS algorithms. The SAS software lets the user mix some of the better ideas of the well-known programs. It is still being improved.


James Morgan and John Sonquist proposed decision tree methodology in an article in the Journal of the American Statistical Association in 1963. They cite similar ideas of Belson (1959) and others. Their motivation has already been discussed under "Predictive modeling" in the section, "What to Do with a Tree.” Their original program was called AID, which is an acronym for "automatic interaction detection."

AID represents the original tree program in the statistical community. The third version of the program was named SEARCH and was in service from about 1971 to 1996.

The user's manual for SEARCH (1971, 1975) discusses experiments with tree methodology. The most successful idea was to specify an independent variable for a linear regression, and to search for splits that either optimize the regression fit in the branches or most differentiate slopes of the regression lines in the branches. Often an input variable "so dominates the dependent variable that the data are split on little else."

The regression technique removes the dominant effect and reveals what else matters.

Pages:     | 1 |   ...   | 2 | 3 || 5 | 6 |

Similar works:

«International Education Studies; Vol. 6, No. 4; 2013 ISSN 1913-9020 E-ISSN 1913-9039 Published by Canadian Center of Science and Education The Relationship between Educational Resources of School and Academic Achievement Havva Sebile SAVASCI1 & Ekber TOMUL2 Isaabat Primary School, Provincial Directorate of National Education, Tutak, Agrı, Turkey Faculty of Education, Mehmet Akif Ersoy University, 15100-Burdur, Turkey Correspondence: Ekber TOMUL, Faculty of Education, Mehmet Akif Ersoy...»

«Thai Forest Tradition Chants* Madison Insight Meditation Group www.vipassana.net *As recited in the tradition of Ajahn Chah, Ajahn Sumedho and Amaravati Buddhist Monastery. With gratitude to Abhayagiri Buddhist Monastery, Redwood Valley, CA, www.abhayagiri.org Thai Forest Tradition Chants Madison Insight Meditation Group www.vipassana.net MORNING CHANTING — Pali & English Dedication of Offerings Preliminary Homage to the Buddha Homage to the Buddha Homage to the Dhamma Homage to the Sangha...»

«Sermon #3178 Metropolitan Tabernacle Pulpit 1 THE PREPARATORY PRAYERS OF CHRIST NO. 3178 A SERMON PUBLISHED ON THURSDAY, DECEMBER 30TH, 1909, DELIVERED BY C. H. SPURGEON, AT THE METROPOLITAN TABERNACLE, NEWINGTON, ON THURSDAY EVENING, AUGUST 7, 1873. “Now when all the people were baptized, it came to pass that Jesus, also being baptized, and praying, the heaven was opened, and the Holy Spirit descended in a bodily shape like a dove upon Him, and a voice came from heaven, which said, You are...»

«REVIEW OF ACCIDENT PRECURSORS FOR LOSS OF CONTROL IN FLIGHT EOFDM Working Group A European Operators FDM Forum SUMMARY This report studies potential precursors that could result in loss of control accidents while the aircraft is airborne. From this list several flight data measurements are proposed for further analysis and development by EOFDM Working Group B. INTRODUCTION In the context of this study, loss of control (LOC) refers to loss of aircraft control or deviation from the intended...»

«AB56 1 Dijkstra-Scholten predicate calculus: concepts and misconceptions Lex Bijlsma and Rob Nederpelt lex.bijlsma@acm.org October, 1998 Abstract The paper focusses on the logical backgrounds of the Dijkstra-Scholten program development style for correct programs. For proving the correctness of a program (i.e. the fact that the program satisfies its specifications), one often uses a special form of predicate calculus in this style of programming. We call this the Dijkstra-Scholten (DS)...»

«TANZANIA FINAL REPORT GENERAL ELECTIONS OCTOBER 2010 EUROPEAN UNION ELECTION OBSERVATION MISSION This report was produced by the European Union Election Observation Mission to Tanzania 2010 and presents the Mission’s findings on the 31 October 2010 general elections. These views have not been adopted or in any way approved by the European Commission and should not be relied upon as a statement of the European Commission. The European Commission does not guarantee the accuracy of the data...»

«For more information on the Memphis City Council, visit www.memphiscitycouncil.com. Also at this address, you may listen to live council meetings by clicking on the green box “Listen Online” and selecting the meeting in session. To listen to past committee or council meetings, click on the item of your choice under “Archived Meetings.” Councilman Myron Lowery can be reached at myron.lowery@memphistn.gov or by calling the Council Office at (901) 576-7016 or Councilman Lowery’s cell at...»

«Shepherd College Board of Governors September 11, 2003 Agenda Item No. 3 STUDENT HOUSING PROJECT PROSPECTUS In November 2002, Shepherd College engaged Brailsford & Dunlavey (B&D) to complete a Student Housing Feasibility Study. The final plan evaluates the extent to which the Residence Life Office (RLO) is contributing to the College’s institutional goals, keeping pace with national and regional housing trends and addressing future demand for student housing. Accordingly, B&D completed a...»

«The Doctrine of the Future Chapter 54 The Return of Christ: When and How? When and how will Christ return? Could he come back at any hour? EXPLANATION AND SCRIPTURAL BASIS As we begin the final unit of this book, we turn to consider events that will happen in the future. The study of future events is often called “eschatology,” from the Greek word σχατος (G2274) which means “last.” The study of eschatology, then, is the study of “the last things.” Unbelievers can make...»

«Page 1 Hardardóttir Ph.D. thesis Metal-rich Scales in the Reykjanes Geothermal System, SW Iceland: Sulfide Minerals in a Seawater-dominated Hydrothermal Environment Vigdís Hardardóttir Thesis submitted to the Faculty of Graduate and Postdoctoral Studies In partial fulfillment of the requirements For the PhD degree in Earth Sciences Department of Earth Sciences Faculty of Science University of Ottawa © Vigdís Hardardóttir, Ottawa, Canada, 2011 Page 2 ii Hardardóttir Ph.D. thesis I...»

«ATHLETIC SCHOLARSHIP INFORMATION Stanford University Cost of Attendance (COA) and Student Budget The standard student budget (“Budget”) reflects the budget for a student possession the characteristics considered common for the typical Stanford undergraduate student, namely:  Unmarried  Financially dependent on parents  Living in a residence hall or off-campus  Attending full-time for Autumn, Winter and Spring quarters The Budget includes the actual cost of tuition and certain...»

«TWO SHORT ESSAYS Devotion in Buddhism and Courageous Faith by Nyanaponika Thera from The Vision of Dhamma ISBN 955-24-0108-9 Copyright 1994 Nyanaponika Thera BUDDHIST PUBLICATION SOCIETY KANDY SRI LANKA *** BuddhaNet Edition 1996 via BuddhaNet by arrangement with the Publisher. BuddhaNet P.O. Box K1020, Haymarket, NSW 2000, AUSTRALIA. ******** DEVOTION IN BUDDHISM The Buddha repeatedly discouraged any excessive veneration paid to him personally. He knew that an excess of purely emotional...»

<<  HOME   |    CONTACTS
2017 www.thesis.dislib.info - Online materials, documents

Materials of this site are available for review, all rights belong to their respective owners.
If you do not agree with the fact that your material is placed on this site, please, email us, we will within 1-2 business days delete him.