Decision Trees for Predictive Modeling
Padraic G. Neville, SAS Institute Inc., 4 August 1999


Many available algorithms do not attempt to find the best split on a variable. Such search methods are called heuristic because they do not guarantee optimality. For a nominal input variable, a common theme is first to form a split by assigning a different branch to each input category, and then to derive a sequence of candidate splits by combining two branches of one split to produce the next split in the sequence. Algorithms differ in the details of this theme. Most heuristic algorithms adopt the best split examined. The original CHAID algorithm may be the only exception: CHAID adopts the last split examined.
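
This branch-merging theme can be sketched in a few lines of Python. The function names, the Gini criterion, and the greedy choice of which pair of branches to merge are assumptions of the sketch rather than the details of any particular algorithm:

```python
from collections import Counter

def gini(counts):
    """Gini impurity of a class-count dict."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_impurity(branches):
    """Size-weighted Gini over a list of class-count Counters."""
    total = sum(sum(b.values()) for b in branches)
    return sum(sum(b.values()) / total * gini(b) for b in branches)

def heuristic_nominal_split(rows):
    """rows: list of (category, class_label) pairs.
    Start with one branch per category, then repeatedly merge the
    pair of branches whose merger raises weighted impurity least.
    Returns every candidate split examined, best first."""
    by_cat = {}
    for cat, y in rows:
        by_cat.setdefault(cat, Counter())[y] += 1
    branches = [({c}, counts) for c, counts in by_cat.items()]
    candidates = []
    while len(branches) >= 2:
        split = [set(cats) for cats, _ in branches]
        candidates.append((weighted_impurity([b for _, b in branches]), split))
        if len(branches) == 2:
            break
        # find the cheapest pair of branches to merge
        best = None
        for i in range(len(branches)):
            for j in range(i + 1, len(branches)):
                merged = branches[:i] + branches[i+1:j] + branches[j+1:]
                merged.append((branches[i][0] | branches[j][0],
                               branches[i][1] + branches[j][1]))
                imp = weighted_impurity([b for _, b in merged])
                if best is None or imp < best[0]:
                    best = (imp, merged)
        branches = best[1]
    return sorted(candidates, key=lambda c: c[0])
```

A CHAID-like variant would instead stop at, and adopt, the last split in the sequence that passes a significance test.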

Recursive partitioning

The easy algorithm first splits the entire data set, the root node, into two nodes. Each node is then split into more nodes, and so on. Each node in its turn is considered in isolation. The search for a split is done without any concern as to how another node is to be split. This process is called recursive partitioning and the great majority of tree creation algorithms do it.
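
A minimal sketch of recursive partitioning, in the spirit of the easy algorithm, might look as follows. The threshold search, the misclassification criterion, and the twenty-percent stopping rule are simplifying assumptions made for the sketch:

```python
def best_threshold(xs, ys):
    """Exhaustive search for the numeric threshold that minimizes
    total misclassification in the two branches (a toy criterion)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left  = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = (len(left)  - max(left.count(c)  for c in set(left)) +
               len(right) - max(right.count(c) for c in set(right)))
        if best is None or err < best[0]:
            best = (err, t)
    return best  # (error, threshold)

def grow(xs, ys, min_frac=0.2, _n_root=None):
    """Recursive partitioning: split this node, then recurse into each
    branch independently, ignoring how sibling nodes will be split."""
    n_root = _n_root or len(ys)
    if len(ys) < min_frac * n_root or len(set(xs)) < 2:
        return {"leaf": max(set(ys), key=ys.count)}   # majority class
    err, t = best_threshold(xs, ys)
    L = [(x, y) for x, y in zip(xs, ys) if x <= t]
    R = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {"split": t,
            "left":  grow([x for x, _ in L], [y for _, y in L], min_frac, n_root),
            "right": grow([x for x, _ in R], [y for _, y in R], min_frac, n_root)}

def predict(tree, x):
    """Follow the splits down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["split"] else tree["right"]
    return tree["leaf"]
```

Note that each call to `grow` sees only its own node's data, which is exactly the myopia discussed next.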

Recursive partitioning is myopic: it focuses on optimizing individual splits and pays no attention to the quality of the entire tree. Alternatives to recursive partitioning have not become popular though. They are hard to implement and take much longer to run. Moreover, recursive partitioning generally gives adequate results.

The first attempt to cure the myopia is to "look ahead one generation" before committing to a split. A candidate split creates two branches. A binary split is then found in each branch. The evaluation of the candidate split is based on the four nodes in the second generation. James Morgan and John Sonquist implemented look-ahead split searches in the AID program in the 1960s. The user's manual by Sonquist et al. (1971) indicates that looking ahead does not help much. A recent, more extensive study by Murthy came to the same conclusion and reported that look-ahead splits sometimes created worse trees.
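
A one-generation look-ahead can be sketched as follows: a candidate split is scored not by its own two branches but by the combined error of the best splits inside each branch, that is, by the four second-generation nodes. The function names and the misclassification criterion are assumptions of the sketch:

```python
def misclass(ys):
    """Misclassifications of a node labeled by its majority class."""
    return len(ys) - max(ys.count(c) for c in set(ys)) if ys else 0

def best_split_score(xs, ys):
    """Error of the best single threshold split (no look-ahead)."""
    scores = [misclass([y for x, y in zip(xs, ys) if x <= t]) +
              misclass([y for x, y in zip(xs, ys) if x > t])
              for t in sorted(set(xs))[:-1]]
    return min(scores) if scores else misclass(ys)

def lookahead_score(xs, ys, t):
    """Score candidate threshold t by the four second-generation
    nodes: the best-split error inside each branch, summed."""
    L = [(x, y) for x, y in zip(xs, ys) if x <= t]
    R = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (best_split_score([x for x, _ in L], [y for _, y in L]) +
            best_split_score([x for x, _ in R], [y for _, y in R]))
```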

Chipman, George, and McCulloch (1998) describe an algorithm that searches for an optimal tree. Instead of evaluating candidate splits, their algorithm evaluates candidate trees. One tree is derived from another using a simple operation selected at random from a set of possible operations. An example of an operation is selecting a particular variable and forming a random split on that variable. Another operation is to prune a split, transforming an internal node into a leaf. Thus, some operations create larger trees and some create smaller ones. The algorithm is sometimes successful for small trees. Whether the algorithm can examine enough large trees to be useful remains to be seen.

Kristin Bennett (1994) implemented an algorithm for finding globally optimal splits in a tree given the tree structure. The structure consists of the arrangement of nodes and the predicted values assigned to each leaf.

The splits are initially undefined, and therefore the nodes are not initially associated with data. The algorithm defines splits that segregate the data into the nodes in a way that is most consistent with the preassigned predictions in the leaves. A recursive partitioning algorithm can create an initial structure.


Stopping rules

The size of a tree may be the most important single determinant of quality, more important, perhaps, than creating good individual splits. Trees that are too small do not describe the data well. Trees that are too large have leaves with too little data to make reliable predictions when the tree is applied to a new sample. Splits deep in a large tree may be based on too little data to be reliable.

The easy algorithm tries to avoid making trees too small or too large by splitting any node that contains at least twenty percent of the original data. This strategy is easily foiled. Twenty percent of the data is too few cases when the original data set is small and too many when it is large. With some data sets, the algorithm will split a node into branches in which one branch has only a couple of cases. Many splits are then required to whittle the thick branches down to the twenty percent limit, so most of the data end up in tiny leaves.

Certain limitations on partitioning seem necessary:

◊ the minimum number of cases in a leaf
◊ the minimum number of cases required in a node being split
◊ the depth of any leaf from the root node

Appropriate values for these limits depend on the data. For example, a search for a reliable split needs less data the more predictable the data are. A stopping rule that accounts for the predictive reliability of the data seems like a good idea. For this purpose, the CHAID algorithm applies a test of statistical significance to candidate splits. Partitioning stops when no split meets a threshold level of significance. Unfortunately, this stopping rule turns out to be problematic in three related aspects: the choice of statistical test, accommodations for multiple comparisons, and the choice of a threshold.
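
A CHAID-style stopping rule combining these limits might be sketched as below. The particular limits, the 2x2 chi-squared test, and the 0.05 threshold are illustrative assumptions, not prescriptions:

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 contingency table
    [[a, b], [c, d]] (two branches by two target classes)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

CHI2_CRIT_DF1_05 = 3.841  # chi-squared critical value, df=1, alpha=0.05

def should_stop(node_size, depth, best_split,
                min_split=20, min_leaf=5, max_depth=6):
    """Stop on any size or depth limit, or when the best candidate
    split fails the significance threshold.
    best_split = (a, b, c, d): class counts in the two branches."""
    if node_size < min_split or depth >= max_depth:
        return True
    a, b, c, d = best_split
    if min(a + b, c + d) < min_leaf:   # a branch would be too small
        return True
    return chi2_2x2(a, b, c, d) < CHI2_CRIT_DF1_05
```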

A test of significance is problematic first because a test must assume some joint distribution of the target and an input. The assumptions become important in the smaller nodes, which are the more numerous down the tree. The appropriate distribution is generally unknown. Worse, the distributions will differ in different nodes when the tree segregates different subpopulations down different branches.

For algorithms that search for a split on every input variable, the test of significance is performed on the maximum value of the splits on the various inputs. Using the distribution of the extreme values of V test statistics, where V equals the number of inputs, is more appropriate than using the distribution for a single test statistic. Such tests seem impractical for computational reasons. Even if an appropriate test were used, the test would be applied thousands of times in the same tree: once for each input in each node. These criticisms of using significance tests persuade some people but not others. Advocates of the tests adjust the threshold significance levels to accommodate multiple tests.
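
The classical multiple-comparison corrections that such adjustments rely on are easy to state. The sketch below shows the Bonferroni and Sidak adjustments for the best of V tests, assuming (unrealistically, as noted above) that the tests are independent:

```python
def bonferroni(p, v):
    """Bonferroni-adjusted p-value for the best of v tests."""
    return min(1.0, p * v)

def sidak(p, v):
    """Sidak adjustment: exact for the minimum of v independent
    p-values, since P(min <= p) = 1 - (1 - p)^v."""
    return 1.0 - (1.0 - p) ** v
```

With v inputs, a raw significance level of 0.01 for the winning split corresponds to an adjusted level of roughly 0.01 * v, which is why a fixed threshold becomes harder to meet as inputs are added.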

Pruning

Many practitioners contend that stopping rules cannot work. The inherent fallacy is the assumption that an appropriate threshold may be set without much understanding of the data.

The alternative is to grow a tree that is too large, and then prune the tree to the right size. All candidate subtrees pruned from the large tree are available at the same time for comparison, and this gives retrospective pruning an advantage over a stopping rule that can consider only a single node at one time. The pruning process may evaluate entire subtrees instead of individual splits. An evaluation measure may be chosen that is more relevant to the end use of the tree than the splitting criterion. The proportion of cases misclassified is a common choice.

Stopping or pruning is necessary to avoid over-fitting the data. A model "over-fits" when the good fit to the training data (the data used to create the model) is not replicated when the model is applied to a different sample. Ideally, the pruning process uses a separate "pruning data set" to evaluate subtrees. Subtrees that over-fit the training data will evaluate poorly on the pruning data. Ross Quinlan (1987) may have published the first algorithm that finds the subtree with the smallest error on pruning data. He calls it reduced-error pruning.
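
Reduced-error pruning can be sketched as a bottom-up pass over the tree. For brevity, this sketch takes the candidate leaf's class from the pruning data rather than storing the training-data majority in each node, which is a simplifying assumption:

```python
def predict(tree, x):
    """Follow the splits down to a leaf."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["split"] else tree["right"]
    return tree["leaf"]

def errors(tree, data):
    """Misclassifications of tree on a list of (x, y) pairs."""
    return sum(predict(tree, x) != y for x, y in data)

def reduced_error_prune(tree, prune_data):
    """Bottom-up: replace a subtree with a majority-class leaf whenever
    the leaf misclassifies no more pruning cases than the subtree."""
    if "leaf" in tree:
        return tree
    left_data  = [(x, y) for x, y in prune_data if x <= tree["split"]]
    right_data = [(x, y) for x, y in prune_data if x >  tree["split"]]
    tree = {"split": tree["split"],
            "left":  reduced_error_prune(tree["left"],  left_data),
            "right": reduced_error_prune(tree["right"], right_data)}
    ys = [y for _, y in prune_data]
    if ys:
        leaf = {"leaf": max(set(ys), key=ys.count)}
        if errors(leaf, prune_data) <= errors(tree, prune_data):
            return leaf
    return tree
```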

The original pruning algorithm is due to Breiman et al. (1984). Evaluating all possible subtrees may have been too time-consuming in the 1970s, and so the authors found a fast method of creating a sequence of subtrees of different sizes. Each subtree is a subtree of all the larger ones in the sequence. More importantly, each subtree obtains the optimum assessment among all possible subtrees of its size, based on the training data. Each subtree in the sequence is then applied to a fresh sample, and the larger tree is pruned to the winning subtree. This method is called cost-complexity pruning, and it results in different trees than reduced-error pruning.

Sometimes tree creation must use all the available data, so no pruning data exist. Ross Quinlan's C4.5 and C5 artificially inflate the error rate of the training data in each node and evaluate subtrees with the inflated rate. This is called pessimistic pruning. Breiman et al. use a cross validation method to choose the subtree from their nested sequence of subtrees.
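
The heart of the cost-complexity sequence is the weakest-link computation: for each internal node, the increase in training error from pruning the node to a leaf, divided by the number of leaves removed. Pruning the node with the smallest such ratio yields the next subtree in the sequence. The sketch below assumes each node carries an 'err' field, the node's misclassification count if it were made a leaf:

```python
def leaves_and_error(tree):
    """(number of leaves, total training misclassifications)
    of a subtree, summing over its leaves."""
    if "leaf" in tree:
        return 1, tree["err"]
    l_n, l_e = leaves_and_error(tree["left"])
    r_n, r_e = leaves_and_error(tree["right"])
    return l_n + r_n, l_e + r_e

def weakest_link(tree, path=()):
    """Find the internal node with the smallest
    alpha = (R(node) - R(subtree)) / (leaves - 1):
    pruning it costs the least error per leaf removed."""
    if "leaf" in tree:
        return None
    n, e = leaves_and_error(tree)
    alpha = (tree["err"] - e) / (n - 1)
    best = (alpha, path)
    for side in ("left", "right"):
        cand = weakest_link(tree[side], path + (side,))
        if cand is not None and cand[0] < best[0]:
            best = cand
    return best
```

Repeatedly pruning the weakest link produces the nested sequence of subtrees from which fresh data, or cross-validation, picks the winner.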

Multivariate splits

The easy algorithm uses a single input to define a splitting rule. Such univariate splits are easy to understand, are tolerant of missing data (because missing values in any other input are of no concern), and implicitly select a small number of inputs to create an entire tree, thereby reducing subsequent demands on data collection and analysis.

Defining splitting rules with many variables also has merit. The OC1 algorithms of Murthy et al. (1994) and the SVM tree algorithms of Bennett (1997) both base splits on linear combinations of variables and produce trees that are often more accurate and smaller than univariate trees. Smaller trees are easier to understand.
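
A linear-combination split can be illustrated with a simple perceptron-style search. This is a stand-in for the more sophisticated searches in OC1 and the SVM-based algorithms, chosen for the sketch because it fits in a few lines:

```python
def train_linear_split(rows, epochs=200, lr=0.1):
    """Perceptron-style search for a linear-combination split:
    route left when w . x + b <= 0, right otherwise.
    rows: list of (feature tuple, label in {-1, +1})."""
    dim = len(rows[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in rows:
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * s <= 0:          # misclassified: nudge the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def branch(w, b, x):
    """Branch assignment made by the multivariate split."""
    return "left" if sum(wi * xi for wi, xi in zip(w, x)) + b <= 0 else "right"
```

No single-variable threshold separates classes that differ along a diagonal of two inputs, but one oblique split does, which is how multivariate trees end up smaller.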

Smaller trees segregate the data into fewer leaves. Describing the data in a leaf in terms of the variables defining the leaf is cumbersome with many defining variables, whether the leaf is in a large tree with univariate splits or a small tree with multivariate splits. The leaves may be easier to understand by plotting the distribution of variables in the leaves, as done in clustering. Having fewer leaves to profile is helpful.

Missing values

Decision trees accommodate missing values very well compared to other modeling methods. The simplest strategy is to regard a missing value as a special nonmissing value. For a nominal input, missing values simply constitute a new categorical value. For an input whose values are ordered, the missing values constitute a special value that is assigned a place in the ordering that yields the best split. The place is generally different in different nodes of the tree.
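
For an ordered input, placing the missing values can be sketched by trying each side of each candidate threshold and keeping the best combination. The function names and the Gini criterion are assumptions of the sketch:

```python
def gini(ys):
    """Gini impurity of a list of class labels."""
    if not ys:
        return 0.0
    return 1.0 - sum((ys.count(c) / len(ys)) ** 2 for c in set(ys))

def split_with_missing(rows):
    """rows: list of (value-or-None, label). Try every threshold on the
    known values, and try sending the missing group down each branch;
    keep the best combination."""
    known = [(x, y) for x, y in rows if x is not None]
    miss_ys = [y for x, y in rows if x is None]
    best = None
    for t in sorted({x for x, _ in known})[:-1]:
        left  = [y for x, y in known if x <= t]
        right = [y for x, y in known if x > t]
        for side in ("left", "right"):
            l = left + miss_ys if side == "left" else left
            r = right + miss_ys if side == "right" else right
            n = len(l) + len(r)
            imp = len(l) / n * gini(l) + len(r) / n * gini(r)
            if best is None or imp < best[0]:
                best = (imp, t, side)
    return best  # (impurity, threshold, branch receiving missing values)
```

In the income example, undisclosed incomes that behave like large incomes end up assigned to the high-income branch, exactly because that placement yields the purer split.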

The strategy pays off when missing values are indicative of certain target values. For example, people with large incomes might be more reluctant to disclose their income than people with ordinary incomes. If income were predictive of a target, then missing income would be predictive of the target, and the missing values would be regarded as a special large income value. The strategy seems harmless when the distribution of missing values is uncorrelated with the target because no choice of branch for the missing values would help predict the target.

Regarding missing values as a special value is sometimes inappropriate. If a large proportion of values is missing, then they may unduly influence the creation of the split. Evaluating a split using only known information improves credibility. Excluding observations with missing values is feasible with univariate trees because only observations missing on a single input are excluded at any one time. This compares favorably with other modeling techniques that exclude observations missing any input value, which may leave very few usable observations.

When missing values are excluded from the split search, a new policy is needed for assigning missing observations to branches. One policy is to use surrogate splitting rules. A surrogate rule substitutes for the main rule when the main rule cannot handle an observation. A good surrogate rule is one that mimics the main rule very well, even if it does not define a good split in its own right. For example, the main rule might split on county, and the surrogate might split on region. The surrogate applies to observations with an unknown county and a known region. The surrogate might be less effective as a main splitting rule because region represents coarser information than county.
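
The surrogate idea can be sketched as follows: score a candidate surrogate by how often it agrees with the main rule on observations where both apply, and route an observation with the first rule able to handle it. The rule representation (a function returning a branch or None) is an assumption of the sketch:

```python
def agreement(main_rule, surrogate, rows):
    """Fraction of observations, among those both rules can handle,
    on which the surrogate sends the case to the same branch."""
    usable = [r for r in rows if main_rule(r) is not None
                             and surrogate(r) is not None]
    if not usable:
        return 0.0
    return sum(main_rule(r) == surrogate(r) for r in usable) / len(usable)

def route(main_rule, surrogates, row):
    """Apply the main rule; fall back to the first surrogate
    (ordered best-agreement-first) that can handle the observation."""
    for rule in [main_rule] + surrogates:
        b = rule(row)
        if b is not None:
            return b
    return "left"   # last resort, e.g. the largest branch
```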

The surrogate policy relies on redundant inputs. Another policy is needed when no good surrogate exists.

Assigning all the observations with missing values to a single branch is apt to reduce the purity of the branch, thereby degrading the split. If this is unavoidable, assigning them to the largest branch dilutes node purity the least. Another idea is to split an individual observation and assign the pieces to all branches in proportion to the size of each branch. For a binary split, the observation is replaced by two observations with fractional sizes, one assigned to each branch. The prediction for the observation is then the weighted average of the predictions for the fractional observations.
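
The fractional-observation idea can be sketched as below, assuming each split node stores p_left, its left branch's share of the training cases, and each leaf stores a predicted probability p (both names are conventions of the sketch):

```python
def send_down(tree, x, weight=1.0):
    """Return the [(leaf, weight)] pieces for one observation.
    A missing value at a split sends weighted fractions down both
    branches, in proportion to each branch's training share."""
    if "leaf" in tree:
        return [(tree, weight)]
    if x is None:
        p = tree["p_left"]
        return (send_down(tree["left"],  x, weight * p) +
                send_down(tree["right"], x, weight * (1 - p)))
    side = "left" if x <= tree["split"] else "right"
    return send_down(tree[side], x, weight)

def predict_proba(tree, x):
    """Weighted average of leaf predictions over all fractional pieces."""
    return sum(leaf["p"] * w for leaf, w in send_down(tree, x))
```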

What Can Go Wrong

Trees may deceive. They may fit the data well but then predict new data worse than having no model at all. This is called over-fitting. They may fit the data well, predict well, and convey a good story, but then, if some of the original data are replaced with a fresh sample and a new tree is created, a completely different tree may emerge, using completely different inputs in the splitting rules and consequently conveying a completely different story. This is called instability.

A little education renders all this harmless. Over-fitting is the more serious issue. The remainder of this section discusses how to detect and avoid it. Instability is a simpler issue. It occurs when two inputs have similar predictive ability. The tree can choose only one. The story conveyed is incomplete. The solution is simply to understand that the story is too simplistic when conflicting variables have comparable predictive ability. Instability is actually beneficial for the new prediction techniques based on re-sampling. The phenomenon is discussed fully in the section “The Frontier.”

Too good a fit

Over-fitting occurs when the tree fits random variations of the target values in the training data that are not replicated in other samples. The tree fits the training data deceptively well. Over-fitting could be largely due to a single spurious split or to the accumulation of small errors from many splits. Splits generally perform better on the training data than on new data. Even a split representing a "true" prediction performs slightly worse on new data because some randomness exists in the data and the split fits some of that.
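
The gap between training and new-data performance can be demonstrated with a deliberately extreme pair of models: one that memorizes every training case, noise included, and one single split. The data and the model names are constructed for the illustration:

```python
def fit_memorizer(rows):
    """One leaf per distinct x: fits the training data perfectly,
    noise included -- the over-fitting extreme."""
    return {x: y for x, y in rows}

def fit_stump(rows, t):
    """A single split at threshold t: majority class per branch."""
    left  = [y for x, y in rows if x <= t]
    right = [y for x, y in rows if x > t]
    maj = lambda ys: max(set(ys), key=ys.count)
    return {"t": t, "left": maj(left), "right": maj(right)}

def err_memo(model, rows):
    """Misclassification rate of the memorizer."""
    return sum(model.get(x) != y for x, y in rows) / len(rows)

def err_stump(model, rows):
    """Misclassification rate of the one-split stump."""
    return sum((model["left"] if x <= model["t"] else model["right"]) != y
               for x, y in rows) / len(rows)

# True rule: class "a" below 5, "b" from 5 up; two training cases are noise.
train = [(x, "a") for x in range(5)] + [(x, "b") for x in range(5, 10)]
train[2] = (2, "b")      # noise
train[7] = (7, "a")      # noise
test = [(x, "a") for x in range(5)] + [(x, "b") for x in range(5, 10)]

memo, stump = fit_memorizer(train), fit_stump(train, 4)
```

The memorizer has zero training error but errs on new data exactly where it fit the noise; the stump fits the training data worse and the new data better.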

Smaller nodes are more prone to mistakes.
