Decision Trees for Predictive Modeling
Padraic G. Neville
SAS Institute Inc.
4 August 1999

References
Belson, W. A. (1959), "Matching and Prediction on the Principle of Biological Classification," Applied Statistics, 8, 65-75. Probably the first published use of decision tree ideas.
Bennett, K.P. (1994), "Global Tree Optimization: A Non-Greedy Decision Tree Algorithm," Computing Science and Statistics, 26, 156-160.
Bennett, K.P. and Blue, J. (1997), "A Support Vector Machine Approach to Decision Trees." Available from http://postulate.math.rpi.edu/~bennek.
Berry, M. and Linoff, G. (1997), Data Mining Techniques, New York: John Wiley and Sons, Inc. Contains a non-mathematical introduction to the methods of Breiman et al., Kass (CHAID), and Quinlan (C4.5).
Breiman, L. (1996), "Bagging Predictors," Machine Learning, 24(2), 123-140.
Breiman, L. (1998), "Arcing Classifiers," Annals of Statistics, 26, 801-824. Includes a discussion by Freund and Schapire. Explore ftp.stat.Berkeley.edu/pub/users/breiman.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984), Classification and Regression Trees, Belmont, California: Wadsworth, Inc.
Carson, R.T. (1984), "Compensating for Missing Data and Invalid Responses in Contingent Valuation Surveys," Proceedings of the Survey Research Section of the American Statistical Association. An obscure paper that compares tree-based imputation of missing values with expectation-maximization methods.
Chipman, H., George, E.I., and McCulloch, R.E. (1998), "Bayesian CART Model Search (with discussion)," Journal of the American Statistical Association, 93, 935-960. Randomly selects splitting and pruning operations to search for a good tree.
Devroye, L., Gyorfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition, New York: Springer-Verlag Inc. Excellent text for those comfortable with a measure-theoretic formulation of probability theory. Argues (page 338) that impurity measures as used by Breiman et al. and Quinlan are too easily fooled and should therefore be avoided. Applies Vapnik-Chervonenkis theory to construct trees that, with enough data, approach the limiting error rate (Bayes risk) inherent in the data.
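For concreteness, the impurity measures at issue can be stated in a few lines of Python; this is a minimal sketch, not code from the book, and it assumes the class proportions in a node have already been computed:

    import math

    def gini(proportions):
        # Gini index, the impurity measure of Breiman et al.
        return 1.0 - sum(p * p for p in proportions)

    def entropy(proportions):
        # Entropy in bits, the impurity measure behind Quinlan's information gain.
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    # Both are maximal for an evenly mixed node and zero for a pure one.
    print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 1.0
    print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0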
Esposito, F., Malerba, D., and Semeraro, G. (1997), "A Comparative Analysis of Methods for Pruning Decision Trees," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476-491.
Freund, Y. (1995), "Boosting a Weak Learning Algorithm by Majority," Information and Computation, 121, 256-285. Introduces boosting by majority, an early precursor of AdaBoost, in the context of "probably approximately correct" concept-learning models.
Freund, Y. and Schapire, R. (1996), "Experiments with a New Boosting Algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, 148-156. Presents AdaBoost.
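A sketch of the AdaBoost reweighting loop for a two-class target coded as -1/+1; the choice of a depth-one tree (a "stump") as the weak learner and all names here are illustrative assumptions, not taken from the paper:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost(X, y, n_rounds=50):
        # AdaBoost for labels in {-1, +1}; returns the stumps and their vote weights.
        n = len(y)
        w = np.full(n, 1.0 / n)               # start with uniform case weights
        stumps, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = np.sum(w[pred != y])        # weighted training error
            if err == 0 or err >= 0.5:        # perfect, or no better than chance
                break
            alpha = 0.5 * np.log((1 - err) / err)
            w *= np.exp(-alpha * y * pred)    # up-weight misclassified cases
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def predict(stumps, alphas, X):
        # Weighted vote of the stumps.
        return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))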
Friedman, J.H. (1999), "Greedy Function Approximation: A Gradient Boosting Machine." Postscript technical report trebst.ps available from http://www.stat.stanford.edu/~jhf/ftp. Complete mathematical reformulation of boosting as a series of models fit to data whose target values are modified at each step.
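Under squared-error loss the reformulation reduces to repeatedly fitting a small tree to the current residuals. A hedged sketch follows; the tree depth and shrinkage value are illustrative choices, not prescribed by the report:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_trees=100, shrinkage=0.1):
        # Each tree is fit to the residuals left by the current ensemble,
        # i.e. to the "modified target values" of the reformulation.
        pred = np.full(len(y), y.mean())      # start from the constant model
        trees = []
        for _ in range(n_trees):
            residuals = y - pred              # negative gradient of squared error
            tree = DecisionTreeRegressor(max_depth=3)
            tree.fit(X, residuals)
            pred += shrinkage * tree.predict(X)
            trees.append(tree)
        return y.mean(), trees

    def predict(base, trees, X, shrinkage=0.1):
        return base + shrinkage * sum(t.predict(X) for t in trees)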
Friedman, J.H., Hastie, T., and Tibshirani, R. (1998), "Additive Logistic Regression: a Statistical View of Boosting." Technical report available from ftp://stat.stanford.edu/pub/friedman/boost.ps.Z. Mathematical reformulation of boosting techniques that contains some of the basic ideas of the 1999 Friedman technical report.
Hunt, E.B. and Hovland, C.I. (1961), "Programming a Model of Human Concept Formation," Proceedings of the Western Joint Computer Conference, 145-155. First published implementation of a concept-learning program. The program constructs binary trees from data with a binary target and nominal inputs. The information is also in the book by Hunt, Marin, and Stone (1966).
Hunt, E.B., Marin, J., and Stone, P.J. (1966), Experiments in Induction, San Diego, CA: Academic Press, Inc. Seminal text on Concept Learning Systems (CLS). A CLS tries to mimic the human process of learning a concept, starting with examples from two classes and then inducing a rule to distinguish the classes based on other attributes. The book deals with automatically constructing binary decision trees for a binary target. Hunt was Ross Quinlan's Ph.D. advisor, and Quinlan traces his work on trees (ID3, C5) to Hunt's ideas about CLS.
Jensen, D.D. and Cohen, P.R. (1999), "Multiple Comparisons in Induction Algorithms," Machine Learning (to appear). Excellent discussion of the bias inherent in selecting an input. Explore http://www.cs.umass.edu/~jensen/papers.
Kass, G.V. (1975), "Significance Testing in Automatic Interaction Detection (A.I.D.)," Applied Statistics, 24, 178-189. Computes a distribution appropriate for selecting the best split on several inputs.
Kass, G.V. (1980), "An Exploratory Technique for Investigating Large Quantities of Categorical Data," Applied Statistics, 29, 119-127. Standard reference for the CHAID algorithm Kass described in his 1975 Ph.D. thesis.
Loh, W. and Shih, Y. (1997), "Split Selection Methods for Classification Trees," Statistica Sinica, 7, 815-840. Introduces the QUEST algorithm. Refer to http://www.stat.wisc.edu/~loh.
Loh, W. and Vanichsetakul, N. (1988), "Tree-Structured Classification Via Generalized Discriminant Analysis," Journal of the American Statistical Association, 83, 715-728.
Morgan, J.N. and Messenger, R.C. (1973), "THAID: A Sequential Analysis Program for the Analysis of Nominal Scale Dependent Variables," Survey Research Center, Institute for Social Research, University of Michigan. Cited by Kass (1980) as suggesting treating missing values as a special, acceptable value while searching for a split.
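The suggestion amounts to letting "missing" act as one more level of a nominal input when candidate splits are evaluated. A small sketch with hypothetical data and names:

    # Treat missing (None) as an ordinary category while searching for a split.
    def split_counts(values, target):
        # Cross-tabulate a nominal input against a 0/1 target,
        # mapping None to the extra level "MISSING".
        counts = {}
        for v, t in zip(values, target):
            level = "MISSING" if v is None else v
            counts.setdefault(level, [0, 0])[t] += 1
        return counts

    region = ["north", None, "south", "north", None]
    bought = [1, 0, 1, 0, 0]
    # Each level maps to [count of target 0, count of target 1].
    print(split_counts(region, bought))
    # {'north': [1, 1], 'MISSING': [2, 0], 'south': [0, 1]}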
Morgan, J.N. and Sonquist, J.A. (1963), "Problems in the Analysis of Survey Data, and a Proposal," Journal of the American Statistical Association, 58, 415-434. Thoughtful discussion of the limitations of additive models, and a seminal proposal for sometimes using decision trees in their stead. Still worth reading.
Murthy, S.K. (1998), "Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey," Data Mining and Knowledge Discovery, 2, 345-389. Extensive survey of ideas for constructing decision trees.
Murthy, S.K. and Salzberg, S. (1995), "Lookahead and Pathology in Decision Tree Induction," Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence.
Murthy, S., Kasif, S., Salzberg, S., and Beigel, R. (1994), "OC1: Randomized Induction of Oblique Decision Trees," Journal of Artificial Intelligence Research. Refer to http://www.cs.jhu.edu/~murthy/ for more OC1 links.
Neville, P. (1999), "Growing Trees for Stratified Modeling," Computing Science and Statistics, 30. Extends the ideas of Sonquist et al. and Alexander and Grimshaw to more independent variables and more types of models.
Quinlan, J.R. (1979), "Discovering Rules from Large Collections of Examples: A Case Study," in Expert Systems in the Micro Electronic Age, ed. D. Michie, Edinburgh: Edinburgh University Press. Introduces the ID3 (Iterative Dichotomiser 3) algorithm.
Quinlan, J.R. (1987), "Simplifying Decision Trees," International Journal of Man-Machine Studies, 27, 221-234. Introduces reduced-error pruning.
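Reduced-error pruning works bottom-up on a holdout sample: a subtree is collapsed to a leaf whenever doing so does not increase the holdout error. A minimal sketch, assuming a dict-based binary tree in which every node stores its majority class; the representation is an illustrative assumption, not Quinlan's:

    def classify(node, x):
        # Follow splits until a leaf is reached.
        while "label" not in node:
            node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
        return node["label"]

    def errors(node, data):
        # Holdout misclassifications of the subtree rooted at node.
        return sum(classify(node, x) != y for x, y in data)

    def prune(node, data):
        if "label" in node:
            return node                    # already a leaf
        # Route holdout cases to the children and prune bottom-up.
        left = [(x, y) for x, y in data if x[node["feature"]] <= node["threshold"]]
        right = [(x, y) for x, y in data if x[node["feature"]] > node["threshold"]]
        node["left"] = prune(node["left"], left)
        node["right"] = prune(node["right"], right)
        leaf = {"label": node["majority"]}
        if errors(leaf, data) <= errors(node, data):
            return leaf                    # collapsing does not hurt: prune
        return node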
Quinlan, J.R. (1993), C4.5: Programs for Machine Learning, San Mateo, California: Morgan Kaufmann Publishers, Inc. His web site is http://www.rulequest.com.
Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Sonquist, J.A., Baker, E.L., and Morgan, J.N. (1971, 1975), Searching for Structure, Ann Arbor: Institute for Social Research, The University of Michigan. Out of print and hard to locate. User's manual for the SEARCH (alias AID-III) tree-construction program in OSIRIS. An initial section relates their experience with