Random Forests

Combining tree results using several trees:

[Figure: three trees with weights .2, .3, and .5, splitting on various of the variables X1 through X5. A blue dot in each tree marks the leaf in which observation 1 lands; the leaf probabilities Pr{Y1 = 1} are .6 in tree 1, .3 in tree 2, and .8 in tree 3.]

Data and tree results for four observations (tree weights .2, .3, .5):

Obs  Yi  X1i  X2i  X3i  X4i  X5i   p1   p2   p3   Combined  Decision  Error   Misclass
 1    1   18    3    5    7    4   .6   .3   .8     .61        1      1-.61      0
 2    0    4   16   12    3    8   .4   .4   .5     .45        0      0-.45      0
 3    1    3    8    9   12    2   .2   .1   .3     .22        0      1-.22      1
 4    1   11   10    2    7   12   .9   .8   .7     .77        1      1-.77      0

The error sum of squares for this tree combination is .39² + … + .23² and the misclassification rate is ¼. The blue dot in each tree indicates the leaf in which observation 1 lands, based on its X values (features). In tree 1, it lands in a leaf with 60% 1s, so the probability of a 1 is 0.6. The combined result for observation Y1 is .2(.6) + .3(.3) + .5(.8) = .61. The weight for each tree is related to the tree's accuracy, with more accurate trees getting more weight; trees with larger error sums of squares or misclassification rates get less weight. Bagging and random forests use equal weights, but boosting uses weights based on accuracy.

A program to compute the above table:

data a;
   input p1 p2 p3 Y;
   combinedp = .2*p1 + .3*p2 + .5*p3;   /* weighted combination of the three tree probabilities */
   D = (combinedp > 1/2);               /* decision: classify as 1 when the combined probability exceeds 1/2 */
   misclass = 0;
   if D ne Y then misclass = 1;
   error = (Y - combinedp);
   Sq = error*error;
   example + 1;                         /* observation counter */
   datalines;
.6 .3 .8 1
.4 .4 .5 0
.2 .1 .3 1
.9 .8 .7 1
;
proc print;
   sum misclass Sq;
   var example p1 p2 p3 combinedp Y D misclass error Sq;
   Title "Tree weights are .2, .3, and .5";
run;

Methods for obtaining multiple trees: BOOSTING, BAGGING, and RANDOM FORESTS.

Bagging (Bootstrap AGGregation). Sample the data with replacement, so some observations enter more than once and some most likely not at all (almost certainly so in very large training data sets). We sample all variables (columns), but if, say, observations 2, 5, and 7 do not appear in sample 1, they are "out of bag" for that sample and act something like a validation set, though the assessments seem to be pessimistic when computed on the out of bag data. Do this sampling multiple times, resulting in multiple trees (a small sampling sketch in SAS is given after the Random Forests paragraph below). Because we are combining multiple trees, this is often implemented with small trees (shallow depth), so that we are combining multiple weak classifiers in hopes of getting a strong one. By way of analogy, it is easy to tear one page of newsprint in half, but tearing a complete newspaper with a single tear is quite difficult.

[Diagram: two bootstrap samples shown as grids of X marks indicating which observations (rows) enter each sample; all variables (columns) are kept.]

Random Forests. This is perhaps the most popular of these methods. Recall that observations are the rows of our data set, one row for each item (e.g., each credit card applicant), and variables are the columns, so age, gender, salary, etc. are variables. The idea is to take a random sample of both the variables and the observations and grow a tree on this subset. Notice that this is a sort of bagging, but using a different subset of the variables for each "bag." Do this sampling multiple times, resulting in several trees. If the depth exceeds one, new variables can be selected at each split.

[Diagram: a grid of X marks indicating the randomly selected subset of variables (columns) and observations (rows) used for one tree.]
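To make the bagging sampling concrete, here is a minimal sketch using PROC SURVEYSELECT. The training data set name (train), the number of samples, and the seed are assumptions for illustration; a random forest would in addition choose a random subset of the columns for each sample.

proc surveyselect data=train      /* training data (assumed name)                       */
                  out=bags        /* all bootstrap samples stacked in one data set      */
                  method=urs      /* URS = unrestricted sampling, i.e. with replacement */
                  samprate=1      /* each sample the same size as the original data     */
                  reps=5          /* five bootstrap samples, so five trees              */
                  outhits         /* one output row per time an observation is drawn    */
                  seed=12345;
run;
/* The variable Replicate in data set bags identifies the sample; grow one tree per replicate. */
/* Observations that never appear in a given replicate are that sample's "out of bag" cases.  */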

The thinking behind these methods is that by selecting subsets independently, the results will be less correlated, so the final combination is built from trees that are somewhat independent.

Boosting (basic idea). A more detailed reference for this is chapter 10 of The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. For decisions, run a sequence of trees, each time giving increased weight to the observations that were previously misclassified, in hopes of getting them right this time. For estimation (predicted probabilities), take the deviations of the outcomes from the previous model's predicted probabilities (1 − p or 0 − p) for each point. Perhaps there is a pattern to these residuals, so you then grow a tree with weights based on these residuals (focus more on what you missed) to improve on the original prediction. In either case a sequence of trees is computed to classify a point or estimate its probability of yielding an event. Each tree has a weight associated with its accuracy, and a weighted combination is suggested in Tibshirani et al. (p. 300) as the final classifier. Details of the weight computations are given at the end of this document and on page 301 of Tibshirani et al. The sequence on the training data might go too far, so the procedure chooses a model based on an assessment measure in the training data; the associated assessment statistic is also shown for the validation data. There are some variations on this. One variant, gradient boosting, is available from the Model subtab.

A change of notation: in the data mining literature it is often the custom to label events as 1 and nonevents as −1 rather than 0. This does not change the results but has some cosmetic advantages. For example, whether I predict an event or a nonevent, if I am right then multiplying the prediction by the result always gives 1, whereas if I am wrong it always gives −1. Zero, rather than 0.5, now denotes the boundary for classifying. Also, in a leaf with four nonevents and six events, if I average the 10 values I get 6/10 − 4/10 = 0.6 − 0.4 = 0.2, the difference between the highest probability (of event or nonevent) and the next highest one. This difference is defined for multiple levels as well and is referred to as the margin between the most likely and next most likely categories. Questions: Do we want leaves with a high or low margin? Do we want leaves with a high or low Gini coefficient? (Gini is 1 minus the sum of squared category probabilities.) For categories red, green, blue with probabilities 0.5, 0.2, 0.3, what is the margin? The Gini coefficient is 1 − .25 − .04 − .09 = 0.62.

Notice that for all of these methods you end up with multiple trees. How do you then predict observations? For bagging, multiple trees are also combined. The idea of combining multiple models is called ensemble modelling; the models can even be of different types. This will be discussed in a more general setting later. For trees, if you are interested in decisions, one method is to run an observation through all of the trees, as in the book illustration with the yellow and blue dots, and then decide based on the majority decision. This is called the "voting" method of combining models. For estimates, you can just average the probabilities from the models. You can also do a weighted combination, with weights computed from the accuracy of each tree so that trees with a larger error mean square or misclassification rate get less weight; this is typically what is done for the methods discussed here. See the optional weights section below for details.
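To make the voting and averaging concrete, here is a minimal sketch that reuses the three tree probabilities from the four-observation example at the start of this document, this time with equal weights (as bagging and random forests use); the data set and variable names are just for illustration.

/* Minimal sketch of combining three trees with equal weights:          */
/* "voting" on the three decisions versus averaging the probabilities.  */
data combine;
   input p1 p2 p3 Y;
   avg_p  = mean(p1, p2, p3);                  /* average the three tree probabilities       */
   avg_D  = (avg_p > .5);                      /* decide from the averaged probability       */
   votes  = (p1 > .5) + (p2 > .5) + (p3 > .5); /* how many trees individually classify as 1  */
   vote_D = (votes >= 2);                      /* majority vote of the three tree decisions  */
   datalines;
.6 .3 .8 1
.4 .4 .5 0
.2 .1 .3 1
.9 .8 .7 1
;
proc print data=combine;
   var p1 p2 p3 Y avg_p avg_D votes vote_D;
run;

In this small example the equal-weight vote and the averaged probability give the same decisions as the weighted combination computed earlier; with accuracy-based weights, the more reliable trees would simply count for more.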
Because there are many trees, these methods will not produce some of the output that we are used to, such as a tree diagram (there isn't just one tree). In random forests you will also see statistics labelled "out of bag." Recall that each sample most likely does not include all the observations. The observations from the training data that are not included in the sample (the bag) are called out of bag observations, and thus they form something like a validation data set, a different one for each sample. A true validation data set is preferred if available.

Enterprise Miner provides several HP programs, where HP stands for high performance. These have been sped up with programming and by taking advantage of distributed processing when available. If you have several processors, each sample can be assigned to a different processor so that the time for processing all of the samples is about the same as for growing a single tree. The procedure HPFOREST is one such program for random forests.

Here is a quick random forest demo for our veterans data.
(1) Find the HPDM subtab and select the HP Forest icon.
(2) Inspect the properties panel and note that some of the tree defaults are a bit different from those for the ordinary tree node.
(3) Connect the HP Forest node to the data partition node in your veterans diagram.
(4) Run the node and inspect the results. Did it take a long time to run?
(5) Later we will see how to export the score code to score future data sets.
The gain is in accuracy of the results, but we also lose the simple interpretability of a single tree.

Boosting: Similarly, try a gradient boosting approach (Model subtab) on the veterans data. In this demo you will see at least one variable (demcluster) that is character with 54 levels. When you split, one part will involve level A (let's say) of this class variable. For that part, each of the 53 other levels can be included or not, so there are 2⁵³ ways of constructing the part of the split containing level A. This is also the total number of possible splits, except that it includes a "split" in which all the levels are grouped with A, so the total number of feasible splits is 2⁵³ − 1. This variable would have lots of chances to give a high Chi-square (and logworth). You will see that demcluster has importance 1 in the training data and 0 in the validation data for this reason. Use the Metadata node to change the role of demcluster to "rejected." Alternatively, use the Variables ellipsis to not use the variable.
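For reference, outside the Enterprise Miner interface a call to PROC HPFOREST looks roughly like the sketch below. The data set name, the variable names, and the option values are placeholders rather than the actual veterans data; check the HPFOREST documentation for the full list of options.

/* Rough sketch of PROC HPFOREST (placeholder data set and variable names). */
proc hpforest data=train          /* training data (placeholder name)        */
              maxtrees=100        /* number of trees to grow                 */
              vars_to_try=5       /* variables sampled at each split         */
              seed=12345;
   target respond / level=binary;                 /* binary target (placeholder)      */
   input age income gift_count / level=interval;  /* numeric inputs (placeholders)    */
   input home_owner / level=nominal;              /* categorical input (placeholder)  */
run;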

Optional: Weights for individual trees in a boosting sequence using misclassification. Let wi be the weight for observation i in tree m. Start with weight 1/N, where there are N observations. Relabel the two classes using Y = 1 for an event and Y = −1 for a nonevent, and do the same for the classification (classify as 1 or −1). This way Y times the classification is 1 if correct and −1 if incorrect: (1)(1) = 1, (−1)(−1) = 1.

1. Sum the weights wi over all misclassified observations and divide by the sum of all N weights. Call this err_m (for tree m out of a list of M trees). It is between 0 and 1 and thus can be thought of as a weighted probability of misclassification in tree m. You want to minimize err_m as you build tree m.

2. Compute the logit L of the weighted probability of correct classification: L = ln((1 − err_m)/err_m) > ln(1) = 0. It is nonnegative since you would never misclassify more than half of the observations, so err_m ≤ 1/2 and (1 − err_m)/err_m ≥ 1.

3. For the next tree, leave wi the same for correctly classified observations and multiply the old wi by e^L for observations incorrectly classified. This focuses the next tree on the observations its predecessor misclassified. Repeat for m = 1, 2, …, M to get M trees.

4. At the end, consider a point and its set of features. Run that point through all M trees to get each tree's decision. Multiply each tree's decision (1 or −1) by that tree's weight (the L computed in step 2) and add. If the weighted sum is > 0, call the point an event (1); if not, call it a nonevent (−1). (A small numerical sketch of the weight update in steps 1 through 3 is given at the end of this document.) Boosting for estimates follows a similar process.

Bagging weights: For bagging, the idea is to get a column of predictions fj(x) from each tree. Now do a multiple regression of what actually happened, Y, on the columns of predictions f1(x), f2(x), f3(x), …, fm(x) to get a weighted average of the predictions, where the weights are the regression coefficients. You would probably want least squares coefficients that are restricted to all be positive and to add to 1 (a weighted average).
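Here is the promised numerical sketch of the boosting weight update in steps 1 through 3, using a hypothetical toy example with N = 8 observations, two of which tree m misclassified. The data values and data set names are made up for illustration.

/* Toy sketch of one boosting weight update (hypothetical data):             */
/* miss = 1 if tree m misclassified the observation, 0 otherwise.            */
data weights;
   input miss @@;
   w = 1/8;                        /* start every weight at 1/N, here N = 8  */
   datalines;
1 0 0 1 0 0 0 0
;
/* Step 1: weighted misclassification rate err_m = sum of w over misses / sum of all w */
proc sql noprint;
   select sum(miss*w)/sum(w) into :err_m from weights;
quit;
/* Steps 2 and 3: tree weight L = ln((1-err_m)/err_m); multiply w by e^L where miss = 1 */
data new_weights;
   set weights;
   L = log((1 - &err_m)/&err_m);   /* here err_m = 2/8 = .25, so L = ln(3)       */
   if miss = 1 then w = w*exp(L);  /* misclassified weights grow from 1/8 to 3/8 */
run;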