Artificial Intelligence: Notes
  • Supervised Learning
    • Trees
      • AdaBoost
      • ID3
      • Random Forests
        • Algorithm
          • Bagging
          • Random forests
    • Convolutional Neural Networks
    • DNN for Classification
    • K-Nearest Neighbors
    • LDA
    • Linear Regression
    • Logistic Regression
    • Perceptron
    • QDA
    • SVM
  • Unsupervised Learning
    • DBSCAN
    • Deep Autoencoder
    • Generative Adversarial Networks (GAN)
    • K-Means Clustering
    • Principal Component Analysis (PCA)
    • Restricted Boltzmann Machines (RBM)
  • Reinforcement Learning
    • Markov Decision Process
    • Q-Learning
    • Deep Q-Learning
  • Ensemble Strategies
    • Ensemble Learning
    • Fine-tuning and resampling
  • Other Techniques
    • Expectation-Maximization
    • Recurrent Neural Networks

Random forests: bootstrapping for the win

Random forests (also called random decision forests) are an ensemble learning method for classification, regression, and other tasks. A random forest is built by constructing a multitude of decision trees at training time: for classification tasks, its output is the class selected by the most trees, while for regression tasks it returns the mean prediction of the individual trees.
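
To make the aggregation rule concrete, here is a minimal Python sketch; the per-tree predictions are made-up illustrative values, not output from a real forest:

```python
# Aggregating the outputs of an already-trained forest (illustrative values).
from collections import Counter

tree_preds_clf = ["cat", "dog", "cat", "cat", "dog"]  # class votes from 5 trees
tree_preds_reg = [2.1, 1.9, 2.4, 2.0, 2.2]            # outputs from 5 regression trees

# Classification: the class selected by the most trees (majority vote).
majority_class = Counter(tree_preds_clf).most_common(1)[0][0]  # "cat"

# Regression: the mean prediction of the individual trees.
mean_prediction = sum(tree_preds_reg) / len(tree_preds_reg)    # 2.12
```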

Algorithm

Bagging

The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging, to tree learners. Given a training set $X = x_1, \ldots, x_n$ with responses $Y = y_1, \ldots, y_n$, bagging repeatedly ($B$ times) selects a random sample with replacement of the training set and fits trees to these samples:

For $b = 1, \ldots, B$:

  • Sample, with replacement, $n$ training examples from $X, Y$; call these $X_b, Y_b$.
  • Train a classification or regression tree $f_b$ on $X_b, Y_b$.

After training, predictions for an unseen sample $x'$ can be made by averaging the predictions from all the individual regression trees on $x'$:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$
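
As a concrete illustration, here is a minimal sketch of the bagging loop above, using scikit-learn's DecisionTreeRegressor as the base learner; the function names `bagged_trees` and `predict` are illustrative, not part of any library:

```python
# Fit B regression trees on bootstrap samples of (X, Y), then average
# their predictions on new inputs, as in the formula above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(X, Y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)  # sample n examples with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], Y[idx]))
    return trees

def predict(trees, X_new):
    # f_hat(x') = (1/B) * sum_b f_b(x')
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```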

Random forests

The above procedure describes the original bagging algorithm for trees. Random forests add one further layer of randomness: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging". The motivation is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors of the response variable (the target output), these features will be selected in many of the $B$ trees, causing the trees to become correlated.
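
A sketch of that per-split selection, under the assumption that the tree learner calls a helper like the hypothetical `candidate_features` below inside its split search:

```python
# "Feature bagging": each candidate split may only search over a random
# subset of the p features (helper name is hypothetical).
import numpy as np

def candidate_features(p, k, rng):
    # Draw k of the p feature indices, without replacement, for one split.
    return rng.choice(p, size=k, replace=False)

rng = np.random.default_rng(0)
print(candidate_features(p=16, k=4, rng=rng))  # e.g. 4 of the 16 features
```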

Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p} \rfloor$ features are used in each split. For regression problems the inventors recommend $\lfloor p/3 \rfloor$ features with a minimum node size of 5 as the default. In practice, the best values for these parameters should be tuned on a case-by-case basis for every problem.
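
These recommended defaults can be written out as explicit scikit-learn settings; this is a sketch in which `min_samples_leaf` stands in for the inventors' "minimum node size", and the values are set explicitly rather than relying on the library's own defaults:

```python
# The recommended defaults above, as explicit scikit-learn settings.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier(max_features="sqrt")  # floor(sqrt(p)) features per split
reg = RandomForestRegressor(
    max_features=1 / 3,   # floor(p/3) features per split
    min_samples_leaf=5,   # approximates a minimum node size of 5
)
```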
