Tree-Based Regression Methods (Part II): Random Forests

July 18, 2020

Introduction

In the previous post, we introduced the concept of decision trees for the purpose of regression in the context of supervised learning. Decision trees are very simple models, which are easy to understand and apply, but which suffer from rather poor predictive performance as they tend to overfit the training data. Without deliberate measures to limit the complexity of constructed trees, we may end up with trees where each leaf contains exactly one training sample. Imposing limits on the tree depth, the minimum number of samples required in a leaf node, or the minimum number of samples needed to split an internal node can all help improve the generalization of trees. However, the performance on unseen data ultimately remains rather poor.

One common way to combat this effect is to consider ensembles of trees, where each tree in the ensemble "votes" on the final prediction. Random forests, the topic of this post, are a popular method in this category based on randomized ensembles. In particular, random forests grow each decision tree of the ensemble in isolation. On the one hand, this means that the construction procedure is simple and easily parallelizable. On the other hand, it means there is no mechanism to gradually improve the performance of the ensemble based on previously constructed trees. In contrast, boosting methods such as AdaBoost or gradient-boosted trees, which we will discuss in upcoming posts, incrementally improve the performance of the ensemble as it grows.

The rest of the post is structured as follows. We first explain how an existing random forest is used to perform prediction on new samples. We then briefly explain how random forests are constructed, before going through a simple Python implementation that builds on the code we discussed last time in the context of our decision tree regressor.

The Python code we will be discussing below can be found under the following tag of the GitHub repository: https://github.com/nkoep/fundamental-ml/tree/v2-random-forest.

Prediction via Ensembles of Trees

As alluded to before, random forests are remarkably simple to apply in practice. Given an ensemble of trained decision tree regressors, ensemble methods such as random forests simply combine the individual predictions into a consensus prediction of the entire ensemble. In a classification task, this is a simple majority vote, i.e., the most commonly predicted class among all trees wins, whereas in the context of regression, the target predictions of all trees are averaged to form the final prediction of the ensemble. More concretely, consider a family of $T$ decision trees $\mathcal{F} = \{f_i \colon \mathbb{R}^p \to \mathbb{R} \mid i = 1, \ldots, T\}$ collectively forming the basis of the random forest.[1] Given an unseen observation $\mathbf{x} \in \mathbb{R}^p$, the random forest regressor now simply returns

$$ \hat{y} = \frac{1}{T} \sum_{f \in \mathcal{F}} f(\mathbf{x}) $$

as its final prediction. And that's all there is to it. Since all trees are created in the same, if randomized, way (the details of which we will discuss next), each tree's prediction contributes equally to the final prediction. This is in stark contrast to some of the more advanced methods we'll be looking at in future posts, where different members of the ensemble might have a stronger influence on the final prediction than others.
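
As a minimal illustration of this averaging step (a sketch only; the hypothetical predict_sample method stands in for whatever per-sample prediction routine the trees expose, and the full implementation follows later in this post):

import numpy as np

def ensemble_predict(trees, x):
    # Average the per-tree predictions for a single sample x.
    return np.mean([tree.predict_sample(x) for tree in trees])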

Seeing the Random Forest for the Decision Trees[2]

In this section, we briefly go over the construction of random forests. Since a random forest is just a collection of trees trained on independently sampled subsets of the training set, the most complicated aspect of constructing a random forest was already discussed in the previous post.

Bootstrapping, Aggregating and Bagging

Since regression trees suffer from rather poor generalization (i.e., high variance in the bias-variance trade-off), the fundamental idea of random forests is to inject some variation into the training procedure. In particular, we consider $T$ different trees, which are all trained on different subsets of the training data. The intuition here is that while each individual tree might slightly overfit the samples it was trained on, averaging the predictions of all individual trees mitigates this effect, which in turn reduces the variance on unseen data.
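
To make this intuition a bit more concrete, a standard back-of-the-envelope calculation (found, for instance, in The Elements of Statistical Learning) assumes that each tree's prediction has variance $\sigma^2$ and that any two trees' predictions have pairwise correlation $\rho$. The variance of the ensemble average is then

$$ \operatorname{Var}\Bigl(\frac{1}{T} \sum_{i=1}^{T} f_i(\mathbf{x})\Bigr) = \rho \sigma^2 + \frac{1 - \rho}{T} \sigma^2, $$

which shrinks as $T$ grows but is ultimately limited by the correlation term $\rho \sigma^2$. This is precisely why random forests go to some lengths to decorrelate the individual trees, as discussed below.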

Randomly sampling the training set with replacement is more commonly referred to as bootstrapping; a training (sub)set created in this way is called a bootstrapped set. The combination of bootstrapping the training set and aggregating the individual predictions to form the ensemble's final prediction is what is generally known as bagging, short for bootstrap aggregating. This principle is not restricted to decision trees but can be applied to any classification or regression method in supervised learning, as the sketch below illustrates.
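
Here is a generic sketch of the bagging recipe (the make_estimator factory and the fit/predict interface are placeholders, not part of any particular library; our concrete random forest implementation follows later in this post):

import numpy as np

def bagging_fit(make_estimator, X, y, n_estimators, rng):
    # Fit n_estimators copies of a base estimator on bootstrapped training sets.
    estimators = []
    for _ in range(n_estimators):
        # Draw len(X) indices uniformly at random *with replacement*.
        indices = rng.integers(len(X), size=len(X))
        estimator = make_estimator()
        estimator.fit(X[indices], y[indices])
        estimators.append(estimator)
    return estimators

def bagging_predict(estimators, X):
    # Aggregate the individual regressors by averaging their predictions.
    return np.mean([estimator.predict(X) for estimator in estimators], axis=0)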

Apart from randomly resampling the training data, random forests inject an additional source of randomness into the training procedure by limiting and randomizing the features individual trees are allowed to consider when splitting internal nodes. This often helps avoid the tendency of trees to split on the same features as other trees in the ensemble due to a few dominant, highly correlated features, even if they are trained on bootstrapped samples. A closely related ensemble method called extremely randomized trees (or extra trees for short) takes things one step further: instead of choosing the best split threshold based on the samples in a node, extra trees randomly generate a set of candidate thresholds and pick the one with the best score. This further decouples individual trees from the (bootstrapped) training set, thus potentially reducing variance. For simplicity, we will limit ourselves to random forests in this post.
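
To illustrate the two flavors of split randomization just described, consider the following sketch (hypothetical helper functions, not the code of any particular library; the first corresponds to the max_features mechanism of random forests, the second to the random thresholds of extra trees):

import numpy as np

def candidate_features(num_features, max_features, rng):
    # Random forests: only consider a random subset of features at each split.
    return rng.choice(num_features, size=max_features, replace=False)

def random_threshold(feature_values, rng):
    # Extra trees: draw the split threshold uniformly at random between the
    # minimum and maximum feature value instead of optimizing it over the samples.
    return rng.uniform(feature_values.min(), feature_values.max())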

Random Sampling of the Training Set

While in classification tasks the number of samples drawn (with replacement) from a training set of size $k$ to form the bootstrapped set is usually around $k/3$, in regression it is more common to draw $k$ samples, i.e., as many samples as the training set contains. This may seem slightly counterintuitive at first glance, as one might assume that the bootstrapped set then almost coincides with the entire training set. Intuitively, however, the probability of selecting every sample exactly once is rather small, so we can generally expect that a certain fraction of samples is chosen multiple times. The question remains, though, how many unique samples are drawn on average.

To frame this question mathematically, let $S$ be a random subset of $[k] := \{1, \ldots, k\}$ generated by picking $k$ values uniformly at random from $[k]$ with replacement. Then $|S|$ is a discrete random variable supported on $[k]$. In order to evaluate $\mathbb{E}|S|$, we'll use a common trick in probability theory by expressing $|S|$ in terms of indicator functions. In particular, denote by $E_i$ the event that $i \in S$. Then $|S| = \sum_{i=1}^{k} \chi_{E_i}$, where $\chi_{E_i}$ is 1 if $E_i$ occurs, and 0 otherwise. By linearity of expectation, this yields

$$ \mathbb{E}|S| = \sum_{i=1}^{k} \mathbb{E}\chi_{E_i} = \sum_{i=1}^{k} \mathbb{P}(E_i) = \sum_{i=1}^{k} \bigl(1 - \mathbb{P}(\bar{E}_i)\bigr). $$

It remains to estimate the probability of the complementary event $\bar{E}_i$, i.e., the event that we never pick element $i$ when drawing $k$ elements from $[k]$. Since each element of $[k]$ is equally likely to be picked, we have by independence of the individual draws that

$$ \mathbb{P}(\bar{E}_i) = \left(\frac{k - 1}{k}\right)^k. $$

Overall, this yields

$$ \mathbb{E}|S| = k \left(1 - \left(\frac{k - 1}{k}\right)^k\right). $$

Moreover, by Hoeffding's inequality the random variable $|S|$ concentrates sharply around its mean. To see this, note that by the previous representation of $|S|$, we have

$$ |S| - \mathbb{E}|S| = \sum_{i=1}^{k} \bigl(\chi_{E_i} - \mathbb{E}\chi_{E_i}\bigr) = \sum_{i=1}^{k} \left(\chi_{E_i} - 1 + \left(\frac{k - 1}{k}\right)^k\right) =: \sum_{i=1}^{k} X_i. $$

Clearly, $\mathbb{E} X_i = 0$ and $|X_i| \leq 1$ almost surely for all $i \in [k]$. (Strictly speaking, the indicators $\chi_{E_i}$ are not independent but negatively associated, for which Hoeffding-type bounds still apply.) Hence Hoeffding's inequality states that for $t > 0$,

$$ \mathbb{P}\bigl(\bigl| |S| - \mathbb{E}|S| \bigr| \geq t\bigr) = \mathbb{P}\left(\left|\sum_{i=1}^{k} X_i\right| \geq t\right) \leq 2 \exp\left(-\frac{t^2}{2k}\right). $$

In other words, the probability that S\card{S} deviates significantly from its mean decays exponentially fast.

Note that with the limit representation $e^x = \lim_{n \to \infty} (1 + x/n)^n$, for large $k$ we roughly have that

$$ \mathbb{E}|S| \approx k (1 - e^{-1}) \approx 0.63212\, k. $$

This means that by drawing $k$ samples uniformly at random from the training set of size $k$ with replacement, on average we will use around 2/3 of the training set to fit each individual tree in the ensemble.
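
A quick simulation confirms this estimate (a standalone sanity check, not part of the regressor code; the value of k is arbitrary):

import numpy as np

k = 500  # arbitrary training set size
rng = np.random.default_rng(0)

# Average fraction of unique indices over many bootstrap draws of size k.
unique_fractions = [
    np.unique(rng.integers(k, size=k)).size / k for _ in range(10_000)
]
print(np.mean(unique_fractions))   # empirically close to 1 - 1/e = 0.632...
print(1 - ((k - 1) / k) ** k)      # exact value of E|S|/k derived above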

Python Implementation of a Random Forest Regressor

We now turn to the Python implementation of a random forest regressor, leveraging the Tree class we wrote last time to implement the core logic of regression trees. As the informal runtime benchmark in the last post alluded to, our implementation is rather slow. In general, one requires $O(kp)$ operations to find the best split in each internal node, where $k$ denotes the number of samples and $p$ the number of features. This will come back to bite us now since we have to train several independent trees instead of just one, albeit with fewer unique samples per tree than before. While we could train each tree in parallel to try to improve runtime performance, a simpler approach to speed things up a bit is through Numba.

Speeding Up Decision Tree Fitting with Numba

Numba is a just-in-time (JIT) compiler for Python that utilizes the LLVM ecosystem to compile Python bytecode to efficient machine code. The dynamic nature of Python makes this process rather difficult if not impossible for arbitrary Python code. For the type of code regularly encountered in scientific computing, however, Numba can often generate incredibly fast code, especially due to its first-class support for NumPy arrays.

In its simplest form, it suffices to decorate a function with the njit decorator imported from the numba package to unlock the benefits of JIT compilation. In reality, however, there are usually a few tweaks and changes necessary to enable Numba to do its magic. In the interest of brevity, we skip any further details at this point and simply refer to the Numba documentation. With a few minor changes to our Tree class (see https://git.io/JJZbz), the runtime of our decision tree example from the previous post improves from 1.7775 to 0.1233 seconds. This puts us in a decent starting position to build our random forest regressor.
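
As a minimal, self-contained illustration of what Numba does (this is not the actual Tree code, just a toy numerical loop decorated with njit):

import numpy as np
from numba import njit

@njit
def sum_of_squares(values):
    # A tight numerical loop that Numba compiles to machine code on first call.
    total = 0.0
    for value in values:
        total += value * value
    return total

print(sum_of_squares(np.random.rand(1_000_000)))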

Random Forest Regressor

We finally turn to the implementation of our RandomForest class. Due to the simple construction procedure of random forests, the implementation is very concise. We begin with the constructor.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

from tree import Tree


class RandomForest(BaseEstimator, RegressorMixin):
    def __init__(self, n_estimators=100, min_samples_split=2,
                 max_features=None, random_state=None):
        self.n_estimators_ = n_estimators
        self.min_samples_split_ = min_samples_split
        self.max_features_ = max_features
        self.random_state_ = random_state

        self._trees = []

The most important parameter, which should generally be optimized via hyperparameter tuning, is n_estimators, controlling the number of regressors in the ensemble. To keep things simple, we ignore any other parameters scikit-learn's RandomForestRegressor supports except for min_samples_split, max_features and random_state. The max_features parameter limits the size of the randomly chosen feature set to consider during internal splits. The random_state parameter lets us seed our random number generator (RNG) with a fixed value to make training reproducible across different runs.

The rest of the implementation is fairly self-explanatory.

    def fit(self, X, y):
        X, y = map(np.array, (X, y))

        rng = np.random.default_rng(self.random_state_)
        num_samples = X.shape[0]

        for _ in range(self.n_estimators_):
            tree = Tree(self.min_samples_split_, self.max_features_, rng)
            # Bootstrap: draw num_samples indices uniformly at random with
            # replacement to form the training set of this tree.
            indices = rng.integers(num_samples, size=num_samples)
            tree.construct_tree(X[indices, :], y[indices])
            self._trees.append(tree)

    def _predict_sample(self, x):
        # Average the predictions of all trees in the ensemble for one sample.
        return np.array(
            [tree.apply_to_sample(x) for tree in self._trees]).mean()

    def predict(self, X):
        if not self._trees:
            raise RuntimeError("Estimator needs to be fitted first")
        return np.array([self._predict_sample(row) for row in np.array(X)])

In the fit method, we instantiate an RNG seeded with random_state, and then generate a bootstrapped dataset for each individual tree. We also change the constructor of the Tree class to accept a reference to our internal RNG in order to randomize the feature set if max_features is specified. The changes to the Tree class's construct_tree method to accommodate random feature selection are also straightforward (see https://git.io/JJnq7).

To see how this implementation fares against scikit-learn's RandomForestRegressor, we train both algorithms with ensembles of 25 trees and limit the number of features to consider in each split (max_features) to 5. As before, we evaluate the performance of both implementations on the Boston house-prices dataset.
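
The benchmark script from the repository is not reproduced here, but a comparison along these lines might look roughly as follows (a sketch only: the module name random_forest, the train/test split, and the fixed seeds are assumptions, and load_boston has since been deprecated in newer scikit-learn releases):

import time

from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

from random_forest import RandomForest

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in (
    ("sklearn", RandomForestRegressor(n_estimators=25, max_features=5,
                                      random_state=0)),
    ("naive", RandomForest(n_estimators=25, max_features=5, random_state=0)),
):
    start = time.time()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print("MAE:", mean_absolute_error(y_test, y_pred))
    print("R^2 score:", r2_score(y_test, y_pred))
    print("Time elapsed:", time.time() - start, "seconds")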

$ python test_random_forest_regressor.py
sklearn
-------
MAE: 2.6788661417322834
R^2 score: 0.6184095609224545
Time elapsed: 0.042606 seconds

naive
-----
MAE: 2.6311585763318037
R^2 score: 0.6216081856563564
Time elapsed: 1.024935 seconds

The results show a clear improvement over the predictive performance of the simple decision tree regressors we considered last time. This confirms that while the estimators in the ensemble might be less optimized individually, randomization and aggregation overall help improve the prediction performance over single trees. Once again, the prediction accuracy of both implementations is very close, with the scikit-learn version having a significant edge in terms of runtime, as expected. However, by leveraging Numba in our Tree class, we manage to bring the training time of our random forest down from around 10 seconds to 1 second.

Closing Remarks

Random forests constitute a simple extension of decision trees that builds on ensembles of independently constructed trees, each trained on a randomly sampled subset of the training set. This makes random forests easy to implement and fairly efficient to train, assuming a fast construction algorithm for single decision trees. By randomizing both the training set and the features considered when splitting internal nodes of the decision trees in the ensemble, random forests tend to reduce overfitting, and thereby improve generalization and reduce variance.

Due to the independence of individual trees in the ensemble, random forests lend themselves well to parallel training, which can be very beneficial for large datasets. However, the independent manner in which individual estimators of the ensemble are trained leaves potential improvements in predictive performance on the table. In the next post, we'll be looking at a classical example of an ensemble method based on boosting, which gradually improves the performance of the ensemble as training progresses.


  1. We emphasize that we don't mean a basis in the linear algebraic sense here. ↩︎

  2. Sorry, I couldn't help myself. ↩︎