## Calculating Derivatives from Random Forests

###### (Edits to repair smudges, 2020-06-28, about 0945 EDT. Closing comment, 2020-06-30, 1450 EDT.)

There are lots of ways of learning about mathematical constructs, even about actual machines. One way is to push them into a place of improbable application, asking of them things which seem implausible, and watch how they fail. Indeed, they might not fail.

Many of the engineers from the NASA Apollo era spent their spare time with simulators of the CSM and LEM to develop a “feel” for what these machines could do and when they got into trouble. This allowed them to understand their capabilities beyond their formal specifications. This is quite possibly why Jack Garman was able to give the go-ahead for the Apollo 11 lunar landing, given a section of code was patched out, despite the overload of the LEM’s guidance computer from its rendezvous radar being set erroneously on in addition to the doppler landing radar.

Random forests were originally devised by Kleinberg and Ho (1990-2000), and then extended by Breiman, Cutler, Amit, and Geman. Breiman and Cutler are most frequently associated with these important statistical learning devices. They can be used for classification, regression, and variable selection. They can also be used for clustering.

Regression with random forests can be done on the same data that linear or generalized linear regression are done. Whereas the latter two means of calculation offer some continuity conditions which justify how one might use the outputs of their models to calculate, say, derivatives of responses with respect to predictors, it is less obvious how this might be done with random forests.

So, to set the stage, we begin with an application of random forests to time series. In fact, the series in question is reasonably famous, being the Air Passengers dataset introduced by Box & Jenkins and featured in the R package called forecast which, in its most recent rendition, offers the clever function $\text{auto.arima}$ as a means of automatically building ARIMA models for such series.

The plan is to regress with $\text{auto.arima}$ and regress with random forests, and see how the results compare. Then the next step is to figure out how to calculate derivatives from random forest predictions and then uncertainties in those, specifically prediction intervals. (See also on prediction intervals.)

After this post was written, I discovered the exciting paper …

H. Zhang, J. Zimmerman, D. Nettleton, D. J. Nordman, "Random Forest Prediction Intervals", (2019) The American Statistician, DOI: 10.1080/00031305.2019.1585288.

having this supplement.

It is also realized in the standalone R package rfinterval based upon the excellent ranger package.

In this paper, Zhang, Zimmerman, Nettleton, and Nordman overcome the evidence dearth problem encountered by quantile regression forests when trying to estimate prediction intervals, using a technique closely related to the conformal prediction methods in

J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, L. Wasserman, "Distribution-free predictive inference for regression", (2018) Journal of the American Statistical Association, 113:523, 1094-1111, DOI: 10.1080/01621459.2017.1307116.

Some preliminaries ….

Since the R forecast package has been mentioned, the ranger package should be introduced as the one which will be used to train and test random forests, where, for those unfamiliar, training is a calibration process. Other packages have elements which are critical to this development, namely, the random, coda, and fda.usc packages.

And, incidentally, the code and some supporting files from this blog post are available here.

How can anything be done with a set of dependent data like a time series using random forests? After all, random forest ensemble methods do not in themselves have any inherent applicability to dependent data. Response-covariates sets are considered independent of one another. But, what can be done is mimic the ARIMA setup. Define a set of covariates for a response $y_{t}$ in a series to be the union of the some subset of lagged values of the response, such as $\{y_{t-1}, y_{t-2}, \dots, y_{t-p}\}$, with some subset of lagged moving averages of the response. These are, in fact, for ARIMA, the basic features of the series which are used, along with some spectral theory, to produce the methods ARIMA employs.

In other words, the matrix of candidate covariates for responses looks like, for $p > q$: $\left[ \begin{array} {cccccccccc} y_{p} & y_{p-1} & y_{p-2} & \dots & y_{1} & a_{p,2} & a_{p,3} & \dots & a_{p,q)} & \tilde{\mu} \\ y_{1+p} & y_{p} & y_{p-1} & \dots & y_{2} & a_{(1+p),2} & a_{(1+p),3} & \dots & a_{(1+p),q)} & \tilde{\mu} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ y_{n-1} & y_{n-2} & y_{n-3} & \dots & y_{n-p} & a_{(n-1),2} & a_{(n-1),3} & \dots & a_{(n-1),q} & \tilde{\mu} \\ y_{n} & y_{n-1} & y_{n-2} & \dots & y_{1+n-p} & a_{n,2} & a_{n,3} & \dots & a_{n,q} & \tilde{\mu} \\ \end{array} \right]$ $a_{t,2} = \frac{y_{t-1} + y_{t-2}}{2} \\[0.5cm] a_{t,3} = \frac{y_{t-1} + y_{t-2} + y_{t-3}}{3} \\[0.5cm] \vdots \\[0.5cm] a_{t,q} = \frac{y_{t-1} + y_{t-2} + \dots + y_{1+t-q} + y_{t-q}}{q}$

The $y_{t-\ell}$ terms are the vector of $p$ responses corresponding to response $y_{t}$ but lagged by $\ell$ time steps. The $a_{t,j}$ are averages of response values from $y_{t-1}$ through $y_{t-j}$. $\tilde{\mu}$ is the median of $y_{1}, y_{2}, \dots, y_{n}$.

Why not use ARIMA? Any parametric method brings assumptions to a problem and the point of non-parametric methods, even modern ones like random forests, is to relax the assumptions that otherwise might be needed for estimation and prediction without costing too much. With random forests, there are ancillary benefits, like variable selection. Specifically, rather than choosing an order based upon autocorrelations and the like, as is the practice with random forests work, big a high number of lags and moving averages and let preliminary grows of a random forest indicate which are important for explaining the series.

It’s to be expected there will be drawbacks.

• The amount of computation required will be higher, because of the need for sampling and resampling for constructing trees.
• Because a large number of lags are needed to be considered, at least before variable selection is performed, the first parts of series cannot be analyzed as thoroughly as the remainders. More might be analyzed once variable selection indicates that, at least for the tail of the series, some older points of the series do not inform the present.
• A random forest model may be able to forecast, but it’s capability depends upon many complicated relationships between the in-sample (or training) portion of a series and the out-of-sample (or test) portion. To the degree the ARIMA framework is accepted and, I daresay, works, the premises of the in-sample portion by assumption extend into the out-of-sample portion, and so forecasting can work and be powerful. Should they be violated, however, the ARIMA forecast may be useless because some kind of changepoint is encountered, even if in most systems for which ARIMA works, such violent changes are unlikely. If they are unlikely, there should be some relationship between the character of the in-sample with the out-of-sample. The degree to which these same considerations extend to use of random forests remains, to my knowledge, to be explored. Nonetheless, tuning random forests for out-of-sample performance has produced some notable successes.

Nevertheless, it’s easy to see how random forest approaches to analyzing series might be useful. They are insensitive to natures of distributions, they are flexible to use, and, in contrast with some other statistical learning methods, comparatively transparent. Forests trained on different datasets can be merged and consolidated.

#####  See: G. Dudek, “Short-term load forecasting using random forests”, Intelligent Systems 2014., Springer, Cham, 2015, 821-828. A. Lahouar, J. Ben Hadj Slama, “Hour-ahead wind power forecast based on random forests”, Renewable Energy, 2017, 109, 529-541. W. Y. N. Naing, Z. Z. Htike. “Forecasting of monthly temperature variations using random forests”, APRN J. Eng. Appl. Sci, 2015, 10, 10109-10112.  Because $\tilde{\mu}$ is constant for any given series, it won’t be predictive in any fashion so could have been left out of the covariates matrix. However, were the forest trained on additional series, having differing medians might become predictive. Note that this implies both variable selection and regression with a reduced variable set need to be done on new data series. Note as well merging of forests may have to be done with the full original parent set of variables rather than with the reduced set in that case.

Right out of the gate, here’s how one random forest compares with the result from $\text{auto.arima}$: ###### (To see larger image, click on figure. It will open in a new window.)

The quality is entirely comparable to the ARIMA result. Note a check of the lags and moving averages picked by $\text{auto.arima}$ shows little overlap with what the random forest elects. What that means for the interpretability of Air Passengers behavior for one or the other of the techniques is beyond the scope of this blog post, but it poses an interesting scientific question.

The forest was produced as a result of a seven step process. Note that prior to building forests or any other step, a replica of the original series is made which is centered and scaled. In other words, the mean, $\mu_{y}$, and standard deviation, $\sigma_{y}$, of the original series is calculated, and a transformed version of the series $\hat{y}_{t} = \frac{y_{t} - \mu_{y}}{\sigma_{y}}$ is used in place of the original. This recentering is standard practice for good statistical and numerical properties. Results coming out of the model can be transformed back into the original measure by multiplying them by $\sigma_{y}$ and adding $\mu_{y}$ to the result.

First, a forest of 100,000 trees was grown to ascertain importance of predictors using permutation importance measures. Other details will be given below.

Second, a forest of 100,000 trees was grown to do the regression using the selected set of variables, and resulting in the figure above. The resulting $R^{2} = 0.996$ and the mean-squared-error was $0.0029$.

Third, the forest produced in the second step is used to make predictions for all the time points observed. These are used to produce a baseline series for the estimate of Air Passengers over time, and the estimates for that quantity’s first and second time derivatives.

Fourth, the forest produced in the second step is used to make predictions using each of the trees in the forest for all the time points observed. These are used to develop prediction errors in estimates of quantity, and its first and second time derivatives.

Fifth, whenever derivatives need to be estimated, the B-splines technique in the fda.usc package is used.

Sixth, uncertainty estimates in quantity, and its first and second time derivatives are obtained by bootstrapping the full set of individual predictions in the third step above. Each bootstrap of the (here) 200 bootstrap samples results in 100,000 time series predictions, one for a tree in the forest. (Because bootstrapping entails sampling with replacement the same series might be elected multiple times.) The value of the $k^{\text{th}}$ time point in the resulting series is the mean of all 100,000 $k^{\text{th}}$ points in the bootstrap sample. fda.usc splines are used to obtain smooth quantities, and first and second time derivatives from each of the bootstrap samples. To obtain prediction intervals for each of the time points in the series, the built-in R $\text{lm}$ and $\text{predict.lm}$ functions are used to produce them, using the 200 estimates at each time point.

Seventh, results are compiled and plotted.

#####  The number of variables to possibly split at each node was $\lceil \sqrt{m}\,\rceil$, where $m$ is the number of predictor variables. Initially that was 108 predictors, including 51 lagged responses, 1-51, 53 moving averages, for training windows of sizes 2 to 54, and the median value for the entire series. So, initially, $\lceil \sqrt{m}\,\rceil = 11$. After variable selection, with normalized variable importance scores having to be greater than 0.005, there was 31 variables used for regression. These included 7 lagged responses and 24 moving averages. Lags 1, 2, 3, and 33-38 were considered important by this criterion. The moving average windows of lengths 2-11, and 38-48 were considered important. This means for the regression, $\lceil \sqrt{m}\,\rceil = 6$.The minimum node size specified was 5. The “extra trees” splitting rule was used for both variable selection and regression. Up to 10 random trees per predictor were considered. The depth of trees was unbounded. Inbag errors were kept. 100,000 trees were generated for each forest. ###### (To see larger image, click on figure. It will open in a new window.)

The uncertainties about the predicted trace are difficult to see because they are small. Recall these are prediction errors. To give an idea, the next figure shows a portion of the same with a lens distortion applied: ###### (To see larger image, click on figure. It will open in a new window.)

Estimates of first and second time derivatives show similarly small prediction error clouds. Note how the first and second derivatives line up with changes in the number of Air Passengers over time in the figure below. ###### (To see larger image, click on figure. It will open in a new window.)

The following shows a close-up of the same for a limited interval. ###### (To see larger image, click on figure. It will open in a new window.)

In short, I conclude this technique to be a highly successful approach to estimating derivatives from random forests, even if it is limited to analyzing given series, and not forecasts.

I have changed my mind regarding that last paragraph. The perspective inherent in the regression forest approach above is atypical for many applications of random forests or, for that matter, statistical learning methods. There’s less emphasis upon prediction as a goal, and more on simply characterizing a given dataset. Accordingly, performance of a forest out-of-sample is less important than fidelity to a series of responses in sample. In this respect, random forests are being used very much like spline interpolators and, so, serve as the basis for calculating other numerical aspects of them, such as derivatives.

Also notable is the idea of bootstrapping the forest in order to get a sense of the internal variability in the model. After all, growing the forest from a given dataset does not produce a unique forest. Successive generations of the random forest will differ. Bootstrapping the forest gives some idea how it may have turned out a second time.

This consideration leads to a couple of other questions appropriate for investigation:

1. What is the equivalence class of forests look like which all reproduce a given series with less than some specified mean-squared-error?
2. Given such a class, what is the set of series which are classified by members of the class as being close enough to the original series that the forests will reproduce them with less than some specified mean-squared-error, perhaps the same as the m.s.e. associated with the class, or perhaps larger? 