Stream flow and P-splines: Using built-in estimates for smoothing

Posted on 28 January 2019 by ecoquant

Mother Brook in Dedham Massachusetts was the first man-made canal in the United States. Dug in 1639, it connects the Charles River at Dedham, to the Neponset River in the Hyde Park section of Boston. It was originally an important source of water for Dedham’s mills. Today it serves as an important tool for flood control on the Charles River.

mb_img_20171216_151349-01

Like several major river features, Mother Brook is monitored by gauges of flow maintained by the U.S. Geological Survey, with careful eyes kept on their data flows by both agencies of the Commonwealth of Massachusetts, like its Division of Ecological Restoration, and by interested private organizations, like the Neponset River Watershed Association and the Charles River Watershed Association. (I am a member of the Neponset River Watershed Association.) The data from these gauges are publicly available.

Such a dataset is a good basis for talking about a non-parametric time series smoothing technique using P-splines (penalized B-splines), an example of local regression, and taking advantage of the pspline package to do it. Since this, like most local regression techniques, demands a choice of a smoothing parameter, this post strongly advocates for pspline as a canonical technique because:

it features a built-in facility for choosing the smoothing parameter, one based upon generalized cross validation,
~~like loess and unlike lowess in R, it permits multiple response vectors and fits all of them simultaneously, and~~
with the appropriate choice in its norder parameter, it permits the estimation of derivatives of the fitted curve as well as the curve itself.

Finally, note that while residuals are not provided directly, they are easy to calculate, as will be shown here.

In fairness, note that loess allows an R formula interface, but both smooth.Pspline and lowess do not. Also, smooth.Pspline is:

intolerant of NA values, and
demands the covariates each be in ascending order.

Note from 2019-01-30

Note that the lack of support by the pspline package for the multivariate case has thrown, so to speak, the gauntlet down, in order to find a replacement. Since I’m the one who, in the moment, is complaining the loudest, the responsibility falls to me. So, accordingly, I commit to devising a suitable replacement. I don’t feel constrained by the P-spline approach or package, although I think it foolish not to use it if possible. Such a facility will be the subject of a future blog post. Also, I’m a little joyful because this will permit me reacquaintance with some of the current FORTRAN language definition, using the vehicle of Simply Fortran, and its calling from R. This is sentimental, since my first programming language was FORTRAN IV on an IBM 1620.

References

J. O. Ramsay, N. Heckman, B. W. Silverman, “Spline smoothing with model-based penalties“, Behavior Research Methods, Instruments. & Computers, 1997, 29(1).99-106.
P. H. C. Eilers, B. D. Marx, M. Durbán, “Twenty years of P-splines“, SORT, July-December 2015, 39(2), 1-38.
T. Krivobokova, G. Kauermann, “A note on penalized spline smoothing with correlated errors“, Journal of the American Statistical Association, 2007, 102(480), 1328-1337.
T. Krivobokova, C. M Crainiceanu, G. Kauermann, “Fast adaptive penalized splines“, Journal of Computational and Graphical Statistics, 2008, 17(1), 1-20.
P. Craven, G. Wahba, “Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation“, Numerische Mathematik, 1979, 31, 377-403.
G. Wahba, Spline Models for Observational Data, Society for Industrial and Applied Mathematics, 1990, CBMS-NSF.

For completeness, consider the AdaptFit package and related SemiPar package which also offer penalized spline smoothing but are limited in their support for multiple responses.

(Update, 2019-01-29)

I re-encountered this paper by Professor Michael Mann from 2004 which addresses many of these issues:

M. E. Mann, “On smoothing potentially non-stationary climate time series“, Geophysical Research Letters, 2004, 31, L07214, doi:10.1029/2004GL019569.

Incidentally, Professor Mann is in part responding to a paper by Soon, Legates, and Baliunas (2004) criticizing estimators of long term temperature trends. The Dr Soon of that trio is the famous one from the Heartland Institute who has been mentioned at this blog before.

The dataset

What’s does stream flow on Mother Brook look like? Here’s eight years of it:

(Click on image for a larger figure, and use browser Back Button to return to blog.)

Smoothing with P-splines, Generalized Cross Validation

Using a cubic spline model, the package pspline finds a smoothing parameter (“spar“) of 0.007 is best, giving a Standard Error of the Estimate (“SEE”) of 0.021:

(Click on image for a larger figure, and use browser Back Button to return to blog.)

Forcing the spline fit to use spar values which are larger, one of 0.5, and one of 0.7 produces a worse fit. This can also be seen in their larger G.C.V criteria, of 228 and of 237, compared with the automatic 185:

(Click on image for a larger figure, and use browser Back Button to return to blog.)

Code

The code for generating these results is shown below.


#
# Mother Brook, P-spline smoothing, with automatic parameter selection.
# Jan Galkowski, bayesianlogic.1@gmail.com, 27th January 2019.
# Last changed 28th January 2019.
#

library(random)   # For external source of random numbers
library(FRACTION) # For is.wholenumber
library(tseries)  # For tsbootstrap
library(pspline)

source("c:/builds/R/plottableSVG.R")

randomizeSeed<- function(external=FALSE)
{
  #set.seed(31415)
  # Futz with the random seed
  if (!external)
  {
    E<- proc.time()["elapsed"]
    names(E)<- NULL
    rf<- E - trunc(E)
    set.seed(round(10000*rf))
  } else
  {
    set.seed(randomNumbers(n=1, min=1, max=10000, col=1, base=10, check=TRUE))
  }
  return( sample.int(2000000, size=sample.int(2000, size=1), replace=TRUE)[1] )
}

wonkyRandom<- randomizeSeed(external=TRUE)

stopifnot( exists("MotherBrookDedham") )

seFromPspline<- function(psplineFittingObject, originalResponses, nb=1000, b=NA)
{
  stopifnot( "ysmth" %in% names(psplineFittingObject) )
  #
  ysmth<- psplineFittingObject$ysmth
  #
  if (is.null(dim(originalResponses)))
  {
    N<- length(which(!is.na(ysmth)))
    stopifnot( length(originalResponses) == N )
  } else
  {
    stopifnot( all( dim(originalResponses) == dim(ysmth) ) )
    N<- nrow(ysmth)
  }
  #
  if (is.na(b))
  {
    b<- round(N/3)
  } else
  {
    stopifnot( is.wholenumber(b) && (4 < b) && ((N/100) < b) )
  }
  #
  R<- originalResponses - ysmth
  #
  # Don't assume errors are not correlated. Use the Politis and Romano stationary
  # bootstrap to obtain estimates of standard deviation(s) and Mean Absolute Deviation(s), 
  # where these are plural of there is more than one response.
  #
  # The standard error of the estimate is then just adjusted for the number of non-NA
  # observations.
  #
  if (is.null(dim(originalResponses)))
  {
    Ny<- 1
    booted.sd<- tsbootstrap(x=R, nb=nb, statistic=function(x) sd(x, na.rm=TRUE), m=1, b=b, type="stationary")
    SD<- mean(booted.sd$statistic)
    SEE<- SD/sqrt(N)
    booted.mad<- tsbootstrap(x=R, nb=nb, statistic=function(x) mad(x, constant=1, na.rm=TRUE), m=1, b=b, type="stationary")
    MAD<- mean(booted.mad$statistic)
  } else
  {
    Ny<- ncol(ysmth)
    SD<- rep(NA, Ny)
    SEE<- rep(NA, Ny)
    MAD<- rep(NA, Ny)
    for (j in (1:Ny))
    {
      nonNA<- which(!is.na(R[,j]))
      booted.sd<- tsbootstrap(x=R[nonNA,j], nb=nb, statistic=function(x) sd(x, na.rm=TRUE), m=1, b=b, type="stationary")
      SD[j]<- mean(booted.sd$statistic)
      SEE[j]<- SD/sqrt(length(nonNA))
      booted.mad<- tsbootstrap(x=R[nonNA,j], nb=nb, statistic=function(x) mad(x, constant=1, na.rm=TRUE), m=1, b=b, type="stationary")
      MAD[j]<- mean(booted.mad$statistic)
    }
  }
  return(list(multivariate.response=!is.null(dim(originalResponses)), number.of.responses=Ny,
              SD=SD, MAD=MAD, SEE=SEE))
}

MotherBrookDedham.nonNA<- which(!is.na(MotherBrookDedham$gauge))
# Note method == 3 is Generalized Cross Validation (Craven and Wahba, 1979), and
# the value of spar is an initial estimate. The choice of norder == 2 is arbitrary.
MotherBrookDedham.fitting<- smooth.Pspline( x=MotherBrookDedham.nonNA, y=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], 
                                            norder=2, spar=0.3, method=3)
# Using 90 days as mean block length, about a quarter of a year
MotherBrookDedham.estimate.bounds<- seFromPspline(psplineFittingObject=MotherBrookDedham.fitting, 
                                                  originalResponses=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], nb=1000, b=91)

fx<- openSVG(root="MotherBrookDedham-RawFlowData-Daily-withSmooth", width=24, height=round(24/2), pointsize=8)

plot(MotherBrookDedham$gauge, type="n", xaxt="n", ylab="mean (over day) cubic feet per second", main="",
     xlab="", cex.lab=2, cex.axis=2, ylim=c(-80, 650))
title(main=sprintf("Raw flow data, Mother Brook at Dedham, agency %s, site %s, fit with cubic smoothing spline", 
      MotherBrookDedham$agency_cd[1], MotherBrookDedham$site_no[1]), 
      cex.main=3, font.main=2, family="Times")     
N<- nrow(MotherBrookDedham)
S<- seq(1, N, 30)
axis(side=1, at=S, line=-13, labels=MotherBrookDedham$datetime[S], las=2, cex.axis=2, font.axis=2, cex.lab=1.5, tick=FALSE)
abline(v=S, lty=6, col="grey")
points(1:N, MotherBrookDedham$gauge, pch=21, cex=1.2, col="blue", bg="blue")
lines(MotherBrookDedham.nonNA, MotherBrookDedham.fitting$ysmth, lwd=1, lty=1, col="green")
text(which.max(MotherBrookDedham.fitting$ysmth), max(MotherBrookDedham.fitting$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("Found smoothing SPAR = %.3f, and G.C.V. value =  %.1f", 
                                   MotherBrookDedham.fitting$spar, MotherBrookDedham.fitting$gcv), family="Helvetica")
text(which.max(MotherBrookDedham.fitting$ysmth), 0.95*max(MotherBrookDedham.fitting$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("SD = %.3f, MAD = %.3f, SEE = %.3f", 
                                   MotherBrookDedham.estimate.bounds$SD, MotherBrookDedham.estimate.bounds$MAD, 
                                   MotherBrookDedham.estimate.bounds$SEE), family="Helvetica")
closeSVG(fx)

# Force the same P-spline to use an arbitrary smoother SPAR by electing method == 1, and setting SPAR = 0.5.
MotherBrookDedham.fitting.p5<- smooth.Pspline( x=MotherBrookDedham.nonNA, y=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], 
                                            norder=2, spar=0.5, method=1)
# Using 90 days as mean block length, about a quarter of a year
MotherBrookDedham.estimate.bounds.p5<- seFromPspline(psplineFittingObject=MotherBrookDedham.fitting.p5, 
                                                  originalResponses=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], nb=1000, b=91)

fx<- openSVG(root="MotherBrookDedham-RawFlowData-Daily-withSmooth-with-SPARp5", width=24, height=round(24/2), pointsize=8)

plot(MotherBrookDedham$gauge, type="n", xaxt="n", ylab="mean (over day) cubic feet per second", main="",
     xlab="", cex.lab=2, cex.axis=2, ylim=c(-80, 650))
title(main=sprintf("Raw flow data, Mother Brook at Dedham, agency %s, site %s, fit with cubic smoothing spline", 
      MotherBrookDedham$agency_cd[1], MotherBrookDedham$site_no[1]), 
      cex.main=3, font.main=2, family="Times")     
N<- nrow(MotherBrookDedham)
S<- seq(1, N, 30)
axis(side=1, at=S, line=-13, labels=MotherBrookDedham$datetime[S], las=2, cex.axis=2, font.axis=2, cex.lab=1.5, tick=FALSE)
abline(v=S, lty=6, col="grey")
points(1:N, MotherBrookDedham$gauge, pch=21, cex=1.2, col="blue", bg="blue")
lines(MotherBrookDedham.nonNA, MotherBrookDedham.fitting.p5$ysmth, lwd=1, lty=1, col="green")
text(which.max(MotherBrookDedham.fitting.p5$ysmth), max(MotherBrookDedham.fitting.p5$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("Found smoothing SPAR = %.3f, and G.C.V. value =  %.1f", 
                                   MotherBrookDedham.fitting.p5$spar, MotherBrookDedham.fitting.p5$gcv), family="Helvetica")
text(which.max(MotherBrookDedham.fitting.p5$ysmth), 0.95*max(MotherBrookDedham.fitting.p5$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("SD = %.3f, MAD = %.3f, SEE = %.3f", 
                                   MotherBrookDedham.estimate.bounds.p5$SD, MotherBrookDedham.estimate.bounds.p5$MAD, 
                                   MotherBrookDedham.estimate.bounds.p5$SEE), family="Helvetica")
closeSVG(fx)

# Force the same P-spline to use an arbitrary smoother SPAR by electing method == 1, and setting SPAR = 0.7.
MotherBrookDedham.fitting.p7<- smooth.Pspline( x=MotherBrookDedham.nonNA, y=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], 
                                            norder=2, spar=0.7, method=1)
# Using 90 days as mean block length, about a quarter of a year
MotherBrookDedham.estimate.bounds.p7<- seFromPspline(psplineFittingObject=MotherBrookDedham.fitting.p7, 
                                                  originalResponses=MotherBrookDedham$gauge[MotherBrookDedham.nonNA], nb=1000, b=91)

fx<- openSVG(root="MotherBrookDedham-RawFlowData-Daily-withSmooth-with-SPARp7", width=24, height=round(24/2), pointsize=8)

plot(MotherBrookDedham$gauge, type="n", xaxt="n", ylab="mean (over day) cubic feet per second", main="",
     xlab="", cex.lab=2, cex.axis=2, ylim=c(-80, 650))
title(main=sprintf("Raw flow data, Mother Brook at Dedham, agency %s, site %s, fit with cubic smoothing spline", 
      MotherBrookDedham$agency_cd[1], MotherBrookDedham$site_no[1]), 
      cex.main=3, font.main=2, family="Times")     
N<- nrow(MotherBrookDedham)
S<- seq(1, N, 30)
axis(side=1, at=S, line=-13, labels=MotherBrookDedham$datetime[S], las=2, cex.axis=2, font.axis=2, cex.lab=1.5, tick=FALSE)
abline(v=S, lty=6, col="grey")
points(1:N, MotherBrookDedham$gauge, pch=21, cex=1.2, col="blue", bg="blue")
lines(MotherBrookDedham.nonNA, MotherBrookDedham.fitting.p7$ysmth, lwd=1, lty=1, col="green")
text(which.max(MotherBrookDedham.fitting.p7$ysmth), max(MotherBrookDedham.fitting.p7$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("Found smoothing SPAR = %.3f, and G.C.V. value =  %.1f", 
                                   MotherBrookDedham.fitting.p7$spar, MotherBrookDedham.fitting.p7$gcv), family="Helvetica")
text(which.max(MotherBrookDedham.fitting.p7$ysmth), 0.95*max(MotherBrookDedham.fitting.p7$ysmth), pos=2, offset=2,
     font=2, cex=2, labels=sprintf("SD = %.3f, MAD = %.3f, SEE = %.3f", 
                                   MotherBrookDedham.estimate.bounds.p7$SD, MotherBrookDedham.estimate.bounds.p7$MAD, 
                                   MotherBrookDedham.estimate.bounds.p7$SEE), family="Helvetica")
closeSVG(fx)

The code is available online here and requires a utility from here.

So, what’s the point?

Having a spline model for a data actually offers a lot. First, the estimate of SEE and MAD give some idea of how accurate prediction using the model might be. With eight years of data, such models are in hand.

Also, having a spline model is the basis for detecting changes in stream flow rates over time. Mother Brook might not be the best example of long run stream flow rates, since the Army Corps can change their policies in how they manage it, but the same kinds of flow time series are available for many other flows in the region.

To the point about changes in flow rates, having a spline model permits estimating derivatives which, in this case, are exactly these values.

Moving on, once several such flows have been modeled using splines, these can serve as the basis for various kinds of regressions, whether on the response side or on the covariates side. For example, is there statistical evidence for a link between stream flows and temperature? The Clausius-Clapeyron relation suggests there should be, at least at the regional and global scale. It would be interesting to examine if it can be seen here.

To me, it would be also interesting to see if some of the riverine connections in the region could be inferred from examination of flow rates alone. Downstream flows see a pulse of water from precipitation and melt, but their pulses are lagged with respect to earlier ones. Sure, one could examine such connections simply by looking at a map, or Google Earth, but there are other hydrological applications where these connections are latent. In particular, connections between subterranean water sources and surface flows might be reveals if these kinds of inferences are applied to them.

(Update, 2019-01-29)

The scholarly literature such as the paper by Professor Mann cited above which critiques and explains that by by Soon, Legates, and Baliunas (2004) shows careful consideration of these techniques matters.

mb_ortho_2019-01-27_180158

About ecoquant

See https://wordpress.com/view/667-per-cm.net/ Retired data scientist and statistician. Now working projects in quantitative ecology and, specifically, phenology of Bryophyta and technical methods for their study.

View all posts by ecoquant →

This entry was posted in American Statistical Association, citizen data, citizen science, Clausius-Clapeyron equation, Commonwealth of Massachusetts, cross-validation, data science, dependent data, descriptive statistics, dynamic linear models, empirical likelihood, environment, flooding, floods, Grant Foster, hydrology, likelihood-free, meteorological models, model-free forecasting, non-mechanistic modeling, non-parametric, non-parametric model, non-parametric statistics, numerical algorithms, precipitation, quantitative ecology, statistical dependence, statistical series, stream flow, Tamino, the bootstrap, time series, water vapor. Bookmark the permalink.