Rationalists, wearing square hats,
Think, in square rooms, 
Looking at the floor,
Looking at the ceiling.
They confine themselves
To right-angled triangles.
If they tried rhomboids, 
Cones, waving lines, ellipses--
As, for example, the ellipse of the half-
    moon--
Rationalists would wear sombreros.

-- Wallace Stevens, "Six Significant Landscapes"
Quote | Posted on by | Leave a comment

Isaias and Oak Island, NC

h/t to Professor Rob Young, Program for the Study of Developed Shorelines at Western Carolina University, via LinkedIn.

Posted in coastal communities, coastal investment risks, coasts, hurricanes, Robert Young, shorelines | Leave a comment

Got ya!

Non-Tesla drivers seem not to know (or care?) that, because of their self-driving capability, all Teslas have a constellation of cameras all around the vehicle. These not only operate while driving; they also monitor activity around the car when it is parked, and record extra footage whenever anything approaches. They also have some, ahem, creative anti-theft devices built in.

Well, I was returning from a grocery pickup this afternoon in our Tesla Model 3, using the curbside pickup service of Roche Brothers in West Roxbury, MA. Along Route 109 where it crosses Interstate 95, I moved into the right lane, since in Westwood the left lane drops off. I was not really intending to pass the vehicle I did on the right; it's just the speed I was going and the speed it was going. However, a little drama ensued, as you can see below. This was grabbed off the recording of the Tesla dashcam.

I annotated the license plate, but it zips by pretty fast, so I grabbed it.

Posted in Tesla | Leave a comment

“Inferring change points in the spread of COVID-19 reveals the effectiveness of interventions”

J. Dehning et al., Science 369, eabb9789 (2020). DOI: 10.1126/science.abb9789

Source code and data.

Note: This is not a classical approach to assessing strength of interventions using either counterfactuals or other kinds of causal inference. Accordingly, the argument for the effectiveness of interventions is weaker than it might be.

Posted in American Association for the Advancement of Science, American Statistical Association, Bayesian, Bayesian computational methods, causal inference, causation, changepoint detection, coronavirus, counterfactuals, COVID-19, epidemiology, SARS-CoV-2, state-space models, statistical series, time series | Leave a comment

Larkin Poe Interview

#larkinpoe

Posted in Larkin Poe | Leave a comment

Indiara Safir: “Amazing Grace”, “Carlos & Maria”, “Time”

Posted in Blues, Indiara Safir | Leave a comment

Political and demographic associations with changes in COVID-19 rates at the tails

A few days ago, I wrote a post about political and demographic associations with changes in COVID-19 rates over all U.S. counties. Today, I'm augmenting that. Here, rather than considering all counties, I limited the study to counties with demonstrably large increases or decreases in COVID-19 prevalence.

Rather than working with 3083 counties, this works with 742 of them. Essentially the cutoff was that the response calculated had to exceed 7.02 in magnitude:

Two regressions were repeated. The first was a standard linear regression of response versus all the remaining predictors. The improvement in R^{2} compared with the linear regression done with the complete dataset was remarkable: Adjusted R^{2} was 0.47 rather than 0.16. The second was the random forest regression. Here, too, the improvement in R^{2} was substantial: 0.71 versus 0.54 previously.
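As a rough sketch of what those two fits look like in R, assuming a data frame counties that holds only the response (the mean slope of the rescaled case series) and the rescaled covariates; all names here are illustrative rather than taken from the actual scripts:

library(ranger)

# Keep only the counties in the tails of the response distribution.
tails <- counties[abs(counties$response) > 7.02, ]   # 742 of the 3083 counties

# Standard linear regression of response versus the remaining predictors.
lmTails <- lm(response ~ ., data = tails)
summary(lmTails)$adj.r.squared      # adjusted R^2; reported in the post as 0.47

# Random forest regression on the same subset.
rfTails <- ranger(response ~ ., data = tails, num.trees = 40000,
                  splitrule = "extratrees", num.random.splits = 10,
                  min.node.size = 5)
rfTails$r.squared                   # reported in the post as 0.71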

Here’s the result of the linear regression:

And, while the number of variables selected in the importance screening did not change, the ones having a consistent (monotonic) effect did. Those contributing to an increase in COVID-19 became:

  • otherpres16
  • otherhouse16
  • hispanic_pct
  • PerCapitaDollars
  • PctChgFrom2017

Those contributing to a decrease in COVID-19 became:

  • repsen16
  • rephouse16
  • black_pct
  • clf_unemploy_pct
  • lesshs_pct
  • lesshs_whites_pct

Note, for contrast, that the random forests regression based upon all the counties had these variables contributing to an increase:

  • otherpres16
  • otherhouse16
  • hispanic_pct
  • PerCapitaDollars
  • PctChgFrom2017

The random forests regression based upon all the counties had these variables:

  • demhouse16
  • black_pct
  • clf_unemploy_pct
  • lesshs_pct
  • lesshs_whites_pct
  • trump.obama.ratio

contributing to a decrease in COVID-19 prevalence.

The current choices make more sense, in addition to being more statistically notable because of the increased R^{2}. But increase in Republican Senate and House support associated with decrease in COVID-19? Hmmmm.

Posted in Five Thirty Eight, Tamino | Leave a comment

So, today, a diversion …

(Updated 2020-07-06, see end.)

So, today I report on a diversion. This is a bit off the beaten path for me. It resulted from an “I wonder” comment by Rob Honeycutt at Tamino‘s “Open Mind” blog, this time offering a post concerning how predictive the party affiliation of a governor of a state might be for the recent performance of the state with respect to COVID-19 (titled “COVID-19: Red States Blue States Redux”).

Rob supposed:

I would think running the numbers by county would be more accurate, though I guess that’s a much bigger task. CA, for example, is obviously predominantly blue, but there are a lot of small red counties which are now seeing a dramatic rise in cases.

And I replied:

Sounds like a logistic regression problem, although I’d/I’ll do it as a classification problem for a random forest. Anyone know where I can get a tabulation of counties vs Red or Blue status? Otherwise I’ll dig around on the Five Thirty Eight GitHub and see if they have one.

Well I didn’t find anything pertinent at Five Thirty Eight, and I didn’t do it as a classification problem, deciding that after the fact.

I found data sources.

One is the tried-and-true Johns Hopkins Github of COVID-19 cases and deaths which I have used elsewhere here. I’ve also discussed sources of COVID-19 data in general.

A second is the U.S. Bureau of Economic Analysis and, specifically, their county level personal income and demographics dataset.

And, rounding out the collection, there is MIT’s ElectionLab which offers datasets with demographics showing county level results through the 2018 General Elections. Their data dictionary is here. This provided the missing classification of Red vs Blue counties.

However, this also lets me tout a basic principle of good data science and statistics. This is something I learned from the books and lectures of Professor Frank Harrell, Jr of Vanderbilt University in Nashville, TN. In particular, Professor Harrell (and others) emphasize(s) the damage premature dichotomization can do to a statistical analysis.

So, instead of categorizing counties as Red vs Blue, or performance on COVID-19 as good vs bad, I am trying to push everything onto ordered measures and, preferably, onto a continuum measure. Indeed, the more I practice data science, the more I am convinced that, in the setup of experiments or data gathering or analysis, if categorical variables or yes/no variables can be at all avoided, they should be. Rather than measuring whether an Internet customer bought something or not, measure the time they spent considering it! Interpreting the duration of dwell on an item is qualitatively different if they buy it or if they don’t. In the former case, they were willing to devote that amount of time investigating. In the latter, it might be they are meticulous and decided against, or it could be the response was too slow and discouraged them. But even in the meticulous case, it shows they were almost to the point of a buy.

So, for assessing how well a county is doing for COVID-19 or not, I took the counts of cases from the county for the last 45 days. I fit a (penalized) smoothing spline to that. I used the smoothing spline to calculate a first derivative for all the days I could. I took the means of those derivatives. And I made that the score of how poorly or well the county was doing. If the mean was positive, the number of cases is going up, and it is doing poorly. If the mean is negative, cases are decreasing.
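A minimal sketch of that score in R, using the base penalized smoothing spline rather than the fda.usc machinery of the actual code (which appears later in this post); the vector name cases is illustrative:

# `cases` is assumed to be one county's confirmed counts for the last 45 days.
days <- seq_along(cases)
fit <- smooth.spline(days, cases)                 # penalized smoothing spline
score <- mean(predict(fit, days, deriv = 1)$y)    # mean daily slope; positive
                                                  # means cases are increasing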

This all said, why am I clearly drifting my analysis into the world of, oh, politics?

While Five Thirty Eight‘s Nate Silver might claim politics hasn’t as much data available as does sports, a lot of political data exist and are well vetted in comparison to environmental or, as I cringe, Internet data. So it is, in a sense, a joy to work with political data. Most real Internet data, with which in my professional career I have been most deeply associated, needs to be collected by means outside the Internet, and there are interesting investigations and datasets available if that’s done.

Still, a lot of the counts and data pertaining to COVID-19 cases and deaths are fuzzy. We don't necessarily have a reputable tally of sources in all states of, say, the United States, let alone counties, and let alone countries outside the United States. Assessing reputation depends upon knowing the commitment of authorities to presenting counts aligned in time, knowing that the authorities want to be transparent (some don't: I know, that's shocking), and knowing that rules for reporting are consistent, or that efforts are made to keep them consistent even when they are changed. Political authorities tend, from my perspective and for the most part, to want to do as little as possible and expend as little budget as possible to observe minimal reporting goals, if that.

The core idea was to develop a rich matrix of predictors or covariates consisting of demographic and other data pertaining to United States counties and match that with a response variable which expresses the overall (mean) slope of increases in COVID-19 cases. While I know counts of cases of COVID-19 aren’t the best measure, with case density per test or population estimates from random sampling being better, it is a widely available measure, and is the measure Tamino chose to use in his original post, namely cases per day per million population. Then, once matched, do a regression of response against predictors and try to develop a model which can be used for variable selection, that is, determining those covariates which are most predictive of increase in COVID-19 cases.

I chose to use a random forests modeling approach, on purely numerical outcome and predictors. The outcome was, as described above, taking the last 45 days of confirmed COVID-19 counts per county, and fitting them with a smoothing spline. Then the first (time) derivative of that spline was calculated per day of the last 45 days. Then the mean value of the first (time) derivative or slope was calculated. That was used as a response. However, the response was not calculated based upon raw counts of cases. Rather, the re-scaled or, as sometimes described, standardized value of case counts over time was used, lest the greater number of people in counties bias their experiences as being more representative. After all, the point is to get a per county picture.

Rescaling was done to the covariates, too. In this case it had a beneficial effect: some of the 3083(*) counties had missing values for some of the covariates. Tossing them would have reduced the sample size and imposed a bias, because quite a few counties were missing the value for at least one covariate. By rescaling, the mean of the rescaled values became zero, so missing values could be replaced by zeros, and this was equivalent to imputing copies of the mean for the missing values.
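In code, that step amounts to something like the following, where X stands for the matrix of covariates (the name is illustrative):

Xs <- scale(X)         # center and scale each covariate column
Xs[is.na(Xs)] <- 0     # after centering, zero is the column mean, so this
                       # imputes the mean for each missing value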

What were the covariates? They are described in the data dictionary, but I added a couple more. Here’s the complete list:

trump16
Presidential candidate Trump vote total in 2016.
clinton16
Presidential candidate Clinton vote total in 2016.
otherpres16
Presidential candidate vote total in 2016 for non-major party.
romney12
Presidential candidate Romney vote total in 2012.
obama12
Presidential candidate Obama vote total in 2012.
demsen16
Democratic Senate candidate vote total in 2016.
repsen16
Republican Senate candidate vote total in 2016.
othersen16
Non-major parties Senate vote total in 2016.
demhouse16
Democratic House candidate vote total in 2016.
rephouse16
Republican House candidate vote total in 2016.
otherhouse16
Non-major parties candidate vote total in 2016.
total_population
Total population as of 2016.
white_pct
Non-Hispanic whites as a percentage of total population, 2016.
black_pct
Non-Hispanic blacks as a percentage of total population, 2016.
hispanic_pct
Hispanics or Latinos as a percentage of total population, 2016.
nonwhite_pct
Non-whites as a percentage of total population, 2016.
foreignborn_pct
Foreign-born population as a percentage of total population.
female_pct
Females as a percentage of total population, 2016.
age29andunder_pct
Population 29 years or under as a percentage of total population, 2016.
age65andolder_pct
Population 65 years or older as a percentage of total population, 2016.
median_hh_inc
Median household income in the past 12 months (in 2016 inflation-adjusted dollars), 2016.
clf_unemploy_pct
Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
lesshs_pct
Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
lesscollege_pct
Population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
lesshs_whites_pct
White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
lesscollege_whites_pct
White population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
rural_pct
Rural population as a percentage of total population, 2016.
ruralurban_cc
Rural-urban continuum codes. See data dictionary for table of specific code values. These run as positive integers from 1 to 9, with 1 representing, for example, “Counties in metro areas of 1 million population or more” and 9 representing “Completely rural or less than 2,500 urban population, adjacent to a metro area”. These are not continuous values, but they are ordered.
trump.clinton.ratio
This is a newly introduced covariate. Trying to normalize away county population size, this is the ratio of votes in the county in 2016 for presidential candidate Trump to the sum of votes for Clinton and non-major party candidates. Clinton votes are summed with non-majors because some counties recorded no votes for Clinton.
trump.obama.ratio
This is a newly introduced covariate. Trying to normalize away county population size, this is the ratio of votes in the county for presidential candidate Trump in 2016 to the sum of votes for Obama and non-major party candidates in 2012. Obama votes are summed with non-majors because some counties recorded no votes for Obama.
senate.ratio
This is a newly introduced covariate. Trying to normalize away county population size, this is the ratio of votes for Republican Senate candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. Democratic candidates are summed with non-major parties because some counties recorded no votes for Democrats.
house.ratio
This is a newly introduced covariate. Trying to normalize away county population size, this is the ratio of votes for Republican House candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. Democratic candidates are summed with non-major parties because some counties recorded no votes for Democrats.
PerCapitaDollars
This is from BEA government data, 2018. This is the sum of income of residents of the county divided by the number of residents in the county. See the documentation for methodology.
RankInState
This is from BEA government data. This is the rank of total personal income of residents of the county over all counties in the state to which they are part. Number 1 is the county with the biggest income. 2018 data.
PctChgFrom2017
This is the percent change in personal income of residents in a county for 2018 relative to 2017.
(*) Different sources have differing numbers of counties, and these need to be reconciled. Also, some sources, such as the BEA data, have sub-tabulations for regions and states. There are also errors. In the MIT data, Virginia, for example, is listed twice, with different values for, e.g., median household income. Finally, spellings of county names need to be reconciled. A few counties in one list are missing in others. 3083 is the number of counties common to the datasets.

Wright and Ziegler's ranger package for R was used to explore these data and perform a regression. Varying numbers of trees were tried, and the final result was based upon a forest of 40,000 trees. Two cores were allocated on a 3.2 GHz 4-core Windows 7 system. Up to 6 variables were considered for splitting at each node. The minimum node size was 5, meaning there had to be at least 5 cases passing through each node. The splitting rule was "extra trees", based upon the option implementing:

Geurts, P., Ernst, D., Wehenkel, L. (2006). Extremely randomized trees. Machine Learning 63:3-42. https://doi.org/10.1007/s10994-006-6226-1.

10 random splits were considered for each splitting variable.

There were actually two phases of random forest fitting performed. The first was done to ascertain covariate importance. The second was a regression. Both used 40,000 trees.
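A sketch of the two ranger fits, under the settings reported here and below (40,000 trees, up to 6 split candidates per node, minimum node size 5, the "extra trees" splitting rule with 10 random splits, two threads); the data frame names are illustrative:

library(ranger)

# Phase 1: grow a forest to obtain scaled permutation importance, with local
# importance values kept as well.
rfImportance <- ranger(response ~ ., data = countyData,
                       num.trees = 40000, mtry = 6, min.node.size = 5,
                       splitrule = "extratrees", num.random.splits = 10,
                       importance = "permutation",
                       scale.permutation.importance = TRUE,
                       local.importance = TRUE,
                       num.threads = 2)
sort(rfImportance$variable.importance, decreasing = TRUE)

# Phase 2: regression forest on the covariates surviving the importance screen.
rfRegression <- ranger(response ~ ., data = countyDataReduced,
                       num.trees = 40000, mtry = 6, min.node.size = 5,
                       splitrule = "extratrees", num.random.splits = 10,
                       num.threads = 2)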

In the first phase, permutation importance was used, with importance scaled and local importance values calculated as in:

Breiman, L. (2001). Random forests. Machine Learning, 45:5-32. https://doi.org/10.1023/A:1010933404324.

Permutation importance was used because this was a regression application. The resulting normalized importance scores sorted in decreasing order of importance were:

normalized variable importance (in decreasing order)

  demhouse16                0.077281
  black_pct                 0.069349
  othersen16                0.056643
  trump.obama.ratio         0.053579
  demsen16                  0.051434
  repsen16                  0.049935
  senate.ratio              0.044154
  otherhouse16              0.041854
  lesshs_whites_pct         0.037017
  lesshs_pct                0.035589
  hispanic_pct              0.033906
  rephouse16                0.032685
  nonwhite_pct              0.030671
  white_pct                 0.030221
  obama12                   0.028720
  rural_pct                 0.025665
  clf_unemploy_pct          0.025534
  PerCapitaDollars          0.025463
  ruralurban_cc             0.024347
  RankInState               0.024095
  clinton16                 0.023788
  otherpres16               0.020571
  median_hh_inc             0.020259
  trump16                   0.018410
  foreignborn_pct           0.016596
  total_population          0.016517
  romney12                  0.016443
  lesscollege_whites_pct    0.015623
  lesscollege_pct           0.015530
  age65andolder_pct         0.012926
  age29andunder_pct         0.010310
  PctChgFrom2017            0.008777
  female_pct                0.006107
  trump.clinton.ratio       0.000000
  house.ratio               0.000000

Covariates below the 0.02 quantile were discarded from further consideration. This eliminated female_pct, trump.clinton.ratio, and house.ratio.

Regression was done with the remaining. The fit was barely adequate(**), with an R^{2} = 0.536, and a root-mean-squared error of 4.64. Here is the original mass density of response values:

Recall these were based upon slopes derived from centered and scaled case counts:


# fdata() and fdata.deriv() are from the fda.usc package; basisSize() is a
# helper defined elsewhere in the accompanying script.
library(fda.usc)

scaledDerivativesFrom <- function(setOfSeries)
{
  # Expects cases to be in columns, and the series to be column vectors, that
  # is, the successive time points are in successive rows.
  nr <- nrow(setOfSeries)
  nc <- ncol(setOfSeries)
  tocks <- 1:nc
  # Center and scale each series, keeping the centers and scalings so results
  # can be transformed back later.
  scaledSetOfSeries <- scale(setOfSeries)
  centers <- attr(scaledSetOfSeries, "scaled:center")
  scalings <- attr(scaledSetOfSeries, "scaled:scale")
  # Build a functional data object from the rescaled series, as described in
  # the text, and fit B-splines to it.
  fdFromSeries <- suppressWarnings(fdata(mdata=scaledSetOfSeries, argvals=tocks))
  splining <- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=0,
                                           method="bspline", class.out="fdata",
                                           nbasis=basisSize(nr)))
  fst <- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=1,
                                      method="bspline", class.out="fdata",
                                      nbasis=basisSize(nr)))
  # Means of the first derivatives: the slopes used as the response.
  fstDerivatives <- colMeans(fst$data)
  return(list(fst=fstDerivatives, ctr=centers, sca=scalings))
}

\text{fdata} and \text{fdata.deriv} are from the fda.usc package.

After this preparation, the point of the regression was to determine which covariates contribute to increases in COVID-19 prevalence and which reduce it. To determine these, a baseline prediction was made with all rescaled covariates set to their mean values, that is, zeros. Then, because of the scaling, a variation of a single standard deviation up or down corresponded to increasing or decreasing a covariate value by one unit. Accordingly, two predictions of the response were made with all covariates zero apart from a single covariate of interest: one with that covariate at plus one-half and the other at minus one-half. All trees in the regression forest were used for these predictions, and the responses were recorded in each case.

If the responses for each covariate are arranged as minus one-half, baseline, and plus one-half, there are two cases in which the responses are monotonic: monotonically increasing or monotonically decreasing. If the responses increased monotonically, the covariate was contributing to the increase in COVID-19 incidence. If the responses decreased monotonically, the covariate was impeding growth in COVID-19. All other cases were interpreted as equivocal. A full listing of the responses is given below. There were a couple of covariates which had no significant effect upon the response.
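Here is a hedged sketch of that probing procedure, assuming the regression forest from the earlier sketch; all object names are illustrative:

# Predict the response with every rescaled covariate at its mean (zero), and
# with one covariate of interest perturbed down and up by one-half unit.
probeCovariate <- function(rf, covariateNames, target)
{
  base <- as.data.frame(as.list(setNames(rep(0, length(covariateNames)),
                                         covariateNames)))
  lo <- hi <- base
  lo[[target]] <- -0.5
  hi[[target]] <- +0.5
  c(decrease = predict(rf, data = lo)$predictions,
    baseline = predict(rf, data = base)$predictions,
    increase = predict(rf, data = hi)$predictions)
}

# For example, with the regression forest from the earlier sketch:
# covariateNames <- setdiff(names(countyDataReduced), "response")
# probeCovariate(rfRegression, covariateNames, "black_pct")
# A monotonic increase across the three values suggests the covariate is
# associated with increasing COVID-19 prevalence; a monotonic decrease, the
# opposite; anything else is equivocal.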

The code and datasets for this analysis are available here. I did not place these in my Github because some of the datasets and results (including the random forests) are pretty large.

(**) Slightly more than half of the variance was explained. This certainly is not good enough for prediction or forecasting.

In conclusion, most of the covariates did not show a monotonic response. Of those which did, some had surprising effects upon the response, contrary to what would be thought. In all cases, this shows that whatever is happening at the state level, as Tamino originally hypothesized, the story at the county level is far more nuanced.

According to the regression, the increases in the following covariates are associated with increases in COVID-19 incidence:

  • Number of votes for non-major party presidential candidates in 2016.
  • Number of votes for non-major party House candidates in 2016.
  • Hispanics or Latinos as a percentage of total population, 2016.
  • Personal per capita income, 2018.
  • Percent change in per capita income from 2017.

According to the regression, increases in the following covariates are associated with decreases in COVID-19 incidence:

  • Number of votes for Democratic party House candidates in 2016.
  • Non-Hispanic blacks as a percentage of total population, 2016.
  • Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
  • Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
  • White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
  • The ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and non-major party candidates in 2012.

The remainder have non-monotonic effects upon response, with the exception of votes for Trump in 2016, votes for Romney in 2012, and the ratio of Trump votes to sum of Clinton and non-major party Presidential candidates in 2016. The latter three had no significant effect upon response.

These are difficult to interpret, certainly from any conventional, common-wisdom frame. A story could be told about propensity for supporting non-majority political parties as contrarians. A story could be told about wealth, and especially recent wealth. And a more conventional story might be told about support for Democratic House candidates. But the analysis is too weak to build much of a case here. And why is county-level historical support for Trump over Obama and others associated with decreases in COVID-19?

Another analysis, perhaps using Bayesian additive regression trees (BART) or a generalized additive model (GAM), might reveal more. I have included the result of a straight linear regression at the bottom of the post.

The evidence indicates that factors contributing to increases in COVID-19 across counties are more complicated than state political trends suggest.

The table of predicted responses for all regressed covariates is:

covariate                 response, covariate decreased (-1/2)   baseline response   response, covariate increased (+1/2)
trump16 0.585751997130 0.63150294777 0.5393147721024
clinton16 0.636470821891 0.63150294777 0.9463412693325
otherpres16 0.035647289876 0.63150294777 1.2041816952557
romney12 0.559379728900 0.63150294777 0.5688396494840
obama12 0.743567621794 0.63150294777 0.9029509887327
demsen16 0.195284466985 0.63150294777 0.3834763150956
repsen16 0.563399591536 0.63150294777 -0.5634050367543
othersen16 0.482441423494 0.63150294777 -0.2858281838729
demhouse16 1.742885286145 0.63150294777 0.5681492070191
rephouse16 0.390443844733 0.63150294777 0.5296431047217
otherhouse16 0.277166428927 0.63150294777 0.6911570905665
total_population 0.485374266285 0.63150294777 0.4872532869708
white_pct 0.555369233414 0.63150294777 0.3409851865292
black_pct 1.458426115793 0.63150294777 -1.2545806208186
hispanic_pct 0.203970849739 0.63150294777 0.7626913277465
nonwhite_pct 0.306006353575 0.63150294777 0.5322087600115
foreignborn_pct 0.539705099022 0.63150294777 0.4336693730927
female_pct 0.610347729825 0.63150294777 0.4323089313331
age29andunder_pct 0.472453265562 0.63150294777 0.5049536737941
age65andolder_pct 0.403374548917 0.63150294777 0.5885353905113
median_hh_inc 0.456473898331 0.63150294777 0.6011229846262
clf_unemploy_pct 0.761125923726 0.63150294777 -0.0129693833895
lesshs_pct 0.770957215281 0.63150294777 -0.0098785432455
lesscollege_pct 0.556661896877 0.63150294777 0.3748009374647
lesshs_whites_pct 0.749246848692 0.63150294777 -0.1558774570007
lesscollege_whites_pct 0.521495353443 0.63150294777 0.4123318092143
rural_pct 0.602826246670 0.63150294777 0.2588820550810
ruralurban_cc 0.297837745133 0.63150294777 0.3138831021115
trump.clinton.ratio 0.631502947772 0.63150294777 0.6315029477723
trump.obama.ratio 1.037253853907 0.63150294777 0.2651589870262
senate.ratio -0.103308756239 0.63150294777 -0.0583507114855
house.ratio 0.631502947772 0.63150294777 0.6315029477723
PerCapitaDollars -0.111339953111 0.63150294777 0.7183689921657
RankInState 0.439511214174 0.63150294777 0.5414556938257
PctChgFrom2017 -0.525939132513 0.63150294777 0.6810736843380

The following is a result of a linear regression with intercept. Note that the R^{2} is poorer than with the random forest.

Update, 2020-07-06

In the assessment above, I mentioned using BARTs and GAMs as tools for investigating why

  • Number of votes for non-major party presidential candidates in 2016.
  • Number of votes for non-major party House candidates in 2016.
  • Hispanics or Latinos as a percentage of total population, 2016.
  • Personal per capita income, 2018.
  • Percent change in per capita income from 2017.

are associated with increases in COVID-19. After thinking on it for another day, when I next devote time to this problem, I may instead look at determining the causal effect, if any, of these factors upon increases in COVID-19 incidence, perhaps using propensity scoring or related methods. In particular, there are methods in the R package TWANG.

Posted in Five Thirty Eight, Tamino | 1 Comment

Calculating Derivatives from Random Forests

(Edits to repair smudges, 2020-06-28, about 0945 EDT. Closing comment, 2020-06-30, 1450 EDT.)

There are lots of ways of learning about mathematical constructs, even about actual machines. One way is to push them into a place of improbable application, asking of them things which seem implausible, and watch how they fail. Indeed, they might not fail.

Many of the engineers from the NASA Apollo era spent their spare time with simulators of the CSM and LEM to develop a "feel" for what these machines could do and when they got into trouble. This allowed them to understand the machines' capabilities beyond their formal specifications. It is quite possibly why Jack Garman was able to give the go-ahead for the Apollo 11 lunar landing, with a section of code patched out, despite the overload of the LEM's guidance computer caused by its rendezvous radar having erroneously been left on in addition to the Doppler landing radar.

Random forests were originally devised by Kleinberg and Ho (1990-2000), and then extended by Breiman, Cutler, Amit, and Geman. Breiman and Cutler are most frequently associated with these important statistical learning devices. They can be used for classification, regression, and variable selection. They can also be used for clustering.

Regression with random forests can be done on the same data that linear or generalized linear regression are done. Whereas the latter two means of calculation offer some continuity conditions which justify how one might use the outputs of their models to calculate, say, derivatives of responses with respect to predictors, it is less obvious how this might be done with random forests.

So, to set the stage, we begin with an application of random forests to time series. In fact, the series in question is reasonably famous, being the Air Passengers dataset introduced by Box & Jenkins and featured in the R package called forecast which, in its most recent rendition, offers the clever function \text{auto.arima} as a means of automatically building ARIMA models for such series.

The plan is to regress with \text{auto.arima} and regress with random forests, and see how the results compare. Then the next step is to figure out how to calculate derivatives from random forest predictions and then uncertainties in those, specifically prediction intervals. (See also on prediction intervals.)

Some preliminaries ….

Since the R forecast package has been mentioned, the ranger package should be introduced as the one which will be used to train and test random forests, where, for those unfamiliar, training is a calibration process. Other packages have elements which are critical to this development, namely, the random, coda, and fda.usc packages.

And, incidentally, the code and some supporting files from this blog post are available here.

How can anything be done with a set of dependent data like a time series using random forests? After all, random forest ensemble methods do not in themselves have any inherent applicability to dependent data: response-covariate sets are considered independent of one another. But what can be done is to mimic the ARIMA setup. Define the set of covariates for a response y_{t} in a series to be the union of some subset of lagged values of the response, such as \{y_{t-1}, y_{t-2}, \dots, y_{t-p}\}, with some subset of lagged moving averages of the response. These are, in fact, for ARIMA, the basic features of the series which are used, along with some spectral theory, to produce the methods ARIMA employs.

In other words, the matrix of candidate covariates for responses looks like, for p > q:

\left[ \begin{array}{cccccccccc} y_{p} & y_{p-1} & y_{p-2} & \dots & y_{1} & a_{p,2} & a_{p,3} & \dots & a_{p,q} & \tilde{\mu} \\ y_{1+p} & y_{p} & y_{p-1} & \dots & y_{2} & a_{(1+p),2} & a_{(1+p),3} & \dots & a_{(1+p),q} & \tilde{\mu} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ y_{n-1} & y_{n-2} & y_{n-3} & \dots & y_{n-p} & a_{(n-1),2} & a_{(n-1),3} & \dots & a_{(n-1),q} & \tilde{\mu} \\ y_{n} & y_{n-1} & y_{n-2} & \dots & y_{1+n-p} & a_{n,2} & a_{n,3} & \dots & a_{n,q} & \tilde{\mu} \end{array} \right]

a_{t,2} = \frac{y_{t-1} + y_{t-2}}{2} \\[0.5cm] a_{t,3} = \frac{y_{t-1} + y_{t-2} + y_{t-3}}{3} \\[0.5cm] \vdots \\[0.5cm] a_{t,q} = \frac{y_{t-1} + y_{t-2} + \dots + y_{1+t-q} + y_{t-q}}{q}

The y_{t-\ell} terms are the vector of p responses corresponding to response y_{t} but lagged by \ell time steps. The a_{t,j} are averages of response values from y_{t-1} through y_{t-j}. \tilde{\mu} is the median of y_{1}, y_{2}, \dots, y_{n}[2].
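As a sketch of how such a covariate matrix might be assembled for a single series y, with p lagged responses, moving averages for windows of widths 2 through q, and the series median (the function and object names here are illustrative, not taken from the post's code):

lagMatrix <- function(y, p, q)
{
  n <- length(y)
  rows <- (max(p, q) + 1):n                 # rows with a full history available
  lags <- sapply(1:p, function(l) y[rows - l])
  colnames(lags) <- paste0("lag", 1:p)
  mas <- sapply(2:q, function(w)
                sapply(rows, function(t) mean(y[(t - w):(t - 1)])))
  colnames(mas) <- paste0("ma", 2:q)
  data.frame(response = y[rows], lags, mas, medianY = median(y))
}

# For example, with the Box & Jenkins series:
# covariates <- lagMatrix(as.numeric(AirPassengers), p = 51, q = 54)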

Why not use ARIMA? Any parametric method brings assumptions to a problem, and the point of non-parametric methods, even modern ones like random forests, is to relax the assumptions that otherwise might be needed for estimation and prediction, without costing too much. With random forests there are ancillary benefits, like variable selection. Specifically, rather than choosing an order based upon autocorrelations and the like, as is the usual practice, with random forests one can pitch in a large number of lags and moving averages and let preliminary growths of a random forest indicate which are important for explaining the series.

It’s to be expected there will be drawbacks.

  • The amount of computation required will be higher, because of the need for sampling and resampling for constructing trees.
  • Because a large number of lags are needed to be considered, at least before variable selection is performed, the first parts of series cannot be analyzed as thoroughly as the remainders. More might be analyzed once variable selection indicates that, at least for the tail of the series, some older points of the series do not inform the present.
  • A random forest model may be able to forecast, but its capability depends upon many complicated relationships between the in-sample (or training) portion of a series and the out-of-sample (or test) portion. To the degree the ARIMA framework is accepted and, I daresay, works, the premises of the in-sample portion by assumption extend into the out-of-sample portion, so forecasting can work and be powerful. Should those premises be violated, however, the ARIMA forecast may be useless because some kind of changepoint is encountered, even if, in most systems for which ARIMA works, such violent changes are unlikely. If they are unlikely, there should be some relationship between the character of the in-sample and out-of-sample portions. The degree to which these same considerations extend to the use of random forests remains, to my knowledge, to be explored. Nonetheless, tuning random forests for out-of-sample performance has produced some notable successes[1].

Nevertheless, it's easy to see how random forest approaches to analyzing series might be useful. They are insensitive to the nature of the underlying distributions, they are flexible to use, and, in contrast with some other statistical learning methods, they are comparatively transparent. Forests trained on different datasets can be merged and consolidated.

[1] See:

  • G. Dudek, “Short-term load forecasting using random forests”, Intelligent Systems 2014., Springer, Cham, 2015, 821-828.
  • A. Lahouar, J. Ben Hadj Slama, “Hour-ahead wind power forecast based on random forests”, Renewable Energy, 2017, 109, 529-541.
  • W. Y. N. Naing, Z. Z. Htike. “Forecasting of monthly temperature variations using random forests”, APRN J. Eng. Appl. Sci, 2015, 10, 10109-10112.

[2] Because \tilde{\mu} is constant for any given series, it won’t be predictive in any fashion so could have been left out of the covariates matrix. However, were the forest trained on additional series, having differing medians might become predictive. Note that this implies both variable selection and regression with a reduced variable set need to be done on new data series. Note as well merging of forests may have to be done with the full original parent set of variables rather than with the reduced set in that case.

Right out of the gate, here’s how one random forest compares with the result from \text{auto.arima}:


The quality is entirely comparable to the ARIMA result. Note a check of the lags and moving averages picked by \text{auto.arima} shows little overlap with what the random forest elects. What that means for the interpretability of Air Passengers behavior for one or the other of the techniques is beyond the scope of this blog post, but it poses an interesting scientific question.

The forest was produced as the result of a seven-step process. Note that prior to building forests, or any other step, a replica of the original series is made which is centered and scaled. In other words, the mean, \mu_{y}, and standard deviation, \sigma_{y}, of the original series are calculated, and a transformed version of the series, \hat{y}_{t} = \frac{y_{t} - \mu_{y}}{\sigma_{y}}, is used in place of the original. This centering and scaling is standard practice for good statistical and numerical properties. Results coming out of the model can be transformed back into the original measure by multiplying them by \sigma_{y} and adding \mu_{y} to the result.

First, a forest of 100,000 trees was grown to ascertain importance of predictors using permutation importance measures. Other details will be given below[3].

Second, a forest of 100,000 trees was grown to do the regression using the selected set of variables, and resulting in the figure above. The resulting R^{2} = 0.996 and the mean-squared-error was 0.0029.

Third, the forest produced in the second step is used to make predictions for all the time points observed. These are used to produce a baseline series for the estimate of Air Passengers over time, and the estimates for that quantity’s first and second time derivatives.

Fourth, the forest produced in the second step is used to make predictions using each of the trees in the forest for all the time points observed. These are used to develop prediction errors in estimates of quantity, and its first and second time derivatives.

Fifth, whenever derivatives need to be estimated, the B-splines technique in the fda.usc package is used.

Sixth, uncertainty estimates in the quantity, and in its first and second time derivatives, are obtained by bootstrapping the full set of individual tree predictions from the fourth step above. Each of the (here) 200 bootstrap samples resamples the 100,000 per-tree time series predictions, one per tree in the forest. (Because bootstrapping entails sampling with replacement, the same series might be elected multiple times.) The value of the k^{\text{th}} time point in the resulting series is the mean of all 100,000 k^{\text{th}} points in the bootstrap sample. fda.usc splines are used to obtain smooth quantities, and first and second time derivatives, from each of the bootstrap samples. To obtain prediction intervals for each of the time points in the series, the built-in R \text{lm} and \text{predict.lm} functions are used, applied to the 200 estimates at each time point.

Seventh, results are compiled and plotted.
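A hedged sketch of steps four through six, restricted to the first derivative, assuming a fitted ranger forest rf, its covariate data frame covariates, and the basisSize() helper from the accompanying code; the remaining names are illustrative:

library(ranger)
library(fda.usc)

# Step four: per-tree predictions; a matrix of (time points) x (trees).
perTree <- predict(rf, data = covariates, predict.all = TRUE)$predictions

# Step six: bootstrap the trees, average at each time point, and take spline
# first derivatives of each bootstrapped mean series.
B <- 200
nTrees <- ncol(perTree)
bootFirstDeriv <- replicate(B, {
  pick <- sample(nTrees, nTrees, replace = TRUE)
  meanSeries <- rowMeans(perTree[, pick, drop = FALSE])
  fd <- fdata(mdata = matrix(meanSeries, nrow = 1),
              argvals = seq_along(meanSeries))
  as.numeric(fdata.deriv(fd, nderiv = 1, method = "bspline",
                         nbasis = basisSize(length(meanSeries)))$data)
})
# bootFirstDeriv has one column per bootstrap sample; lm() and predict.lm()
# can then give pointwise prediction intervals from the 200 estimates at each
# time point, as in step six.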

[3] The number of variables to possibly split at each node was \lceil \sqrt{m}\,\rceil, where m is the number of predictor variables. Initially there were 108 predictors, including 51 lagged responses (lags 1-51), 53 moving averages (for training windows of sizes 2 to 54), and the median value for the entire series. So, initially, \lceil \sqrt{m}\,\rceil = 11. After variable selection, requiring normalized variable importance scores greater than 0.005, there were 31 variables used for regression. These included 7 lagged responses and 24 moving averages. Lags 1, 2, 3, and 33-38 were considered important by this criterion. Moving average windows of lengths 2-11 and 38-48 were considered important. This means for the regression \lceil \sqrt{m}\,\rceil = 6.

The minimum node size specified was 5. The "extra trees" splitting rule was used for both variable selection and regression. Up to 10 random splits per splitting variable were considered. The depth of trees was unbounded. In-bag errors were kept. 100,000 trees were generated for each forest.


The uncertainties about the predicted trace are difficult to see because they are small. Recall these are prediction errors. To give an idea, the next figure shows a portion of the same with a lens distortion applied:


Estimates of first and second time derivatives show similarly small prediction error clouds. Note how the first and second derivatives line up with changes in the number of Air Passengers over time in the figure below.


The following shows a close-up of the same for a limited interval.


In short, I conclude this technique to be a highly successful approach to estimating derivatives from random forests, even if it is limited to analyzing given series, and not forecasts.

I have changed my mind regarding that last paragraph. The perspective inherent in the regression forest approach above is atypical for many applications of random forests or, for that matter, statistical learning methods. There’s less emphasis upon prediction as a goal, and more on simply characterizing a given dataset. Accordingly, performance of a forest out-of-sample is less important than fidelity to a series of responses in sample. In this respect, random forests are being used very much like spline interpolators and, so, serve as the basis for calculating other numerical aspects of them, such as derivatives.

Also notable is the idea of bootstrapping the forest in order to get a sense of the internal variability in the model. After all, growing the forest from a given dataset does not produce a unique forest. Successive generations of the random forest will differ. Bootstrapping the forest gives some idea how it may have turned out a second time.

This consideration leads to a couple of other questions appropriate for investigation:

  1. What does the equivalence class of forests which all reproduce a given series with less than some specified mean-squared-error look like?
  2. Given such a class, what is the set of series which are classified by members of the class as being close enough to the original series that the forests will reproduce them with less than some specified mean-squared-error, perhaps the same as the m.s.e. associated with the class, or perhaps larger?
Posted in bridge to somewhere, Calculus, dependent data, dynamic generalized linear models, dynamical systems, ensemble methods, ensemble models, filtering, forecasting, hierarchical clustering, linear regression, model-free forecasting, Monte Carlo Statistical Methods, non-mechanistic modeling, non-parametric model, non-parametric statistics, numerical algorithms, prediction, R statistical programming language, random forests, regression, sampling, splines, statistical learning, statistical series, statistics, time derivatives, time series | Leave a comment

Great podcast: “Confronting uncertainty with Tamsin Edwards”

Dr Tamsin Edwards visits Professor David Spiegelhalter on his “Risky Talk” podcast.

Dr Edwards is a climate scientist and Senior Lecturer in Physical Geography at King's College London.

There’s much good talk about climate and its associated uncertainties, communicating uncertainty in science, and dealing with uncertainty relating to health in your personal life.

Dr Edwards also writes a blog, All Models Are Wrong.

Posted in alternatives to the Green New Deal, American Association for the Advancement of Science, climate change, climate denial, climate education, climate policy, climate science, David Spiegelhalter, dynamical systems, fluid dynamics, games of chance, global warming, global weirding, IPCC, model comparison, risk, Risky Talk, statistical models, statistical series | Leave a comment

Larkin Poe

Awesome.

#larkinpoe

Posted in Larkin Poe | Leave a comment

Listen!

Link | Posted on by | Leave a comment

COVID-19 statistics, a caveat : Sources of data matter

There are a number of available sources of COVID-19-related demographics, cases, deaths, numbers testing positive, numbers recovered, and numbers testing negative. Many of these are not consistent with one another. One could hope that at least rates would be consistent, but in a moment I'll show evidence that even those cannot be counted on.

The Johns Hopkins Coronavirus Tracker site is a major project displaying COVID-19-related statistics. They also put their datasets and series on a Github site, which is where I draw data for presentation and analyses here. I do that because I’ve read and understood their approach and methodology, and I’m sticking by the same source over time.

There are other sources of data, including, for example, a Github site operated by the New York Times, and the COVID Tracking Project which is operated in association with Atlantic Magazine. The COVID Tracking Project, as far as I know, does not have a Github, but they offer a data portal.

Here I'm going to compare the series of counts of deaths in the United States from the Johns Hopkins Coronavirus Tracker and from the COVID Tracking Project. The two don't have completely consistent coverage: Johns Hopkins reports lag those reported by COVID Tracking, probably because, as I've heard in interviews with some of the principals, they want to check and double-check the counts before releasing them. I imagine all data sites undergo some vetting of counts.

But other complexities intervene. There are questions about whose death statistics a site wants to count: reports from state governments, results released to the U.S. National Center for Health Statistics by coroners, independent assessments from universities, and so on. Do you count presumed COVID-19 deaths, or only deaths among those actually tested and confirmed positive? Do you get counts from the U.S. Centers for Disease Control?

There is no completely correct way here. The choices are different, some may be better than others, and, worse, some of the choices may be better than others at different times. Counting deaths from a pandemic, like doing the U.S. Census, can be a politically charged proposition, if not at the federal level, then in U.S. states. Cities might differ from their states in the estimates of their own reported deaths, perhaps because of different counting rules, or perhaps because they are using the counts to manage the pandemic and want them to reflect the local character of their population.

There are also differences in reporting times. It has long been known that Tuesdays seem to have a blip of reported deaths, and weekends a paucity, but that cannot be real. It is an artefact of how the tabulation of deaths is done.

But I was surprised that not only does Johns Hopkins differ from the COVID Tracking Project, the disparity grows over time.

So this shows that the COVID Tracking Project exhibits a growing deficit in deaths versus Johns Hopkins over time. But a closer look not only reveals a deficit, it reveals dynamics. Here’s a time plot of Johns Hopkins U.S. death counts less COVID Tracking Project:

A deficit I can understand, but wiggles and bumps are harder to grok.
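The comparison itself is simple once the two series are in hand. A minimal sketch, assuming the cumulative U.S. death counts from each source have already been read into data frames with date and deaths columns (the object and column names are illustrative):

both <- merge(jhuDeaths, covidTrackingDeaths, by = "date",
              suffixes = c(".jhu", ".ctp"))
both$gap <- both$deaths.jhu - both$deaths.ctp          # JHU less COVID Tracking
plot(both$date, both$gap, type = "l", xlab = "date",
     ylab = "Johns Hopkins deaths less COVID Tracking Project deaths")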

So, in addition to debates about the quality of tests, and whether there are excess deaths whose causes were misdiagnosed, there appears to be a major problem with data quality. That's not to say that any source is wrong, simply that there are few standards on what should be counted and how, and on how states should count basic things, like cause of death. This not only has implications for COVID-19 policy; if there is such variation here, one needs to wonder how good standard actuarial determinations from the government are, when they rely upon local sources for basic determinations.

There are ways to do this kind of thing when many sources are used. These include weighting sources depending upon long-term performance, a bit like Five Thirty Eight does with political opinion polls of Presidential popularity. The other approach, of course, would be to use stratified sampling or cluster sampling to estimate some of these quantities, or something else entirely.

Most importantly, it seems key to me that if inferences and forecasts are to be made with these data, the latent number of deaths per unit time needs to be considered a latent function sampled by these complicated processes, and it needs to be recovered first. No doubt that will be done with error, but large scale week-to-week policy should certainly not be predicated upon considering tallies and counts of these important statistics as literally accurate.

They are compiled by well-meaning, intelligent and educated people, but they are, it seems, under the exigencies of this pandemic, very dirty. There’s a lot here going on which isn’t neatly categorized as one of the usual problems seen in count data regression.

Posted in coronavirus, count data regression, COVID-19, descriptive statistics, epidemiology, pandemic, policy metrics, politics, population biology, population dynamics, quantitative biology, quantitative ecology, sampling, SARS-CoV-2, statistical ecology, statistical series, statistics | 1 Comment

First substantial mechanism for long term immunity from SARS-CoV-2 : T-cells

M. Leslie, “T cells found in COVID-19 patients ‘bode well’ for long-term immunity“, Science, doi:10.1126/science.abc8120.

A. Grifoni, et al, “Targets of T cell responses to SARS-CoV-2 coronavirus in humans with COVID-19 disease and unexposed individuals“, Cell, 14th May 2020.

J. Braun, L. Loyal, et al, “Presence of SARS-CoV-2 reactive T cells in COVID-19 patients and healthy donors“, medRχiv (not peer-reviewed, with comments), 22 April 2020.

34% of non-COVID-19 patients in Germany already had helper T-cells specific to the coronavirus spike protein, most likely due to prior infection with common-cold coronaviruses. That number is close to the percentage of "asymptomatic cases" we are seeing.

Posted in American Association for the Advancement of Science, coronavirus, COVID-19, epidemiology, SARS-CoV-2 | Tagged | Leave a comment

Larkin Poe: “Southern Cross” cover, from CSNY (and extras)

(Theme from Wallander.)

Posted in Larkin Poe | Leave a comment

Dissection of the Dr Judy Mikovits’ claims in AAAS Science

https://www.sciencemag.org/news/2020/05/fact-checking-judy-mikovits-controversial-virologist-attacking-anthony-fauci-viral

h/t Dr Katharine Hayhoe @LinkedIn

The Chronic Fatigue Syndrome retraction notice.

Excerpt:

Science asked Mikovits for an interview for this article. She responded by sending an empty email with, as attachments, a copy of her new book and a PowerPoint of a 2019 presentation titled “Persecution and Coverup.”

Below are some of the video’s main claims and allegations, along with the facts.

Interviewer: Dr. Judy Mikovits has been called one of the most accomplished scientists of her generation.

Mikovits had authored 40 scientific papers and wasn’t widely known in the scientific community before she published the 2009 Science paper claiming a link between a new retrovirus and CFS. The paper was later proven erroneous and retracted.

Interviewer: Her 1991 doctoral thesis revolutionized the treatment of HIV/AIDS.

Mikovits’s Ph.D. thesis, “Negative Regulation of HIV Expression in Monocytes,” had no discernible impact on the treatment of HIV/AIDS.

Interviewer: At the height of her career, Dr. Mikovits published a blockbuster article in the journal Science. The controversial article sent shock waves through the scientific community, as it revealed that the common use of animal and human fetal tissues was unleashing devastating plagues of chronic diseases.

The paper revealed nothing of the sort; it only claimed to show a link between one condition, CFS, and a mouse retrovirus.

.
.
.

Mikovits: Heads of our entire HHS [Department of Health and Human Services] colluded and destroyed my reputation and the Department of Justice and the [Federal Bureau of Investigation] sat on it, and kept that case under seal.

Mikovits has presented no direct evidence that HHS heads colluded against her.

Mikovits: [Fauci] directed the cover-up. And in fact, everybody else was paid off, and paid off big time, millions of dollars in funding from Tony Fauci and … the National Institute of Allergy and Infectious Diseases. These investigators that committed the fraud, continue to this day to be paid big time by the NIAID.

It’s not clear which fraud and what cover-up Mikovits is talking about exactly. There is no evidence that Fauci was involved in a cover-up or that anyone was paid off with funding from him or his institute. No one has been charged with fraud in relation to Mikovits’s allegations.

Mikovits: It started really when I was 25 years old, and I was part of the team that isolated HIV from the saliva and blood of the patients from France where [virologist Luc] Montagnier had originally isolated the virus. … Fauci holds up the publication of the paper for several months while Robert Gallo writes his own paper and takes all the credit, and of course patents are involved. This delay of the confirmation, you know, literally led to spreading the virus around, you know, killing millions.

At the time of HIV’s discovery, Mikovits was a lab technician in Francis Ruscetti’s lab at NCI and had yet to receive her Ph.D. There is no evidence that she was part of the team that first isolated the virus. Her first published paper, co-authored with Ruscetti, was on HIV and published in May 1986, 2 years after Science published four landmark papers that linked HIV (then called HTLV-III by Gallo’s lab) to AIDS. Ruscetti’s first paper on HIV appeared in August 1985. There is no evidence that Fauci held up either paper or that this led to the death of millions.

Interviewer: If we activate mandatory vaccines globally, I imagine these people stand to make hundreds of billions of dollars that own the vaccines.

Mikovits: And they’ll kill millions, as they already have with their vaccines. There is no vaccine currently on the schedule for any RNA virus that works.

Vaccines have not killed millions; they have saved millions of lives. Many vaccines that work against RNA viruses are on the market, including for influenza, measles, mumps, rubella, rabies, yellow fever, and Ebola.

.
.
.

Interviewer: Do you believe that this virus [SARS-CoV-2] was created in the laboratory?

Mikovits: I wouldn’t use the word created. But you can’t say naturally occurring if it was by way of the laboratory. So it’s very clear this virus was manipulated. This family of viruses was manipulated and studied in a laboratory where the animals were taken into the laboratory, and this is what was released, whether deliberate or not. That cannot be naturally occurring. Somebody didn’t go to a market, get a bat, the virus didn’t jump directly to humans. That’s not how it works. That’s accelerated viral evolution. If it was a natural occurrence, it would take up to 800 years to occur.

Scientific estimates suggest the closest virus to SARS-CoV-2, the virus that causes COVID-19, is a bat coronavirus identified by the Wuhan Institute of Virology (WIV). Its “distance” in evolutionary time to SARS-CoV-2 is about 20 to 80 years. There is no evidence this bat virus was manipulated.

Interviewer: And do you have any ideas of where this occurred?

Mikovits: Oh yeah, I’m sure it occurred between the North Carolina laboratories, Fort Detrick, the U.S. Army Medical Research Institute of Infectious Diseases, and the Wuhan laboratory.

There is no evidence that SARS-CoV-2 originated at WIV. NIAID’s funding of a U.S. group that works with the Wuhan lab has been stopped, which outraged many scientists.

Mikovits: Italy has a very old population. They’re very sick with inflammatory disorders. They got at the beginning of 2019 an untested new form of influenza vaccine that had four different strains of influenza, including the highly pathogenic H1N1. That vaccine was grown in a cell line, a dog cell line. Dogs have lots of coronaviruses.

There is no evidence that links any influenza vaccine, or a dog coronavirus, to Italy’s COVID-19 epidemic.

Mikovits: Wearing the mask literally activates your own virus. You’re getting sick from your own reactivated coronavirus expressions, and if it happens to be SARS-CoV-2, then you’ve got a big problem.

It’s not clear what Mikovits means by “coronavirus expressions.” There is no evidence that wearing a mask can activate viruses and make people sick.

Mikovits: Why would you close the beach? You’ve got sequences in the soil, in the sand. You’ve got healing microbes in the ocean in the salt water. That’s insanity.

It’s not clear what Mikovits means by sand or soil “sequences.” There is no evidence that microbes in the ocean can heal COVID-19 patients.

See the remainder of the article at Science.

Posted in American Association for the Advancement of Science, coronavirus, COVID-19, SARS-CoV-2, science, Science magazine | 3 Comments

We are all Mathematicians

Bobby Seagull.

Great.

Posted in mathematics, mathematics education, maths | Leave a comment

“There’s mourning in America”

"We are Republicans and we want Trump defeated."

And the Orange Mango apparently hates this advert.

And that’s why it’s here.

The Lincoln Project apparently introduced this advert on Twitter with the explanatory text:

Since you are awake and trolling the internet, @realDonaldTrump, here is a little bedtime story just for you … good night, Mr. President.#Sleeptight #CountryOverParty

Posted in statistics | Leave a comment

“Seasonality of COVID-19, Other Coronaviruses, and Influenza” (from Radford Neal’s blog)

Thorough review with documentation and technical criticism of claims of COVID-19 seasonality or its lack. Whichever way this comes down, the links are well worth the visit!

Will the incidence of COVID-19 decrease in the summer? There is reason to hope that it will, since in temperate climates influenza and the four coronaviruses that are among the causes of the “common cold” do follow a seasonal pattern, with many fewer cases in the summer. If COVID-19 is affected by season, this would […]

via Seasonality of COVID-19, Other Coronaviruses, and Influenza — Radford Neal’s blog

Quote | Posted on by | Leave a comment

Phase plane plots of COVID-19 deaths

There are many ways of presenting analytical summaries of time series data for which the underlying mechanisms are incompletely understood. With respect to series describing the COVID-19 pandemic, Tamino has used piecewise linear models. I have mentioned how I prefer penalized (regression) splines. I intended to illustrate something comparable to what Tamino does (see also), but then I thought I could do both that and expand the discussion to include a kind of presentation used in functional data analysis, namely phase plane plots. Be sure to visit that last link for an illustration of what the curves below mean.

Here I’ll look at various series describing regional characteristics of the COVID-19 pandemic. These include counts of deaths, counts of cases, and counts of the number of recovered people. My data source is exclusively the Johns Hopkins Coronavirus Resource Center and, in particular, their Github repository. I have also examined a similar repository maintained by The New York Times, but I find it not as well curated, particularly in how it admits the most recent data, which sometimes exhibits wild swings.

That said, while curation and data validation are important, no organization can do much when the reporting agencies do not provide counts uniformly or backfill counts to their proper dates. If a large number of deaths from COVID-19 is identified late, the proper procedure is to tag each death with an estimate of the day of death and correct the figures for those days. Instead, some government agencies appear to dump all the discovered counts in on the present day. There is evidence as well that some agencies hold on to counts until they reach some number, and then release them all at once. This matters because such sampling or reporting policies will manifest as changes in the dynamics of the disease when, in fact, they are nothing of the kind.

I’m not saying this kind of distortion is deliberate, although, in some cases, it could be. (We cannot tell if it’s deliberate evasion, bureaucratic rigidity, or simply they-don’t-care-as-long-as-they-report.) More likely it is indicative of overworked, fatigued demographers, epidemiologists, and public health professionals making a best effort to provide accurate counts. Johns Hopkins makes an effort to resolve some of these discrepancies. But it cannot do everything.

This idea of having real-time reporting is an essential part of mounting a proper pandemic response and is, at least in the United States, another thing for which we were woefully underprepared. Without it, even the highest government officials are flying low to the ground with a fogged-up window, so to speak.

Nevertheless, I have accepted the Johns Hopkins data as they are and analyzed them in the manner described below. I’ll comment on some features the plots show, some of which pertain to hints about how the data are collected and reported.

I’m very much in favor of avoiding the use of absolute counts. We don’t know how comprehensive those are, so I prefer to look at rates of change. Professor David Spiegelhalter discusses why rates are a good idea in connection with COVID-19 reporting.
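A small illustration of why, with made-up numbers: if reported counts are a roughly constant (unknown) fraction of the true counts, that fraction cancels out of the day-over-day relative growth rate, so growth rates are far less sensitive to under-ascertainment than the raw counts are.

```R
# Toy illustration (hypothetical numbers): a constant reporting fraction
# cancels out of the day-over-day relative growth rate of cumulative counts.
growth_rate <- function(cum_counts) diff(log(cum_counts))

true_counts     <- c(100, 150, 225, 340, 510)   # hypothetical true cumulative counts
reported_counts <- 0.4 * true_counts            # suppose only 40% are ever detected

all.equal(growth_rate(true_counts), growth_rate(reported_counts))  # TRUE
```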

Note that the data used only goes through the 29th of April 2020.

Approach

The thing about tracking counts of deaths is that, even if they are complete, which in the middle of a crisis they seldom are, these counts by themselves don’t tell you much. What do you want to know? Ultimately, you want to know how effective a set of policies is at curtailing deaths due to COVID-19. So, for instance, if the cumulative counts of deaths for Germany are considered:

this doesn’t really say a lot about what’s driving the shape of the curve. A phase plane plot is constructed by calculating good estimates of the time series’ first and second (time) derivatives at each point, and then plotting the second derivative against the first. The first derivative is the rate at which people are dying, and the second derivative is the change per unit time in that rate, just like the acceleration or deceleration of a car: a positive second derivative means acceleration, and a negative one means deceleration.
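Concretely, here is a minimal sketch of the construction in R, assuming cum_deaths is a numeric vector of cumulative deaths, one entry per day. This is an illustration of the idea only, not the code used to produce the figures below; it uses the pspline package, with the smoothing parameter chosen by generalized cross validation.

```R
# Minimal sketch of a phase plane plot. `cum_deaths` is a hypothetical numeric
# vector of cumulative COVID-19 deaths, one entry per day; this is not the
# code used to produce the figures in this post.
library(pspline)

phase_plane <- function(cum_deaths) {
  t <- seq_along(cum_deaths)
  # Penalized smoothing spline; norder = 3 penalizes the third derivative so
  # that the estimated second derivative is itself smooth, and method = 3
  # picks the smoothing parameter by generalized cross validation.
  fit <- smooth.Pspline(t, cum_deaths, norder = 3, method = 3)
  velocity     <- as.vector(predict(fit, t, nderiv = 1))  # deaths per day
  acceleration <- as.vector(predict(fit, t, nderiv = 2))  # change in that rate
  plot(velocity, acceleration, type = "l",
       xlab = "deaths per day (first derivative)",
       ylab = "change in deaths per day (second derivative)")
  abline(h = 0, lty = 2)  # a win is hugging this line near the origin
  invisible(data.frame(day = t, velocity = velocity, acceleration = acceleration))
}
```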

From the perspective of policy, if the rate is appreciably positive, you want policy measures which decelerate, driving that rate down. A win is when both first and second derivatives are near zero. So, for Germany,

An interpretation is that Germany began taking the matter seriously about the 25th-28th of March, activated a bunch of measures on the 31st of March, and has since struggled to decelerate rates of death. The loops denote oscillations where a policy is implemented and, for some reason, there’s a reaction, whether in government or the public, which accelerates the rate again. This tug of war has, on average, kept the rate of death at about 200 deaths a day, but it’s not going down.

A caution here, however: these dates of action should be interpreted with care. A death occurs about 10-14 days after infection, so the effects of a policy action won’t be seen in the death series for 2-3 weeks. Rather than saying Germany took it seriously on the 25th of March, I should have said around the 3rd of March, with actions implemented about the 10th of March.

Can it be managed as I say, driving both acceleration and rates to zero? Sure, consider Taiwan:

The disease got a little away from them in early March, and then again but to a smaller degree in early April, but now it’s controlled. Note, however, that the number of cases per day was never permitted to get above twenty.

A Review of Some Countries

So, what about the United States? It’s actually not too bad, except that it’s clear control measures have not been anywhere near as stringent as in the couple of countries already mentioned. There are no loops:

Sure, it’s good the rate is decelerating, but it’s not decelerating by much: Less than 50 deaths per day, per day. That’s when the rate is just under 2000 deaths per day.

What about the biggest contributor, New York State?

It’s clear New York is really struggling to get control of the pandemic, holding the rate to about 700 deaths a day, but it was accelerating again as of the reporting on the 29th of April. Those cycles do, at least, indicate that actions are being taken.

What about Sweden? They’ve been touted as not having any lockdowns. I’d say Sweden is in trouble:

But that judgment rests on basically just a couple of data points. Perhaps they are askew, and the sharp upward swing isn’t real. After all, they seem to have managed to keep the rate of death to about 70 per day on average, albeit with a wide scatter.

The Scene from (Some of) the States

Finally, let’s look at some states in the United States.

Consider Michigan, for instance, scene of much conflict over the lockdown measures their governor has taken:

Michigan has struggled as well, but whatever was done around the 10th of April really slammed the brakes on the number of deaths.

Florida is interesting because although the number of deaths is (reported as) 50 per day, there is little evidence of acceleration or deceleration.

They have 1200 deaths. One interpretation is that it’s still early in Florida. Another is that all the deaths have not yet been reported. Another is that, for whatever reason, people dying of COVID-19 are being given a cause of death which is something else. There are a lot of elderly people in Florida. Perhaps a comparison of overall mortality rates there with historical rates would be advisable. Note also that the overall shape of the acceleration versus rate of death in Florida is similar to that of the United States at large, although the magnitudes are different.

Finally, consider Georgia, another state where policy on managing COVID-19 has been contentious:

Despite the contention, it’s clear Georgia has been doing something effective to contain the disease. The trouble is that the latest data suggest it is beginning to get away from them, although it’s too early to say whether that will continue or they will find a way to bring it back under control.

Update, 3rd May 2020

I will be updating these results regularly. I also intend to drape a varying uncertainty region over the curves, calculated from the uncertainties in estimating acceleration and velocity.

The R code used to generate these is still in progress, but when it matures a bit I will make it publicly available.

Posted in coronavirus, COVID-19, functional data analysis, pandemic, penalized spline regression, phase plane plot, SARS-CoV-2, splines | 12 Comments

A SimCity for the Climate

SimCity is/was a classic simulation game teaching basics of public policy, energy management, and environmental regulation. My kids played it a lot. Heck, I played it a lot.

Now, Climate Interactive, Tom Fiddaman of Ventana Systems, Prof John Sterman of MIT Sloan, and Prof Juliette Rooney-Varga of UMass Lowell’s Climate Change Initiative have teamed up to produce En-Roads, a climate change solutions simulator.

It’s pretty involved.

In addition, Bloomberg Green has adapted the simulator with a simpler menu interface to allow you to try some things yourself.

See more about the basic idea.

And check out Bloomberg Green‘s climate data dashboard.

Posted in Bloomberg, Bloomberg Green, climate activism, climate business, climate change, climate data, climate disruption, climate economics, climate education, Climate Interactive, climate models, climate policy, global warming, science | Leave a comment

Simplistic and Dangerous Models

Nice to see Generalized Additive Models used.

Musings on Quantitative Palaeoecology

A few weeks ago there were none. Three weeks ago, with an entirely inadequate search strategy, ten cases were found. Last Saturday there were 43! With three inaccurate data points, there is enough information to fit an exponential curve: the prevalence is doubling every seven days. Armchair epidemiologists should start worrying that by Christmas there will be 10^12 preprints relating COVID-19 to weather and climate unless an antidote is found.

Fortunately, a first version of an antidote to one form of the preprint plague that is sweeping the planet, known as the SDM (apparently this is not an acronym for Simplistic and Dangerous Model), is now available. A second version is due to be published soon.

So why are SDMs such a bad tool for trying to model the spread of COVID-19?

1) The system is not at equilibrium

The first case of what has become known as COVID-19 was reported on

View original post 717 more words

Posted in Generalized Additive Models, non-parametric statistics, science, statistics | Leave a comment

Major Ocean Currents Drifting Polewards

Living on Earth, the environmental news program of Public Radio, featured Amy Bower, Senior Scientist at Woods Hole Oceanographic Institution, on 27th March 2020 to discuss new research from the Alfred Wegener Institute showing that major ocean currents are drifting polewards.

Dr Bower explained the research in this interview (for which there is also a transcript), and its implications for climate and fisheries.

Posted in science | Leave a comment

Keep fossil fuels in the ground

Ah, wouldn’t it be lovely!?

Is this the beginning of the Minsky Moment Mark Carney has feared? In short, his fear was that the trading markets had not priced in (a) the risks from climate change, and (b) the risks from fossil fuels being abandoned in favor of clean energy. Dr Carney has repeatedly expressed his concern that if impacted companies did not provide this risk information to investors, investors would be blind to the impending change and what it might cost them. The Minsky Moment, then, would be a sudden awakening on the part of investors: acknowledging the risk and, in a short time, pricing its implications into the values of their holdings, causing a market crash.

Posted in Anthropocene, being carbon dioxide, catastrophe modeling, clean disruption, climate change, climate disruption, climate economics, climate policy, Cult of Carbon, fossil fuel divestment, fossil fuels, Mark Carney, Minsky moment | Leave a comment

“Lockdown WORKS”

Tamino favors LASSO, LOESS, and piecewise linear models. I favor splines, especially penalized smoothing splines via the R package pspline, using generalized cross validation to set the smoothing parameter. Tamino looks for breaks in the piecewise linear case to check and test for significant changes. I use the first and higher derivatives of the spline.

Both methods are sound and good.

I don’t know how you might use a random forest regression for this purpose, but I bet there is a way. I doubt it is as good, though.
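For concreteness, here is a minimal sketch of the spline side of that comparison, assuming daily_deaths is a numeric vector of new deaths per day (a hypothetical input, not Tamino’s data or method): smooth the log counts with a penalized smoothing spline whose smoothing parameter is chosen by generalized cross validation, then look for sign changes in the first derivative, which play a role loosely analogous to the breaks in a piecewise linear fit.

```R
# Minimal sketch: penalized smoothing spline with GCV, using sign changes in
# the first derivative to flag turns in the trend. `daily_deaths` is a
# hypothetical numeric vector of new deaths per day.
library(pspline)

trend_turns <- function(daily_deaths) {
  t <- seq_along(daily_deaths)
  # Smooth log(counts + 1), since the series spans orders of magnitude;
  # method = 3 selects the smoothing parameter by generalized cross validation.
  fit <- smooth.Pspline(t, log1p(daily_deaths), norder = 3, method = 3)
  slope <- as.vector(predict(fit, t, nderiv = 1))
  # Days at which the smoothed trend switches between rising and falling.
  which(diff(sign(slope)) != 0)
}
```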

Open Mind

Over 2400 Americans died yesterday from Coronavirus. Here are the new deaths per day (“daily mortality”) in the USA since March 10, 2020 (note: this is an exponential plot)

View original post 241 more words

Posted in forecasting, penalized spline regression, science, splines, statistical regression, statistical series, statistics, time series | 1 Comment

“coronavirus counts do not count”

via coronavirus counts do not count

Quote | Posted on by | Leave a comment

Cloud brightening hits a salty snag

The proposal known as solar radiation management is complicated. It just got more so. Released Wednesday:

Fossum, K. N., Ovadnevaite, J., Ceburnis, D., et al., “Sea-spray regulates sulfate cloud droplet activation over oceans“, npj Climate and Atmospheric Science, 3, 14 (2020). https://doi.org/10.1038/s41612-020-0116-2

[open access]

The above reports experimental results regarding the effects of sea-salt spray upon a proposed technique to enhance the Twomey effect.

Solar geoengineering attempts to mitigate some of the effects of climate disruption by fossil fuel emissions, using various technologies to increase Earth’s albedo. While there are water-droplet-based techniques, many involve injecting sulfur-derived droplets into the atmosphere at differing heights, because these droplets are bright. Were these techniques to succeed, the increased albedo would mean that less solar radiation reaches Earth’s surface, thus cooling it. (See also.)

Assuming this remedy worked, meaning it controlled increases in surface and oceanic temperatures, it would not solve ocean acidification, because carbonic acid and related concentrations in the oceans will continue to increase as long as emissions from fossil fuels and agriculture continue. Moreover, there are attendant risks: these particles fall out of the atmosphere, so they need to be continually replenished. There are concerns, as noted in a talk cited here by the late Professor Wally Broecker, that there may not be a sufficient supply of sulfur, or that this may pinch the market for sulfur, impacting prices of other products which demand it. And if the replenishment were interrupted, say, by a war or a pandemic, there are concerns regarding the impulse response of a climate system in which the blocking of radiation is suddenly released (“termination shock”).

Professor David Keith of Harvard University is a proponent of these measures. There are many, including Professor Ray Pierrehumbert, who are highly skeptical and suspicious of them (“albedo hacking”, a term originally due to Kintisch in 2010).

And there are also concerns this might not work, at least not well. The reasoning goes: if large-scale impacts to humanity on the planet are to be guarded against by betting on such a technique, especially given the moral hazard that the true solution, stopping emissions from fossil fuel burning, might be hindered by the technology’s promise, then it had better be known to work, and work well.

One measure, cloud brightening, was dealt a blow by the paper cited at the top. It should be noted that

Horowitz, H. M., Holmes, C., Wright, A., Sherwen, T., Wang, X., Evans, M., et al. (2020). “Effects of sea salt aerosol emissions for marine cloud brightening on atmospheric chemistry: Implications for radiative forcing“, Geophysical Research Letters, 47, e2019GL085838. https://doi.org/10.1029/2019GL085838

(Also open access)

also reported on interactions of sea salt with aerosols in the marine cloud brightening case.

The net result, from the abstract of Fossum et al.:

We present new experimental results from the remote Southern Ocean illustrating that, for a given updraft, the peak supersaturation reached in cloud, and consequently the number of droplets activated on sulfate nuclei, is strongly but inversely proportional to the concentration of sea-salt activated despite a 10-fold lower abundance. Greater sea-spray nuclei availability mostly suppresses sulfate aerosol activation leading to an overall decrease in cloud droplet concentrations; however, for high vertical updrafts and low sulfate aerosol availability, increased sea-spray can augment cloud droplet concentrations. This newly identified effect where sea-salt nuclei indirectly controls sulfate nuclei activation into cloud droplets could potentially lead to changes in the albedo of marine boundary layer clouds by as much as 30%.

The atmosphere is a complicated beastie.

Posted in adaptation, American Association for the Advancement of Science, American Meteorological Association, atmosphere, being carbon dioxide, carbon dioxide, chemistry, climate disruption, climate mitigation, climate nightmares, climate policy, cloud brightening, ecomodernism, emissions, geoengineering, global warming, Ken Caldeira, leaving fossil fuels in the ground, meteorological models, meteorology, mitigating climate disruption, NASA, National Center for Atmospheric Research, oceanography, Principles of Planetary Climate, Ray Pierrehumbert, risk, solar radiation management, sustainability, Wally Broecker, water vapor, wishful environmentalism, zero carbon | Leave a comment

Machiavelli

It’s right out of Machiavelli’s The Prince.

#covid_19 #coronavirus

Even for the Trump administration, it is odd they are pushing #Hydroxychloroquine and #Azithromycin so hard, against medical advice and evidence.

I’ve thought about this and, given the growing animosity between Trump and Navarro on the one hand and the physicians on the other, I think I know what’s going on. By vigorously advancing #Hydroxychloroquine and #Azithromycin, Trump, Giuliani, and Navarro are setting up the medical community and HHS/CDC/FDA to be the fall guys in case the #covid_19 virus produces far more fatalities than the administration’s projections. Note, by the way, that no reputable forecaster backs those projections up, and the administration is refusing to reveal how they came up with them. If they push these drugs and their efficacy is found to be zero or worse, they know the medical community will oppose their wholesale administration; and if the pandemic then causes many more American deaths, they can claim that, well, if only the medical community hadn’t opposed them so much, maybe it wouldn’t have been so bad. In other words, they’re setting things up so they don’t take the blame for the deaths, or at least can sow doubt among their core supporters.

Typical Trump administration. Typical Machiavelli.

Update, 2020-04-07

Government officials have reportedly been told not to contradict Trump in public regarding his support for hydroxychloroquine and azithromycin. The original report apparently came from Politico. In addition, Trump is directing some of his government medical staff to work on getting these drugs into the population, thereby diminishing the number of staff actually contending with the pandemic.

Without surprise, the American medical community, as demonstrated by an advisory editorial in the Annals of Internal Medicine, continues to argue that using these drugs as prophylactic or treatment against COVID-19 is inadvisable and harmful.

There are also now concerns about the methods and data analysis in the paper most cited in support of using these drugs, including questions about how p-values were calculated (Yates’s correction was not used for the contingency tables). A critical scholar did a reanalysis using better methods and discovered, among other things, that a mixed-effects logistic regression1 fit poorly but suggested that the improvement in outcomes resulted primarily from the passage of time.

1 Using R‘s glmer function, “with chloroquine and time as fixed effects, and a random intercept per patient”.
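For concreteness, a model of that general form might be specified as below. This is a hypothetical sketch, not the reanalysis itself; the data frame dat and the column names viral_pos, chloroquine, day, and patient are assumed purely for illustration.

```R
# Hypothetical sketch of a mixed-effects logistic regression of the kind
# described in the footnote: chloroquine and time as fixed effects, with a
# random intercept per patient. `dat`, `viral_pos`, `chloroquine`, `day`,
# and `patient` are illustrative names, not the original study's.
library(lme4)

fit <- glmer(viral_pos ~ chloroquine + day + (1 | patient),
             family = binomial, data = dat)
summary(fit)  # examine the fixed-effect estimates for treatment and time
```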

Afterthought, 2020-04-07

Since hydroxychloroquine is used to treat lupus, you would think that those afflicted with lupus would have a statistically lower chance of getting COVID-19 were hydroxychloroquine a prophylactic for it. Any such comparison would need to be adjusted because lupus itself is an autoimmune disease which weakens immune systems and makes its victims more susceptible to infection. However, there is no evidence that these patients are protected and, in fact, lupus patients need to be even more careful about handwashing and distancing.

Posted in American Association for the Advancement of Science, American Statistical Association, anti-science, coronavirus, COVID-19, Machiavelli, SARS-CoV-2 | Leave a comment

Oldie and Goodie: `Testing a point Null Hypothesis: The irreconcilability of p-values and evidence’

A blog post by Professor Christian Robert mentioned a paper by Professors James Berger and Tom Sellke, which I downloaded several years back but never got around to reading.

J. O. Berger, T. M. Sellke, “Testing a point Null Hypothesis: The irreconcilability of p values and evidence”, Journal of the American Statistical Association, March 1987, 82(397), Theory and Methods, 112-122.

I even overlooked the paper when I lectured at my former employer about the American Statistical Association’s statement on statistical significance and p-values. But it’s great. The abstract is below.

Abstract

The problem of testing a point null hypothesis (or a “small interval” null hypothesis) is considered. Of interest is the relationship between the P value (or observed significance level) and conditional and Bayesian measures of evidence against the null hypothesis. Although one might presume that a small P value indicates the presence of strong evidence against the null, such is not necessarily the case. Expanding on earlier work [especially Edwards, Lindman, and Savage (1963) and Dickey (1977)], it is shown that actual evidence against a null (as measured, say, by posterior probability or comparative likelihood) can differ by an order of magnitude from the P value. For instance, data that yield a P value of .05, when testing a normal mean, result in a posterior probability of the null of at least .30 for any objective prior distribution. (“Objective” here means that equal prior weight is given the two hypotheses and that the prior is symmetric and nonincreasing away from the null; other definitions of “objective” will be seen to yield qualitatively similar results.) The overall conclusion is that P values can be highly misleading measures of the evidence provided by the data against the null hypothesis.
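A quick way to get a feel for those magnitudes is the closely related calibration from later work by Sellke, Bayarri, and Berger (2001): for p < 1/e, the Bayes factor in favor of the point null is bounded below by -e·p·log(p). It is not the exact bound derived in the 1987 paper, but, with equal prior weights, it reproduces the flavor of the numbers in the abstract:

```R
# Sellke-Bayarri-Berger (2001) calibration: for p < 1/e, a lower bound on the
# Bayes factor in favor of a point null is B(p) >= -e * p * log(p). With equal
# prior weight on H0 and H1, the posterior probability of H0 is B / (1 + B).
bf_lower_bound <- function(p) ifelse(p < exp(-1), -exp(1) * p * log(p), 1)

posterior_null_lower_bound <- function(p) {
  B <- bf_lower_bound(p)
  B / (1 + B)
}

round(posterior_null_lower_bound(c(0.05, 0.01, 0.001)), 3)
#> 0.289 0.111 0.018   -- far larger than the p-values themselves
```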

Posted in American Statistical Association, Bayes, Bayesian, p-value | Leave a comment