## So, today, a diversion …

##### (Updated 2020-07-06, see end.)

So, today I report on a diversion, a bit off the beaten path for me. It resulted from an “I wonder” comment by Rob Honeycutt at Tamino’s “Open Mind” blog, on a post concerning how predictive the party affiliation of a state’s governor might be of that state’s recent performance with respect to COVID-19 (titled “COVID-19: Red States Blue States Redux”).

Rob supposed:

> I would think running the numbers by county would be more accurate, though I guess that’s a much bigger task. CA, for example, is obviously predominantly blue, but there are a lot of small red counties which are now seeing a dramatic rise in cases.

And I replied:

> Sounds like a logistic regression problem, although I’d/I’ll do it as a classification problem for a random forest. Anyone know where I can get a tabulation of counties vs Red or Blue status? Otherwise I’ll dig around on the Five Thirty Eight GitHub and see if they have one.

Well, I didn’t find anything pertinent at Five Thirty Eight, and I didn’t do it as a classification problem, deciding that after the fact.

I found data sources.

One is the tried-and-true Johns Hopkins GitHub of COVID-19 cases and deaths which I have used elsewhere here. I’ve also discussed sources of COVID-19 data in general.

A second is the U.S. Bureau of Economic Analysis and, specifically, their county level personal income and demographics dataset.

And, rounding out the collection, there is MIT’s ElectionLab which offers datasets with demographics showing county level results through the 2018 General Elections. Their data dictionary is here. This provided the missing classification of Red vs Blue counties.

However, this also lets me tout a basic principle of good data science and statistics, something I learned from the books and lectures of Professor Frank Harrell, Jr of Vanderbilt University in Nashville, TN. In particular, Professor Harrell (and others) emphasizes the damage premature dichotomization can do to a statistical analysis.

So, instead of categorizing counties as Red vs Blue, or performance on COVID-19 as good vs bad, I am trying to push everything onto ordered measures and, preferably, onto continuous measures. Indeed, the more I practice data science, the more I am convinced that, in the setup of experiments, data gathering, or analysis, if categorical or yes/no variables can at all be avoided, they should be. Rather than measuring whether an Internet customer bought something or not, measure the time they spent considering it! Interpreting the dwell time on an item is qualitatively different depending on whether they buy it or not. In the former case, they were willing to devote that amount of time to investigating. In the latter, it might be that they are meticulous and decided against, or it could be that the response was too slow and discouraged them. But even in the meticulous case, it shows they were nearly at the point of a buy.

So, for assessing how well or poorly a county is doing on COVID-19, I took the counts of cases from the county for the last 45 days. I fit a (penalized) smoothing spline to that. I used the smoothing spline to calculate a first derivative for all the days I could, and I took the mean of those derivatives. That mean became the score of how well or poorly the county is doing. If the mean is positive, the number of cases is going up, and the county is doing poorly; if the mean is negative, cases are decreasing.
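A minimal sketch of that scoring step, here in Python with scipy rather than the R actually used (the series and the default smoothing level are illustrative):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def county_trend_score(daily_cases):
    """Score a county by fitting a smoothing spline to its last 45 days
    of case counts and averaging the spline's first derivative.
    Positive: cases rising (doing poorly); negative: cases falling."""
    y = np.asarray(daily_cases, dtype=float)[-45:]
    t = np.arange(len(y))
    spline = UnivariateSpline(t, y)      # penalized smoothing spline
    slope = spline.derivative(1)(t)      # first derivative on each day
    return float(slope.mean())

# A county with steadily rising case counts gets a positive score.
rising = np.cumsum(np.linspace(1.0, 10.0, 60))
print(county_trend_score(rising) > 0.0)
```

A falling series scores negative by the same logic, which is what makes the mean slope usable as a continuous "how is this county doing" response.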

This all said, why am I clearly drifting my analysis into the world of, oh, politics?

While Five Thirty Eight’s Nate Silver might claim politics hasn’t as much data available as sports does, a lot of political data exist, and they are well vetted in comparison with environmental or, I cringe to say, Internet data. So it is, in a sense, a joy to work with political data. Most real Internet data, the kind with which my professional career has been most deeply associated, need to be collected by means outside the Internet, and there are interesting investigations and datasets available when that is done.

Still, a lot of the counts and data pertaining to COVID-19 cases and deaths are fuzzy. We don’t necessarily have a reputable tally of sources in all states of, say, the United States, let alone counties, and let alone countries outside the United States. Assessing reputation depends upon knowing the commitment of authorities to presenting counts aligned with time, upon authorities wanting to be transparent (some don’t: I know, that’s shocking), and upon knowing that rules for reporting are consistent, or that efforts are made to keep them consistent even when they change. Political authorities, from my perspective and for the most part, tend to want to do as little as possible, and to expend as small a budget as possible, to meet minimal reporting goals, if that.

The core idea was to develop a rich matrix of predictors or covariates consisting of demographic and other data pertaining to United States counties, and to match that with a response variable expressing the overall (mean) slope of increase in COVID-19 cases. While I know counts of cases of COVID-19 aren’t the best measure, with case density per test or population estimates from random sampling being better, counts are widely available, and they are the measure Tamino chose in his original post, namely cases per day per million population. Then, once matched, the plan was to regress the response against the predictors and develop a model which could be used for variable selection, that is, for determining those covariates most predictive of increases in COVID-19 cases.

I chose to use a random forests modeling approach, on purely numerical outcome and predictors. The outcome was obtained, as described above, by taking the last 45 days of confirmed COVID-19 counts per county and fitting them with a smoothing spline. The first (time) derivative of that spline was then calculated for each of the last 45 days, and the mean of those derivatives, or slopes, was used as the response. However, the response was not calculated from raw counts of cases. Rather, the rescaled or, as sometimes described, standardized case counts over time were used, lest counties with more people bias the result as if their experiences were more representative. After all, the point is to get a per-county picture.

Rescaling was applied to the covariates, too. In this case it had a beneficial side effect: some of the 3083(*) counties had missing values for some covariates. Tossing those counties would have reduced the sample size and introduced a bias, because quite a few counties were missing the value of at least one covariate. After rescaling, the mean of each rescaled covariate is zero, so missing values could be replaced by zeros, which is equivalent to imputing the mean for each missing value.
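The equivalence of zero-filling after standardization and mean imputation can be seen in a small Python sketch (illustrative, not the R code actually used):

```python
import numpy as np

def rescale_and_impute(X):
    """Standardize each column to mean 0 and (sample) sd 1, computed while
    ignoring missing values, then replace missing entries with 0.  Since
    the rescaled column mean is 0, a filled-in 0 is exactly the imputed
    column mean."""
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0, ddof=1)
    Z = (X - mu) / sd
    return np.where(np.isnan(Z), 0.0, Z)

X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])
Z = rescale_and_impute(X)
print(Z[1, 1])   # the missing value became 0, i.e. the rescaled mean
```

No rows are dropped, so the full set of counties stays in the sample.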

What were the covariates? Most are described in the data dictionary, but I added a few more. Here’s the complete list:

trump16
Presidential candidate Trump vote total in 2016.
clinton16
Presidential candidate Clinton vote total in 2016.
otherpres16
Presidential candidate vote total in 2016 for non-major party.
romney12
Presidential candidate Romney vote total in 2012.
obama12
Presidential candidate Obama vote total in 2012.
demsen16
Democratic Senate candidate vote total in 2016.
repsen16
Republican Senate candidate vote total in 2016.
othersen16
Non-major parties Senate vote total in 2016.
demhouse16
Democratic House candidate vote total in 2016.
rephouse16
Republican House candidate vote total in 2016.
otherhouse16
Non-major party House candidate vote total in 2016.
total_population
Total population as of 2016.
white_pct
Non-Hispanic whites as a percentage of total population, 2016.
black_pct
Non-Hispanic blacks as a percentage of total population, 2016.
hispanic_pct
Hispanics or Latinos as a percentage of total population, 2016.
nonwhite_pct
Non-whites as a percentage of total population, 2016.
foreignborn_pct
Foreign-born population as a percentage of total population.
female_pct
Females as a percentage of total population, 2016.
age29andunder_pct
Population 29 years or under as a percentage of total population, 2016.
age65andolder_pct
Population 65 years or older as a percentage of total population, 2016.
median_hh_inc
Median household income in the past 12 months (in 2016 inflation-adjusted dollars), 2016.
clf_unemploy_pct
Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
lesshs_pct
Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
lesscollege_pct
Population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
lesshs_whites_pct
White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
lesscollege_whites_pct
White population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
rural_pct
Rural population as a percentage of total population, 2016.
ruralurban_cc
Rural-urban continuum codes. See data dictionary for table of specific code values. These run as positive integers from 1 to 9, with 1 representing, for example, “Counties in metro areas of 1 million population or more” and 9 representing “Completely rural or less than 2,500 urban population, adjacent to a metro area”. These are not continuous values, but they are ordered.
trump.clinton.ratio
This is a new covariate I introduced. To normalize away county population size, it is the ratio of votes in the county in 2016 for presidential candidate Trump to the sum of votes for Clinton and non-major party candidates. Clinton’s votes are summed with the non-major parties’ to avoid a zero denominator, since some counties recorded no votes for Clinton.
trump.obama.ratio
This is a new covariate I introduced. To normalize away county population size, it is the ratio of votes in the county for presidential candidate Trump in 2016 to the sum of votes for Obama and non-major party candidates in 2012. Obama’s votes are summed with the non-major parties’ to avoid a zero denominator, since some counties recorded no votes for Obama.
senate.ratio
This is a new covariate I introduced. To normalize away county population size, it is the ratio of votes for Republican Senate candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. The Democratic votes are summed with the non-major parties’ to avoid a zero denominator, since some counties recorded no votes for Democrats.
house.ratio
This is a new covariate I introduced. To normalize away county population size, it is the ratio of votes for Republican House candidates in the county in 2016 to the sum of votes for Democratic and non-major party candidates. The Democratic votes are summed with the non-major parties’ to avoid a zero denominator, since some counties recorded no votes for Democrats.
PerCapitaDollars
This is from BEA government data. It is the sum of income of residents of the county divided by the number of residents of the county. See the documentation for methodology. 2018 data.
RankInState
This is from BEA government data. It is the rank of total personal income of residents of the county among all counties in its state, with number 1 being the county with the largest income. 2018 data.
PctChgFrom2017
This is the percent change in personal income of residents in a county for 2018 relative to 2017.
##### (*) Different sources have differing numbers of counties, and these need to be reconciled. Also, some sources, such as the BEA data, include sub-tabulations for regions and states. There are also errors: in the MIT data, Virginia, for example, is listed twice, with different values for, e.g., median household income. Finally, spellings of county names need to be reconciled, and a few counties in one list are missing from others. 3083 is the number of counties common to all the datasets.
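The reconciliation amounts to normalizing county names to a common join key and intersecting the sources; a small Python sketch of the idea (the normalization rules and sample names here are illustrative, not the exact ones used):

```python
def county_key(state, county):
    """Build a normalized join key so that 'St. Louis' vs 'Saint Louis',
    or a trailing ' County', doesn't cause a spurious mismatch."""
    name = county.lower().strip()
    for suffix in (" county", " parish"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    name = name.replace("st.", "saint").replace(".", "").replace("'", "")
    return (state.upper(), name)

def common_counties(*sources):
    """Keep only the county keys present in every source."""
    keyed = [{county_key(s, c) for (s, c) in src} for src in sources]
    return set.intersection(*keyed)

jhu = [("MO", "St. Louis County"), ("VA", "Fairfax County")]
mit = [("MO", "Saint Louis"), ("VA", "Fairfax"), ("TX", "Loving")]
print(sorted(common_counties(jhu, mit)))
```

In practice FIPS codes, where available, are a sturdier join key than names, but spelling reconciliation of this kind is still needed when sources lack them.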

Wright and Ziegler’s ranger package for R was used to explore these data and perform a regression. Varying numbers of trees were tried, and the final result was based upon a forest of 40,000 trees. Two cores were allocated on a 3.2 GHz 4-core Windows 7 system. Up to 6 variables were considered for splitting at each node. The minimum node size was 5, meaning there had to be at least 5 cases passing through each node. The splitting rule was “extra trees”, via the option implementing:

Geurts, P., Ernst, D., Wehenkel, L. (2006). Extremely randomized trees. Machine Learning 63:3-42. https://doi.org/10.1007/s10994-006-6226-1.

Ten random splits were considered for each splitting variable.

There were actually two phases of random forest fitting performed. The first was done to ascertain covariate importance. The second was a regression. Both used 40,000 trees.

In the first phase, permutation importance was used, with importance scaled and local importance values calculated as in:

Breiman, L. (2001). Random forests. Machine Learning, 45:5-32. https://doi.org/10.1023/A:1010933404324.
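As a reminder of what permutation importance measures, here is a self-contained numpy illustration with a toy predictor (this shows the general idea only, not ranger’s exact computation; the model and data are made up):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=30, seed=0):
    """For each covariate, shuffle that column, re-predict, and record the
    mean increase in squared error over the unpermuted baseline.  Large
    increases mean the model leans on that covariate."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict(X) - y) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            scores[j] += np.mean((predict(Xp) - y) ** 2) - base
    return scores / n_repeats

# Toy model: the response depends on column 0 only, so permuting
# column 1 should cost (essentially) nothing.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0]
imp = permutation_importance(lambda M: 3.0 * M[:, 0], X, y)
print(imp[0] > imp[1])
```

Because the score is computed from prediction error directly, it applies to regression just as naturally as to classification, which is why it suits this application.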

Permutation importance was used because this was a regression application. The resulting normalized importance scores, sorted in decreasing order of importance, were:

###### Normalized variable importance

| variable | importance |
|---|---|
| demhouse16 | 0.077281 |
| black_pct | 0.069349 |
| othersen16 | 0.056643 |
| trump.obama.ratio | 0.053579 |
| demsen16 | 0.051434 |
| repsen16 | 0.049935 |
| senate.ratio | 0.044154 |
| otherhouse16 | 0.041854 |
| lesshs_whites_pct | 0.037017 |
| lesshs_pct | 0.035589 |
| hispanic_pct | 0.033906 |
| rephouse16 | 0.032685 |
| nonwhite_pct | 0.030671 |
| white_pct | 0.030221 |
| obama12 | 0.028720 |
| rural_pct | 0.025665 |
| clf_unemploy_pct | 0.025534 |
| PerCapitaDollars | 0.025463 |
| ruralurban_cc | 0.024347 |
| RankInState | 0.024095 |
| clinton16 | 0.023788 |
| otherpres16 | 0.020571 |
| median_hh_inc | 0.020259 |
| trump16 | 0.018410 |
| foreignborn_pct | 0.016596 |
| total_population | 0.016517 |
| romney12 | 0.016443 |
| lesscollege_whites_pct | 0.015623 |
| lesscollege_pct | 0.015530 |
| age65andolder_pct | 0.012926 |
| age29andunder_pct | 0.010310 |
| PctChgFrom2017 | 0.008777 |
| female_pct | 0.006107 |
| trump.clinton.ratio | 0.000000 |
| house.ratio | 0.000000 |

Covariates below the 0.02 quantile of importance were discarded from further consideration. This eliminated female_pct, trump.clinton.ratio, and house.ratio.

Regression was done with the remaining covariates. The fit was barely adequate(**), with an $R^{2} = 0.536$ and a root-mean-squared error of 4.64. Here is the original mass density of response values:

Recall these were based upon slopes derived from centered and scaled case counts:

```r
scaledDerivativesFrom <- function(setOfSeries)
{
  # Expects cases to be in columns, and the series to be column vectors, that is,
  # the successive time points are in successive rows.
  nr <- nrow(setOfSeries)
  nc <- ncol(setOfSeries)
  tocks <- 1:nc
  scaledSetOfSeries <- scale(setOfSeries)
  centers <- attr(scaledSetOfSeries, "scaled:center")
  scalings <- attr(scaledSetOfSeries, "scaled:scale")
  fdFromSeries <- suppressWarnings(fdata(mdata=setOfSeries, argvals=tocks))
  splining <- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=0,
                                           method="bspline", class.out="fdata",
                                           nbasis=basisSize(nr)))
  fst <- suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=1,
                                      method="bspline", class.out="fdata",
                                      nbasis=basisSize(nr)))
  fstDerivatives <- colMeans(fst$data)
  return(list(fst=fstDerivatives, ctr=centers, sca=scalings))
}
```

`fdata` and `fdata.deriv` are from the fda.usc package.

After this preparation, the point of the regression was to determine which covariates contribute to increases in COVID-19 prevalence and which reduce it. To determine these, a baseline prediction was made with all rescaled covariates set to their mean values, that is, zeros. Because of the scaling, varying a covariate up or down by a single standard deviation means changing its value by one unit. Accordingly, two predictions of the response were made with all covariates zero apart from a single covariate of interest: one with that covariate at plus one-half, the other at minus one-half. All trees in the regression forest were used for these predictions, and the responses were recorded in each case.

If the responses for each covariate are arranged as minus one-half, baseline, and plus one-half, there are two cases in which they are monotonic: monotonically increasing or monotonically decreasing. If the responses increased monotonically, the covariate was contributing to the increase in COVID-19 incidence. If they decreased monotonically, the covariate was impeding growth in COVID-19. All other cases were interpreted as equivocal. A full listing of the responses is given below. A couple of covariates had no significant effect upon the response.
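The classification step can be sketched generically in Python with a stand-in predictor (the ranger forest itself is not reproduced here, and the function `f` is made up for illustration):

```python
import numpy as np

def _with_value(x, j, v):
    """Copy x and set covariate j to v."""
    x = x.copy()
    x[0, j] = v
    return x

def covariate_direction(predict, n_covariates, j, delta=0.5):
    """Predict at the baseline (all rescaled covariates 0) and with
    covariate j at -delta and +delta; classify by monotonicity."""
    x = np.zeros((1, n_covariates))
    lo, base, hi = (float(predict(_with_value(x, j, v))[0])
                    for v in (-delta, 0.0, delta))
    if lo < base < hi:
        return "increasing"   # covariate associated with rising incidence
    if lo > base > hi:
        return "decreasing"   # covariate associated with falling incidence
    return "equivocal"

# Stand-in predictor: rises with covariate 0, falls with covariate 1,
# and is non-monotone in covariate 2.
f = lambda X: 2.0 * X[:, 0] - X[:, 1] + X[:, 2] ** 2
print([covariate_direction(f, 3, j) for j in range(3)])
# → ['increasing', 'decreasing', 'equivocal']
```

Applied to every retained covariate in turn, this yields exactly the three-way classification used in the conclusions below.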

The code and datasets for this analysis are available here. I did not place these in my Github because some of the datasets and results (including the random forests) are pretty large.

##### (**) Slightly more than half of the variance was explained. That certainly is not good enough for prediction or forecasting.

In conclusion, most of the covariates did not show a monotonic response. Of those which did, some had surprising effects upon the response, contrary to what might be expected. In all cases, this shows that, whatever is happening at the state level, as Tamino originally hypothesized, the story at the county level is far more nuanced.

According to the regression, the increases in the following covariates are associated with increases in COVID-19 incidence:

• Number of votes for non-major party presidential candidates in 2016.
• Number of votes for non-major party House candidates in 2016.
• Hispanics or Latinos as a percentage of total population, 2016.
• Personal per capita income, 2018.
• Percent change in per capita income from 2017.

According to the regression, increases in the following covariates are associated with decreases in COVID-19 incidence:

• Number of votes for Democratic party House candidates in 2016.
• Non-Hispanic blacks as a percentage of total population, 2016.
• Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
• Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
• White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
• The ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and non-major party candidates in 2012.

The remainder had non-monotonic effects upon the response, with the exception of votes for Trump in 2016, votes for Romney in 2012, and the ratio of Trump votes to the sum of votes for Clinton and non-major party presidential candidates in 2016. These three had no significant effect upon the response.

These results are difficult to interpret, certainly from any conventional-wisdom frame. A story could be told about a contrarian propensity for supporting non-major political parties. A story could be told about wealth, especially recent wealth. And a more conventional story might be told about support for Democratic House candidates. But the analysis is too weak to build much of a case here. For why would a county’s historical support for Trump over Obama and others be associated with decreases in COVID-19?

Another analysis, perhaps using Bayesian additive regression trees (BART) or a generalized additive model (GAM), might reveal more. I have included the result of a straight linear regression at the bottom of the post.

The evidence indicates that factors contributing to increases in COVID-19 across counties are more complicated than state political trends suggest.

The table of predicted responses for all regressed covariates is:

| covariate | decrease covariate | baseline | increase covariate |
|---|---|---|---|
| trump16 | 0.585751997130 | 0.63150294777 | 0.5393147721024 |
| clinton16 | 0.636470821891 | 0.63150294777 | 0.9463412693325 |
| otherpres16 | 0.035647289876 | 0.63150294777 | 1.2041816952557 |
| romney12 | 0.559379728900 | 0.63150294777 | 0.5688396494840 |
| obama12 | 0.743567621794 | 0.63150294777 | 0.9029509887327 |
| demsen16 | 0.195284466985 | 0.63150294777 | 0.3834763150956 |
| repsen16 | 0.563399591536 | 0.63150294777 | -0.5634050367543 |
| othersen16 | 0.482441423494 | 0.63150294777 | -0.2858281838729 |
| demhouse16 | 1.742885286145 | 0.63150294777 | 0.5681492070191 |
| rephouse16 | 0.390443844733 | 0.63150294777 | 0.5296431047217 |
| otherhouse16 | 0.277166428927 | 0.63150294777 | 0.6911570905665 |
| total_population | 0.485374266285 | 0.63150294777 | 0.4872532869708 |
| white_pct | 0.555369233414 | 0.63150294777 | 0.3409851865292 |
| black_pct | 1.458426115793 | 0.63150294777 | -1.2545806208186 |
| hispanic_pct | 0.203970849739 | 0.63150294777 | 0.7626913277465 |
| nonwhite_pct | 0.306006353575 | 0.63150294777 | 0.5322087600115 |
| foreignborn_pct | 0.539705099022 | 0.63150294777 | 0.4336693730927 |
| female_pct | 0.610347729825 | 0.63150294777 | 0.4323089313331 |
| age29andunder_pct | 0.472453265562 | 0.63150294777 | 0.5049536737941 |
| age65andolder_pct | 0.403374548917 | 0.63150294777 | 0.5885353905113 |
| median_hh_inc | 0.456473898331 | 0.63150294777 | 0.6011229846262 |
| clf_unemploy_pct | 0.761125923726 | 0.63150294777 | -0.0129693833895 |
| lesshs_pct | 0.770957215281 | 0.63150294777 | -0.0098785432455 |
| lesscollege_pct | 0.556661896877 | 0.63150294777 | 0.3748009374647 |
| lesshs_whites_pct | 0.749246848692 | 0.63150294777 | -0.1558774570007 |
| lesscollege_whites_pct | 0.521495353443 | 0.63150294777 | 0.4123318092143 |
| rural_pct | 0.602826246670 | 0.63150294777 | 0.2588820550810 |
| ruralurban_cc | 0.297837745133 | 0.63150294777 | 0.3138831021115 |
| trump.clinton.ratio | 0.631502947772 | 0.63150294777 | 0.6315029477723 |
| trump.obama.ratio | 1.037253853907 | 0.63150294777 | 0.2651589870262 |
| senate.ratio | -0.103308756239 | 0.63150294777 | -0.0583507114855 |
| house.ratio | 0.631502947772 | 0.63150294777 | 0.6315029477723 |
| PerCapitaDollars | -0.111339953111 | 0.63150294777 | 0.7183689921657 |
| RankInState | 0.439511214174 | 0.63150294777 | 0.5414556938257 |
| PctChgFrom2017 | -0.525939132513 | 0.63150294777 | 0.6810736843380 |

The following is the result of a linear regression with intercept. Note that its $R^{2}$ is poorer than that of the random forest.

##### Update, 2020-07-06

In the assessment above, I mentioned using BARTs and GAMs as tools for investigating why

• Number of votes for non-major party presidential candidates in 2016.
• Number of votes for non-major party House candidates in 2016.
• Hispanics or Latinos as a percentage of total population, 2016.
• Personal per capita income, 2018.
• Percent change in per capita income from 2017.

are associated with increases in COVID-19. After thinking on it for a second day: when I devote time to this problem again, I may instead look at determining the causal effect, if any, of these factors upon increases in COVID-19 incidence, perhaps using methods such as propensity scoring. In particular, there are relevant methods in the R package TWANG.