(Updated 20200706, see end.)
So, today I report on a diversion. This is a bit off the beaten path for me. It resulted from an “I wonder” comment by Rob Honeycutt at Tamino‘s “Open Mind” blog, this time offering a post concerning how predictive the party affiliation of a governor of a state might be for the recent performance of the state with respect to COVID19 (titled “COVID19: Red States Blue States Redux”).
Rob supposed:
I would think running the numbers by county would be more accurate, though I guess that’s a much bigger task. CA, for example, is obviously predominantly blue, but there are a lot of small red counties which are now seeing a dramatic rise in cases.
And I replied:
Sounds like a logistic regression problem, although I’d/I’ll do it as a classification problem for a random forest. Anyone know where I can get a tabulation of counties vs Red or Blue status? Otherwise I’ll dig around on the Five Thirty Eight GitHub and see if they have one.
Well I didn’t find anything pertinent at Five Thirty Eight, and I didn’t do it as a classification problem, deciding that after the fact.
I found data sources.
One is the triedandtrue Johns Hopkins Github of COVID19 cases and deaths which I have used elsewhere here. I’ve also discussed sources of COVID19 data in general.
A second is the U.S. Bureau of Economic Analysis and, specifically, their county level personal income and demographics dataset.
And, rounding out the collection, there is MIT’s ElectionLab which offers datasets with demographics showing county level results through the 2018 General Elections. Their data dictionary is here. This provided the missing classification of Red vs Blue counties.
However, this also lets me tout a basic principle of good data science and statistics. This is something I learned from the books and lectures of Professor Frank Harrell, Jr of Vanderbilt University in Nashville, TN. In particular, Professor Harrell (and others) emphasize(s) the damage premature dichotomization can do to a statistical analysis.
So, instead of categorizing counties as Red vs Blue, or performance on COVID19 as good vs bad, I am trying to push everything onto ordered measures and, preferably, onto a continuum measure. Indeed, the more I practice data science, the more I am convinced that, in the setup of experiments or data gathering or analysis, if categorical variables or yes/no variables can be at all avoided, they should be. Rather than measuring whether an Internet customer bought something or not, measure the time they spent considering it! Interpreting the duration of dwell on an item is qualitatively different if they buy it or if they don’t. In the former case, they were willing to devote that amount of time investigating. In the latter, it might be they are meticulous and decided against, or it could be the response was too slow and discouraged them. But even in the meticulous case, it shows they were almost to the point of a buy.
So, for assessing how well a county is doing for COVID19 or not, I took the counts of cases from the county for the last 45 days. I fit a (penalized) smoothing spline to that. I used the smoothing spline to calculate a first derivative for all the days I could. I took the means of those derivatives. And I made that the score of how poorly or well the county was doing. If the mean was positive, the number of cases is going up, and it is doing poorly. If the mean is negative, cases are decreasing.
This all said, why am I clearly drifting my analysis into the world of, oh, politics?
While Five Thirty Eight‘s Nate Silver might claim politics hasn’t as much data available as does sports, a lot of political data exist and are well vetted in comparison to environmental or, as I cringe, Internet data. So it is, in a sense, a joy to work with political data. Most real Internet data, with which in my professional career I have been most deeply associated, needs to be collected by means outside the Internet, and there are interesting investigations and datasets available if that’s done.
Still, a lot of the counts and data pertaining to COVID19 cases and deaths are fuzzy. We don’t necessarily have a reputable tally of sources in all states of, say, the United States, let alone counties, and let alone countries outside the United States. Assessing reputation depends upon knowing the commitment of authorities to presenting counts aligned with time, authorities which want to be transparent (some don’t: I know, that’s shocking), and knowing that rules for reporting are consistent or efforts are made to allow them to be made consistent even if they are changed. Political authorities tend, from my perspective and for the most part, want to do as little as possible and expend the least budget as possible to observe minimal reporting goals, if that.
The core idea was to develop a rich matrix of predictors or covariates consisting of demographic and other data pertaining to United States counties and match that with a response variable which expresses the overall (mean) slope of increases in COVID19 cases. While I know counts of cases of COVID19 aren’t the best measure, with case density per test or population estimates from random sampling being better, it is a widely available measure, and is the measure Tamino chose to use in his original post, namely cases per day per million population. Then, once matched, do a regression of response against predictors and try to develop a model which can be used for variable selection, that is, determining those covariates which are most predictive of increase in COVID19 cases.
I chose to use a random forests modeling approach, on purely numerical outcome and predictors. The outcome was, as described above, taking the last 45 days of confirmed COVID19 counts per county, and fitting them with a smoothing spline. Then the first (time) derivative of that spline was calculated per day of the last 45 days. Then the mean value of the first (time) derivative or slope was calculated. That was used as a response. However, the response was not calculated based upon raw counts of cases. Rather, the rescaled or, as sometimes described, standardized value of case counts over time was used, lest the greater number of people in counties bias their experiences as being more representative. After all, the point is to get a per county picture.
Rescaling was done to covariates, too. In this case this had a beneficial effect: Some of the 3083^{(*)} counties had missing values for some of covariates. Tossing them would have reduced the sample size and imposed a bias because quite a few counties were missing the value for at least one covariate. By rescaling, the mean of the rescaled values became zero, so missing values could be replaced by zeros, and this was equivalent to imputing copies of the mean for the missing values.
What were the covariates? They are described in the data dictionary, but I added a couple more. Here’s the complete list:
 trump16
 Presidential candidate Trump vote total in 2016.
 clinton16
 Presidential candidate Clinton vote total in 2016.
 otherpres16
 Presidential candidate vote total in 2016 for nonmajor party.
 romney12
 Presidential candidate Romney vote total in 2012.
 obama12
 Presidential candidate Obama vote total in 2012.
 demsen16
 Democratic Senate candidate vote total in 2016.
 repsen16
 Republican Senate candidate vote total in 2016.
 othersen16
 Nonmajor parties Senate vote total in 2016.
 demhouse16
 Democratic House candidate vote total in 2016.
 rephouse16
 Republican House candidate vote total in 2016.
 otherhouse16
 Nonmajor parties candidate vote total in 2016.
 total_population
 Total population as of 2016.
 white_pct
 NonHispanic whites as a percentage of total population, 2016.
 black_pct
 NonHispanic blacks as a percentage of total population, 2016.
 hispanic_pct
 Hispanics or Latinos as a percentage of total population, 2016.
 nonwhite_pct
 Nonwhites as a percentage of total population, 2016.
 foreignborn_pct
 Foreignborn population as a percentage of total population.
 female_pct
 Females as a percentage of total population, 2016.
 age29andunder_pct
 Population 29 years or under as a percentage of total population, 2016.
 age65andolder_pct
 Population 65 years or older as a percentage of total population, 2016.
 median_hh_inc
 Median household income in the past 12 months (in 2016 inflationadjusted dollars), 2016.
 clf_unemploy_pct
 Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
 lesshs_pct
 Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
 lesscollege_pct
 Population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
 lesshs_whites_pct
 White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
 lesscollege_whites_pct
 White population with an education of less than a bachelor’s degree as a percentage of total population, 2016.
 rural_pct
 Rural population as a percentage of total population, 2016.
 ruralurban_cc
 Ruralurban continuum codes. See data dictionary for table of specific code values. These run as positive integers from 1 to 9, with 1 representing, for example, “Counties in metro areas of 1 million population or more” and 9 representing “Completely rural or less than 2,500 urban population, adjacent to a metro area”. These are not continuous values, but they are ordered.
 trump.clinton.ratio
 This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes in the county in 2016 for presidential candidate Trump to the sum of Clinton and nonmajor party candidates. Clinton votes are summed with nonmajors because there are no votes in some counties for Clinton.
 trump.obama.ratio
 This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and nonmajor party candidates in 2012. Obama votes are summed with nonmajors because there are no votes in some counties for Obama.
 senate.ratio
 This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes for Republican Senate candidates in the county in 2016 to the sum of votes for Democratic and nonmajor party candidates. Democratic candidates are summed with nonmajor parties because there are no votes for Democrats in some counties.
 house.ratio
 This was a new covariated introduced. Trying to normalize away county population size, this is the ratio of votes for Republican House candidates in the county in 2016 to the sum of votes for Democratic and nonmajor party candidates. Democratic candidates are summed with nonmajor parties because there are no votes for Democrats in some counties.
 PerCapitaDollars

This is from BEA government data. This is the sum of income of residents of the county divided by the number of residents in the county. See the documentation for methodology.
2018 data.  RankInState
 This is from BEA government data. This is the rank of total personal income of residents of the county over all counties in the state to which they are part. Number 1 is the county with the biggest income. 2018 data.
 PctChgFrom2017
 This is the percent change in personal income of residents in a county for 2018 relative to 2017.
^{(*)} Different sources have differing number of counties, and these need to be reconciled. Also, some sources, such as the BEA data, have subtabulations to regions and states. There are also errors. In the MIT data, Virginia for example is listed twice, with different values for, e.g., median household income. Finally, spellings of county names need to be reconciled. A few counties in one list are missing in others. 3083 is the number of common counties between the datasets.
Wright and Ziegler‘s ranger package of R was used to explore these data and perform a regression. Varying numbers of trees were tried, and the final result was based upon a forest of 40,000 trees. Two cores were allocated on a 3.2 GHz 4core Windows 7 system. Up to 6 variables were considered for splitting at each node. The minimum node size was 5, meaning there had to be at least 5 cases passing through each node. The splitting rule was “extra trees”, based upon the option supporting
Geurts, P., Ernst, D., Wehenkel, L. (2006). Extremely randomized trees. Machine Learning 63:342. https://doi.org/10.1007/s1099400662261.
10 random splits were considered for each splitting variable.
There were actually two phases of random forest fitting performed. The first was done to ascertain covariate importance. The second was a regression. Both used 40,000 trees.
In the first phase, permutation importance was used, with importance scaled and local importance values calculated as in:
Breiman, L. (2001). Random forests. Machine Learning, 45:532. https://doi.org/10.1023/A:1010933404324.
Permutation importance was used because this was a regression application. The resulting normalized importance scores sorted in decreasing order of importance were:
normalized variable importance  

demhouse16  black_pct  othersen16  trump.obama.ratio  demsen16  repsen16 
0.077281  0.069349  0.056643  0.053579  0.051434  0.049935 
senate.ratio  otherhouse16  lesshs_whites_pct  lesshs_pct  hispanic_pct  rephouse16 
0.044154  0.041854  0.037017  0.035589  0.033906  0.032685 
nonwhite_pct  white_pct  obama12  rural_pct  clf_unemploy_pct  PerCapitaDollars 
0.030671  0.030221  0.028720  0.025665  0.025534  0.025463 
ruralurban_cc  RankInState  clinton16  otherpres16  median_hh_inc  trump16 
0.024347  0.024095  0.023788  0.020571  0.020259  0.018410 
foreignborn_pct  total_population  romney12  lesscollege_whites_pct  lesscollege_pct  age65andolder_pct 
0.016596  0.016517  0.016443  0.015623  0.015530  0.012926 
age29andunder_pct  PctChgFrom2017  female_pct  trump.clinton.ratio  house.ratio  
0.010310  0.008777  0.006107  0.000000  0.000000 
Covariates below the 0.02 quantile were discarded from further consideration. This eliminated female_pct, trump.clinton.ratio, and house.ratio.
Regression was done with the remaining. The fit was barely adequate^{(**)}, with an , and a rootmeansquared error of 4.64. Here is the original mass density of response values:
Recall these were based upon slopes derived from centered and scaled case counts:
scaledDerivativesFrom< function(setOfSeries)
{
# Expects cases to be in columns, and the series to be column vectors, that is,
# the successive time points are in successive rows.
nr< nrow(setOfSeries)
nc< ncol(setOfSeries)
tocks< 1:nc
scaledSetOfSeries< scale(setOfSeries)
centers< attr(scaledSetOfSeries, "scaled:center")
scalings< attr(scaledSetOfSeries, "scaled:scale")
fdFromSeries< suppressWarnings(fdata(mdata=setOfSeries, argvals=tocks))
splining< suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=0, method="bspline",
class.out="fdata", nbasis=basisSize(nr)))
fst< suppressWarnings(fdata.deriv(fdataobj=fdFromSeries, nderiv=1, method="bspline",
class.out="fdata", nbasis=basisSize(nr)))
fstDerivatives< colMeans(fst$data)
return(list(fst=fstDerivatives, ctr=centers, sca=scalings))
}
and are from the fda.usc package.
After this preparation, the point of regression was to determine which covariates contribute to increases in COVID19 prevalence and which reduce it. In order to determine these, a baseline regression was performed with all rescaled covariates set to their mean values, that is, zeros. Then, because of the scaling, a variation in single standard deviation up or down was increasing or decreasing a covariate value by a unit. Accordingly, two predictions of response was done with all covariates zero apart from a single covariate of interest. All trees in the regression forest were used for these predictions. One prediction was done with the covariate at plus onehalf and the other with the covariate at minus onehalf. Responses were recorded in each case.
If these responses for each covariate were arranged as minus onehalf, baseline, and plus onehalf, there are two cases where the responses were monotonic, either monotonically increasing, or monotonically decreasing. If the responses increased monotonically, that meant the covariate was contributing to the increase in COVID19 incidence. If the responses decreased monotonically, that meant the covariate was impeding growth in COVID19. All other cases were interpreted as equivocal. A full listing of the responses is given below. There were a couple of covariates which had no significant effect upon response.
The code and datasets for this analysis are available here. I did not place these in my Github because some of the datasets and results (including the random forests) are pretty large.
^{(**)} Slightly more than half of the variance was explained. This certainly is not good enough for prediction forecasting.
In the conclusion most of the covariates did not show a monotonic response. Of those which did, some had surprising effects upon response, contrary to what would be thought. In all cases, it shows that whatever is happening at state levels, as Tamino originally hypothesized, the story at county levels are far more nuanced.
According to the regression, the increases in the following covariates are associated with increases in COVID19 incidence:
 Number of votes for nonmajor party presidential candidates in 2016.
 Number of votes for nonmajor party House candidates in 2016.
 Hispanics or Latinos as a percentage of total population, 2016.
 Personal per capita income, 2018.
 Percent change in per capita income from 2017.
According to the regression, increases in the following covariates are associated with decreases in COVID19 incidence:
 Number of votes for Democratic party House candidates in 2016.
 NonHispanic blacks as a percentage of total population, 2016.
 Unemployed population in labor force as a percentage of total population in civilian labor force, 2016.
 Population with an education of less than a regular high school diploma as a percentage of total population, 2016.
 White population with an education of less than a regular high school diploma as a percentage of total population, 2016.
 The ratio of votes in the county for presidential candidate Trump in 2016 to the sum of Obama and nonmajor party candidates in 2012.
The remainder have nonmonotonic effects upon response, with the exception of votes for Trump in 2016, votes for Romney in 2012, and the ratio of Trump votes to sum of Clinton and nonmajor party Presidential candidates in 2016. The latter three had no significant effect upon response.
These are difficult to interpret, certainly from any common wisdom conventional frame. A story could be told about propensity for supporting nonmajority political parties as contrarians. A story could be told about wealth and especially recent wealth. And a more conventional story might be told about support for Democratic House candidates. But the analysis is too weak to build much of a case here. For why is county historical support for Trump over Obama and others associated with decreases in COVID19?
Another analysis, perhaps using a Bayesian additive regression trees or a generalized additive model might reveal more. I have included the result of a straight regression at the bottom of the post.
The evidence indicates that factors contributing to increases in COVID19 across counties are more complicated than state political trends suggest.
The table of predicted responses for all regressed covariates is:
decrease covariate  baseline  increase covariate  

trump16  0.585751997130  0.63150294777  0.5393147721024 
clinton16  0.636470821891  0.63150294777  0.9463412693325 
otherpres16  0.035647289876  0.63150294777  1.2041816952557 
romney12  0.559379728900  0.63150294777  0.5688396494840 
obama12  0.743567621794  0.63150294777  0.9029509887327 
demsen16  0.195284466985  0.63150294777  0.3834763150956 
repsen16  0.563399591536  0.63150294777  0.5634050367543 
othersen16  0.482441423494  0.63150294777  0.2858281838729 
demhouse16  1.742885286145  0.63150294777  0.5681492070191 
rephouse16  0.390443844733  0.63150294777  0.5296431047217 
otherhouse16  0.277166428927  0.63150294777  0.6911570905665 
total_population  0.485374266285  0.63150294777  0.4872532869708 
white_pct  0.555369233414  0.63150294777  0.3409851865292 
black_pct  1.458426115793  0.63150294777  1.2545806208186 
hispanic_pct  0.203970849739  0.63150294777  0.7626913277465 
nonwhite_pct  0.306006353575  0.63150294777  0.5322087600115 
foreignborn_pct  0.539705099022  0.63150294777  0.4336693730927 
female_pct  0.610347729825  0.63150294777  0.4323089313331 
age29andunder_pct  0.472453265562  0.63150294777  0.5049536737941 
age65andolder_pct  0.403374548917  0.63150294777  0.5885353905113 
median_hh_inc  0.456473898331  0.63150294777  0.6011229846262 
clf_unemploy_pct  0.761125923726  0.63150294777  0.0129693833895 
lesshs_pct  0.770957215281  0.63150294777  0.0098785432455 
lesscollege_pct  0.556661896877  0.63150294777  0.3748009374647 
lesshs_whites_pct  0.749246848692  0.63150294777  0.1558774570007 
lesscollege_whites_pct  0.521495353443  0.63150294777  0.4123318092143 
rural_pct  0.602826246670  0.63150294777  0.2588820550810 
ruralurban_cc  0.297837745133  0.63150294777  0.3138831021115 
trump.clinton.ratio  0.631502947772  0.63150294777  0.6315029477723 
trump.obama.ratio  1.037253853907  0.63150294777  0.2651589870262 
senate.ratio  0.103308756239  0.63150294777  0.0583507114855 
house.ratio  0.631502947772  0.63150294777  0.6315029477723 
PerCapitaDollars  0.111339953111  0.63150294777  0.7183689921657 
RankInState  0.439511214174  0.63150294777  0.5414556938257 
PctChgFrom2017  0.525939132513  0.63150294777  0.6810736843380 
The following is a result of a linear regression with intercept. Note that the is poorer than with the random forest.
Update, 20200706
In the assessment above, I mentioned using BARTs and GAMs as tools for investigating why
 Number of votes for nonmajor party presidential candidates in 2016.
 Number of votes for nonmajor party House candidates in 2016.
 Hispanics or Latinos as a percentage of total population, 2016.
 Personal per capita income, 2018.
 Percent change in per capita income from 2017.
are associated with increase in COVID19. After thinking on it for a second day, when I devote time to this problem again, I may instead look at determining the causal effect, if any, of these factors upon increases in COVID19 incidence, perhaps using the methods of propensity scoring or others. In particular, there are methods in the R package TWANG.
Pingback: Political and demographic associations with changes in COVID19 rates at the tails  hypergeometric